# Tokenization

- [HuggingFace Datasets](https://huggingface.co/docs/datasets) is a library for easily accessing and sharing datasets.
- [HuggingFace Tokenizers](https://huggingface.co/docs/tokenizers) is an implementation of today's most used tokenizers, with a focus on performance and versatility.

# Load Dataset

In [18]:
from datasets import load_dataset

In [None]:
# Competition: https://www.kaggle.com/competitions/tweet-sentiment-extraction/overview
dataset = load_dataset("mteb/tweet_sentiment_extraction")

In [None]:
import pandas as pd

In [None]:
df_train = pd.DataFrame({
    "text": dataset["train"]["text"],
    "label": dataset["train"]["label"]
})

In [None]:
df_test = pd.DataFrame({
    "text": dataset["test"]["text"],
    "label": dataset["test"]["label"]
})

In [None]:
df_train

In [None]:
df_test

In [None]:
df_train["label"].unique()

# Using Pretrained Tokenizer

In [None]:
from transformers import AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")

In [None]:
tokens = tokenizer.encode(df_train["text"][5200])
print(tokens)

[0, 76269, 304, 342, 13969, 35, 418, 270, 4772, 500, 832, 1225, 588, 15391, 362, 969, 23594, 16, 342, 13969, 35, 344, 295, 1277, 1026, 2656, 5147, 16765, 396, 342, 3518]


In [None]:
original_text = tokenizer.decode(tokens)
print(original_text)

<｜begin▁of▁sentence｜>Going to IKEA with the roomie so she can shop for her apartment. IKEA is in like my top ten stores that I love


In [None]:
print(len(tokenizer.vocab))

128815
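
The calls above map text to integer IDs and back, with a begin-of-sentence token (ID 0 in the output) prepended. A toy word-level sketch of that round trip follows. The vocabulary, the IDs, and the `encode`/`decode` helpers are invented for illustration; real tokenizers such as the one loaded above split text into subword units rather than whitespace-separated words.

```python
# Toy vocabulary: the tokens and IDs are made up and unrelated to any real tokenizer.
BOS = "<bos>"
UNK = "<unk>"
vocab = {BOS: 0, "going": 1, "to": 2, "ikea": 3, UNK: 4}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def encode(text):
    """Lowercase, split on whitespace, and map each word to its ID."""
    ids = [vocab[BOS]]  # prepend the begin-of-sentence marker
    for word in text.lower().split():
        ids.append(vocab.get(word, vocab[UNK]))  # unknown words share one ID
    return ids

def decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("Going to IKEA")
print(ids)
print(decode(ids))
print(len(vocab))  # vocabulary size of the toy tokenizer
```

Decoding keeps the begin-of-sentence marker, which mirrors the `<｜begin▁of▁sentence｜>` prefix in the decoded tweet above.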


In [None]:
texts = [
    "DeepSeek’s tokenizer works well on English.",
    "Batch tokenization is straightforward with 🤗 Transformers. But the string must be long to show attention mask"
]

# Batch processing
batch = tokenizer(
    texts,
    padding=True,        # pad to the longest sequence in the batch
    truncation=True,     # truncate sequences that exceed model’s max length
    max_length=256,      # optional: set an explicit limit
    return_tensors="pt"  # return PyTorch tensors (use "tf" for TensorFlow)
)

print(batch["input_ids"])      # tensor of token IDs
print(batch["attention_mask"]) # 1 = real token, 0 = padding

tensor([[     1,      1,      1,      1,      1,      1,      1,      1,      1,
              0,  53091,   4374,   1465,    442,     85,  17840,   9160,   2984,
           1585,    377,   3947,     16],
        [     0,  83469,  17840,   1878,    344,  28179,    418, 112838,    248,
          38178,    387,     16,   2275,    270,   3418,   2231,    366,   1606,
            304,   1801,   5671,  16496]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
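
In the output above, the shorter sequence is padded on the left with ID 1, and its attention mask is 0 at exactly those positions. The following is a minimal pure-Python sketch of that behavior, not the library's implementation; `pad_left` is a hypothetical helper and the token IDs are placeholders.

```python
def pad_left(sequences, pad_id):
    """Left-pad every sequence to the length of the longest one.

    Returns (input_ids, attention_mask), where the mask holds 1 for
    real tokens and 0 for padding positions.
    """
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

# Placeholder IDs: 0 plays the role of the begin-of-sentence token, 1 of the pad token.
ids, mask = pad_left([[0, 11, 12], [0, 21, 22, 23, 24]], pad_id=1)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

The mask lets a model ignore the padded positions, so the padding does not affect the representation of the shorter text.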


# Training a Text Classifier (FastAI)

In [1]:
# Download the IMDb dataset
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [2]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

In [3]:
datablock = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
)
dls_lm = datablock.dataloaders(path, path=path, bs=128, seq_len=80)

In [4]:
# Show a few training samples. The dependent variable (text_) is the independent variable (text) shifted one token ahead
dls_lm.show_batch(max_n=3)

Unnamed: 0,text,text_
0,"xxbos xxmaj certain elements of this film are dated , of course . xxmaj an all white male crew , for instance . xxmaj and like most pre - star xxmaj wars xxmaj science xxmaj fiction , it tends to take too long admiring itself . \n\n xxmaj but , still , no movie has ever capture the flavor of xxmaj golden xxmaj age xxmaj science xxmaj fiction as this one did , even down to the use of the","xxmaj certain elements of this film are dated , of course . xxmaj an all white male crew , for instance . xxmaj and like most pre - star xxmaj wars xxmaj science xxmaj fiction , it tends to take too long admiring itself . \n\n xxmaj but , still , no movie has ever capture the flavor of xxmaj golden xxmaj age xxmaj science xxmaj fiction as this one did , even down to the use of the """
1,""" womans xxmaj choice "" ! xxmaj this film will provoke you to reconsider . \n\n xxmaj even though most of the actors have only been in a few films , you will wonder why they have not been cast more often . \n\n xxmaj if you watch this film and are not challenged by its thought provoking message , you need to watch it again because you did not pay attention the first time . xxbos xxmaj perhaps xxmaj","womans xxmaj choice "" ! xxmaj this film will provoke you to reconsider . \n\n xxmaj even though most of the actors have only been in a few films , you will wonder why they have not been cast more often . \n\n xxmaj if you watch this film and are not challenged by its thought provoking message , you need to watch it again because you did not pay attention the first time . xxbos xxmaj perhaps xxmaj i"
2,"in la la land when the rangers jump out of a xxmaj hercules transport at dawn somewhere over the mideast , but then after a water landing they surface in the dark ! xxmaj the continuity errors continue xxunk the pic , costumes and make up change multiple times within scenes . xxmaj but ya know what , it does'nt matter ! xxmaj the script is even more ludicrous . xxmaj after the xxmaj rangers capture a terrorist and bring","la la land when the rangers jump out of a xxmaj hercules transport at dawn somewhere over the mideast , but then after a water landing they surface in the dark ! xxmaj the continuity errors continue xxunk the pic , costumes and make up change multiple times within scenes . xxmaj but ya know what , it does'nt matter ! xxmaj the script is even more ludicrous . xxmaj after the xxmaj rangers capture a terrorist and bring him"
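
In the batch above, each `text_` entry is the `text` entry shifted by one token. A simplified sketch of how such input/target pairs can be built from one long token stream is shown below; `lm_pairs` is a hypothetical helper and does not reproduce fastai's actual batching logic.

```python
def lm_pairs(tokens, seq_len):
    """Cut a token stream into (input, target) windows of length seq_len.

    The target window starts one position later than the input window,
    so the model learns to predict the next token at every position.
    """
    pairs = []
    for start in range(0, len(tokens) - seq_len, seq_len):
        x = tokens[start:start + seq_len]
        y = tokens[start + 1:start + seq_len + 1]
        pairs.append((x, y))
    return pairs

stream = "xxbos xxmaj this film is great . xxbos xxmaj not for me".split()
pairs = lm_pairs(stream, seq_len=4)
for x, y in pairs:
    print(x, "->", y)
```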


In [5]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab).
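
To make the classification view concrete, here is a small standard-library sketch of cross-entropy over a toy vocabulary of four tokens. The logits are invented, and the helpers `softmax` and `cross_entropy` are illustrative rather than fastai's implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct token."""
    return -math.log(softmax(logits)[target_index])

# A model with no preference between 4 tokens pays log(4) per prediction.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)

# A confident model pays little when right and a lot when wrong.
confident_right = cross_entropy([5.0, 0.0, 0.0, 0.0], target_index=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0, 0.0], target_index=1)
print(uniform_loss, confident_right, confident_wrong)
```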

![Diagram of the ULMFiT process](https://raw.githubusercontent.com/fastai/fastbook/e8baa81d89f0b7be816e35f1cc813ac02038db54/images/att_00027.png)

The first arrow has been completed for us and made available as a pretrained model in fastai, and we've just built the DataLoaders and Learner for the second stage. Now we're ready to fine-tune our language model!

`language_model_learner` automatically calls `freeze` when using a pretrained model (which is the default), so this will only train the embeddings (the only part of the model that contains randomly initialized weights—i.e., embeddings for words that are in our IMDb vocab, but aren't in the pretrained model vocab):

In [6]:
learn.fit_one_cycle(1, 2e-2)



epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.007,3.896233,0.30081,49.216682,09:49


Once the initial training has completed, we can continue fine-tuning the model after unfreezing:

In [7]:
learn.unfreeze()
learn.fit_one_cycle(5, 2e-3)  # train all layers for 5 epochs at a lower learning rate



epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.761386,3.750302,0.318114,42.53392,10:07
1,3.675211,3.656346,0.328566,38.719604,09:52
2,3.562186,3.599485,0.33525,36.579395,09:54
3,3.447156,3.568553,0.339424,35.465252,09:42
4,3.365878,3.568156,0.340122,35.451153,09:34
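
The perplexity column in the table above appears to be the exponential of the validation loss. The sketch below checks that relationship against the reported numbers; the values are copied from the table, and the small tolerance allows for rounding in the printed output.

```python
import math

# (valid_loss, perplexity) pairs copied from the training table above.
reported = [
    (3.750302, 42.53392),
    (3.656346, 38.719604),
    (3.599485, 36.579395),
    (3.568553, 35.465252),
    (3.568156, 35.451153),
]

# Largest gap between exp(valid_loss) and the reported perplexity.
max_gap = max(abs(math.exp(loss) - ppl) for loss, ppl in reported)

for loss, ppl in reported:
    print(f"exp({loss}) = {math.exp(loss):.4f}  (reported {ppl})")
```

Because perplexity is a monotonic function of the loss, the two columns always move in the same direction; perplexity is simply easier to read as "the effective number of equally likely next tokens".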


Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary. The model not including the final layer is called the encoder. We can save it with `save_encoder`:

In [8]:
learn.save('finetuned')
learn.save_encoder('finetuned_encoder')

In [None]:
# We can check how we did with Text Generation:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]

In [9]:
print("\n".join(preds))

i liked this movie because it has one of the best performances I 've seen in a long time . It 's a good movie to see if you want to get your heart pumping , or if you 're a fan of
i liked this movie because of the fact that it was longer . It 's about a teenager , and does n't realize what he 's doing . This movie plays on with the spirit of the father . It 's not
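
The `temperature=0.75` argument controls how random the sampled text is. One common formulation divides the logits by the temperature before the softmax; the sketch below assumes that formulation and uses invented logits. It is not fastai's code, and `softmax_with_temperature` and `sample_token` are hypothetical helpers.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperatures sharpen the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for a 4-token vocabulary
probs_default = softmax_with_temperature(logits, 1.0)
probs_cooler = softmax_with_temperature(logits, 0.75)
print(probs_default)
print(probs_cooler)

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
samples = [sample_token(logits, 0.75, rng) for _ in range(10)]
print(samples)
```

With a temperature below 1, the most likely token receives more probability mass, so generated text becomes more conservative; values above 1 have the opposite effect.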


# Classifier Dataloader

In [10]:
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In [11]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj by now you 've probably heard a bit about the new xxmaj disney dub of xxmaj miyazaki 's classic film , xxmaj laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky . xxmaj during late summer of 1998 , xxmaj disney released "" kiki 's xxmaj delivery xxmaj service "" on video which included a preview of the xxmaj laputa dub saying it was due out in "" 1 xxrep 3 9 "" . xxmaj it 's obviously way past that year now , but the dub has been finally completed . xxmaj and it 's not "" laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky "" , just "" castle xxmaj in xxmaj the xxmaj sky "" for the dub , since xxmaj laputa is not such a nice word in xxmaj spanish ( even though they use the word xxmaj laputa many times",pos
2,"xxbos xxmaj some have praised _ xxunk _ as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n\n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the "" crack staff "" of many older adventure movies has been done well before , ( think _ the xxmaj dirty xxmaj dozen _ ) but _ atlantis _ represents one of the worse films in that motif . xxmaj the characters are weak . xxmaj even the background that each member trots out seems stock and awkward at best . xxmaj an xxup md / xxmaj medicine xxmaj man , a tomboy mechanic whose father always wanted sons , if we have not at least seen these before",neg
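
The batches above contain special tokens such as `xxbos` (start of a text), `xxmaj` (next word was capitalized), `xxup` (next word was all caps), and `xxrep` (a repeated character, as in `1 xxrep 3 9` for "1999"). The following is a deliberately simplified sketch of rules of this kind. The helper names are hypothetical, and the logic only approximates what fastai's tokenizer does.

```python
import re

def mark_repeats(text):
    """Replace a character repeated 3+ times with ' xxrep <count> <char> '."""
    def _sub(match):
        char, run = match.group(1), match.group(0)
        return f" xxrep {len(run)} {char} "
    return re.sub(r"(\S)\1{2,}", _sub, text)

def mark_case(text):
    """Lowercase every word, recording the original casing with a marker token."""
    out = []
    for word in text.split():
        if len(word) > 1 and word.isupper():
            out += ["xxup", word.lower()]    # all-caps word
        elif len(word) > 1 and word[0].isupper():
            out += ["xxmaj", word.lower()]   # capitalized word
        else:
            out.append(word.lower())
    return " ".join(out)

def toy_preprocess(text):
    """Apply the toy rules and prepend the beginning-of-stream marker."""
    return "xxbos " + mark_case(mark_repeats(text))

result = toy_preprocess("IKEA is in 1999 Spain")
print(result)
```

Marker tokens like these let the vocabulary store a single lowercase form of each word while still preserving casing and repetition information.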


In [12]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()

In [13]:
# Load the fine-tuned language-model encoder; the classification head on top of it is newly initialized.
learn = learn.load_encoder('/content/finetuned_encoder')

# Fine-Tuning the Classifier

In [14]:
learn.fit_one_cycle(1, 2e-2)



epoch,train_loss,valid_loss,accuracy,time
0,0.294827,0.224858,0.91072,00:15


In [15]:
# freeze_to(-2): freeze all parameter groups except the last two
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))



epoch,train_loss,valid_loss,accuracy,time
0,0.262539,0.206517,0.91864,00:18
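
The `slice(1e-2/(2.6**4), 1e-2)` argument requests discriminative learning rates: the earliest parameter group trains at the lowest rate and the last group at the highest. The sketch below assumes the rates are spread geometrically across the groups and that the classifier has five parameter groups; both are assumptions for illustration, and `geometric_lrs` is a hypothetical helper rather than a fastai function.

```python
def geometric_lrs(lowest, highest, n_groups):
    """Spread learning rates geometrically from lowest to highest."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))  # constant ratio between groups
    return [lowest * step ** i for i in range(n_groups)]

# Assumption: 5 parameter groups, so each group trains 2.6x faster than the one before it.
lrs = geometric_lrs(1e-2 / 2.6**4, 1e-2, n_groups=5)
for group_index, lr in enumerate(lrs):
    print(group_index, f"{lr:.6f}")
```

Under these assumptions the divisor `2.6**4` is what makes the ratio between neighboring groups exactly 2.6.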


In [16]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))



epoch,train_loss,valid_loss,accuracy,time
0,0.22858,0.181586,0.93096,00:20
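
The sequence of `freeze_to(-2)`, `freeze_to(-3)`, and `unfreeze()` calls in this section unfreezes the model gradually, from the last parameter group backwards. A small sketch of which groups end up trainable at each step is shown below; the group names and the `trainable_mask` helper are hypothetical, and the sketch treats `unfreeze()` as equivalent to `freeze_to(0)`.

```python
def trainable_mask(n_groups, n):
    """Return True for groups that stay trainable after freeze_to(n).

    Groups before index n are frozen; a negative n counts from the end.
    """
    first_trainable = n if n >= 0 else n_groups + n
    return [i >= first_trainable for i in range(n_groups)]

groups = ["embeddings", "rnn_1", "rnn_2", "rnn_3", "head"]  # hypothetical names

# -1: only the last group, then -2, -3, and finally 0 (everything trainable).
for n in (-1, -2, -3, 0):
    mask = trainable_mask(len(groups), n)
    print(n, [name for name, trainable in zip(groups, mask) if trainable])
```

Unfreezing a little at a time lets the newly initialized head settle before the pretrained layers start changing.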


In [17]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))



epoch,train_loss,valid_loss,accuracy,time
0,0.197066,0.173745,0.93336,00:24
1,0.175426,0.17397,0.93516,00:24
