In [4]:
pip install datasets

Collecting datasets
  Downloading datasets-4.4.2-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-win_amd64.whl.metadata (3.3 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.6.0-cp312-cp312-win_amd64.whl.metadata (13 kB)
Collecting multiprocess<0.70.19 (from datasets)
  Downloading multiprocess-0.70.18-py312-none-any.whl.metadata (7.5 kB)
Collecting dill<0.4.1,>=0.3.0 (from datasets)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Downloading datasets-4.4.2-py3-none-any.whl (512 kB)
Downloading multiprocess-0.70.18-py312-none-any.whl (150 kB)
Downloading dill-0.4.0-py3-none-any.whl (119 kB)
Downloading pyarrow-22.0.0-cp312-cp312-win_amd64.whl (28.0 MB)

In [17]:
import pandas as pd
from datasets import load_dataset

dataset = load_dataset("mteb/tweet_sentiment_extraction")
df = pd.DataFrame(dataset['train'])

## NOTE
dataset = load_dataset("---") does not return a DataFrame. It returns this:

    DatasetDict({
        train: Dataset(...),
        test: Dataset(...)
    })

So dataset is a collection of splits, and each split (train, test, etc.) is its own dataset.

The training split is dataset['train'].
dataset['train'] returns a Hugging Face Dataset object, not a pandas DataFrame.
It behaves like a list of dictionaries, where each row is a dictionary like this:

    {
      'text': 'I love this!',
      'label_text': 'positive',
      ...
    }

    df = pd.DataFrame(dataset['train'])  # here we convert that split to a DataFrame

In [16]:

print(dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 26732
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 3432
    })
})


In [20]:
df.head()

           id                                              text  label label_text
0  cb774db0d1               I`d have responded, if I were going      1    neutral
1  549e992a42     Sooo SAD I will miss you here in San Diego!!!      0   negative
2  088c60f138                         my boss is bullying me...      0   negative
3  9642c003ef                    what interview! leave me alone      0   negative
4  358bd9e861  Sons of ****, why couldn`t they put them on t...      0   negative


## Step 3: Tokenizer
Now that we have our dataset, we need a tokenizer to convert it into a form the model can consume.

LLMs work with tokens rather than raw text, so the next step is to load a pre-trained tokenizer and tokenize the dataset so it can be used for fine-tuning. To process the whole dataset in one step, use the Datasets map method to apply a preprocessing function to every example.

In [None]:
#tokenizer
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

In [None]:
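# Tokenizer
from datasets import load_dataset
from transformers import GPT2Tokenizer

# Load the dataset used to fine-tune the model
dataset = load_dataset("mteb/tweet_sentiment_extraction")

# GPT-2 ships without a pad token, so reuse the end-of-sequence token for padding
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token


def tokenize_function(examples):
    # With no explicit max_length, padding="max_length" pads every example
    # to the tokenizer's model_max_length (1024 tokens for GPT-2)
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# batched=True passes a dict of lists (many rows at once) to tokenize_function
tokenized_datasets = dataset.map(tokenize_function, batched=True)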

## Explanation
## 1. What is a tokenizer? (Intuition first)

LLMs do not understand text. They only understand numbers. So we need a way to convert
"I love NLP" into something like:
[40, 1842, 318]
That converter is called a tokenizer.

### What is a token?
A token is not always a word.
Depending on the tokenizer, a token can be:
  A word: "love"
  A subword: "lov" + "e"
  A character: "l", "o", "v", "e"
  Or even punctuation and spaces

Example (GPT-2 tokenizer):

"I love NLP"
→ ["I", "Ġlove", "ĠNL", "P"]
→ [40, 1842, 16906, 47]
(The Ġ means "space before this token")
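
To make the idea of subword tokens concrete, here is a minimal sketch of how BPE-style merging builds tokens up from single characters. The merge table TOY_MERGES and the function bpe_encode are made up for illustration. GPT-2's real tokenizer works on bytes and uses a much larger merge table learned from data, so treat this as a picture of the mechanism and not as GPT-2's actual behavior.

```python
def bpe_encode(word, merges):
    """Split a word into characters, then repeatedly merge the adjacent pair
    that appears earliest in the merge table, until no listed pair remains."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is in the merge table
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, in the order they would have been learned
TOY_MERGES = [("l", "o"), ("lo", "v"), ("lov", "e")]

print(bpe_encode("love", TOY_MERGES))   # ['love']
print(bpe_encode("lover", TOY_MERGES))  # ['love', 'r']
print(bpe_encode("glove", TOY_MERGES))  # ['g', 'love']
```

A frequent string such as "love" ends up as a single token, while rarer words are split into a known piece plus leftovers. This is why a token can be a word, a subword, or a character.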

### What does a tokenizer produce?
Usually three things:
  input_ids → numerical representation of tokens
  attention_mask → which tokens are real vs padding
  (sometimes) token_type_ids

Example output:

    {
      "input_ids": [40, 1842, 318, 50256, 50256],
      "attention_mask": [1, 1, 1, 0, 0]
    }

## 2. Why do we tokenize when fine-tuning an LLM?
🔴 Raw text cannot be used by neural networks

Neural networks only operate on numbers.
So this is impossible:
model("I love NLP")  ❌
This is required:
model([40, 1842, 318])  ✅

During fine-tuning, tokenization ensures:

### 1. Text becomes model-readable
Tokens → embeddings → transformer layers

### 2. Consistency with pretraining
You must use the same tokenizer the model was pretrained with.
For GPT-2, that is the BPE vocabulary loaded by GPT2Tokenizer.
Using a different tokenizer would:
  Map tokens to the wrong rows of the learned embedding matrix
  Destroy model performance
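
A small sketch of why a mismatched tokenizer breaks the learned embeddings. Everything here is hypothetical: the two three-token vocabularies, the string stand-ins for embedding vectors, and the lookup helper. The point is only that the embedding table is indexed by token ID, so a different token-to-ID mapping fetches the wrong rows.

```python
# Hypothetical vocabulary the model was pretrained with
pretrain_vocab = {"I": 0, "love": 1, "NLP": 2}

# Stand-ins for learned embedding rows: row i belongs to pretraining ID i
embedding_rows = ["vec(I)", "vec(love)", "vec(NLP)"]

# A different tokenizer assigns different IDs to the same tokens
other_vocab = {"NLP": 0, "I": 1, "love": 2}


def lookup(tokens, vocab):
    """Map tokens to IDs with `vocab`, then fetch the model's embedding rows."""
    return [embedding_rows[vocab[token]] for token in tokens]


tokens = ["I", "love", "NLP"]
print(lookup(tokens, pretrain_vocab))  # ['vec(I)', 'vec(love)', 'vec(NLP)']
print(lookup(tokens, other_vocab))     # ['vec(love)', 'vec(NLP)', 'vec(I)']
```

With the second vocabulary the model still runs, but every token is fed the embedding of some other token, so what it learned during pretraining no longer applies.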

### 3. Fixed input size (padding & truncation)
Models expect uniform tensor sizes in batches.
So we:
  Pad shorter sentences
  Truncate longer ones
Example:
Max length = 10

"I love NLP"
→ [I, love, NLP, <PAD>, <PAD>, ...]
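
Padding and truncation can be sketched in a few lines of plain Python. The helper pad_or_truncate is a hypothetical name, not a Transformers function. It pads on the right and uses 50256, the ID of GPT-2's end-of-sequence token, as the pad ID. The real tokenizer does the equivalent internally when called with padding and truncation enabled.

```python
GPT2_EOS_ID = 50256  # GPT-2's end-of-sequence token ID, reused here as padding


def pad_or_truncate(token_ids, max_length, pad_id=GPT2_EOS_ID):
    """Return fixed-length input_ids plus the matching attention_mask."""
    kept = token_ids[:max_length]        # truncate if too long
    n_pad = max_length - len(kept)       # padding needed if too short
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([40, 1842, 318], max_length=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short)  # {'input_ids': [40, 1842, 318, 50256, 50256], 'attention_mask': [1, 1, 1, 0, 0]}
print(long_)  # {'input_ids': [1, 2, 3, 4, 5], 'attention_mask': [1, 1, 1, 1, 1]}
```

Both results have length 5, so they can be stacked into one tensor. The zeros in attention_mask mark the padded positions that the model should ignore.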

### 4. Efficient batching on GPU

Same length → faster matrix operations → efficient training

## 3. Explanation of the code
🔹 Import tokenizer

    from transformers import GPT2Tokenizer

You import the tokenizer used by GPT-2.
This tokenizer:
    Uses Byte Pair Encoding (BPE)
    Has a fixed vocabulary
    Matches GPT-2's embeddings

🔹 Load dataset

    dataset = load_dataset("mteb/tweet_sentiment_extraction")

This loads a DatasetDict:

    {
      "train": Dataset,
      "test": Dataset
    }

Each row contains:

    {
      "text": "...",
      "label_text": "positive",
      ...
    }

🔹 Load pretrained tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

What happens here:
  Downloads GPT-2 tokenizer config
  Loads:
    Vocabulary
    Merge rules
    Special tokens
Now the tokenizer knows how GPT-2 splits text.

🔹 Set padding token

    tokenizer.pad_token = tokenizer.eos_token

Why is this needed?
GPT-2 does not have a pad token by default.
But padding is required for batching.
So we reuse:
  <|endoftext|> (GPT-2's end-of-sequence token)
as padding.

This is very common for GPT-style models.

🔹 Define tokenization function

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

What is examples?
Because of batched=True, examples looks like:

    {
      "text": [
        "I love this movie",
        "This is bad"
      ]
    }
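
The shape of examples is easy to get wrong, so here is a minimal sketch of the row-to-batch layout change. Both rows_to_batch and count_characters are hypothetical helpers written for this illustration. count_characters stands in for tokenize_function and shows the contract a batched function follows: it receives a dict of lists and returns a dict of lists with one entry per input row.

```python
def rows_to_batch(rows):
    """Turn a list of row dicts into one dict of column lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def count_characters(examples):
    """Stand-in for tokenize_function: one output value per input text."""
    return {"n_chars": [len(text) for text in examples["text"]]}


rows = [
    {"text": "I love this movie", "label_text": "positive"},
    {"text": "This is bad", "label_text": "negative"},
]
examples = rows_to_batch(rows)

print(examples["text"])            # ['I love this movie', 'This is bad']
print(count_characters(examples))  # {'n_chars': [17, 11]}
```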

What does this line do?

    tokenizer(
      examples["text"],
      padding="max_length",
      truncation=True
    )

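It:

    Converts text → tokens
    Converts tokens → IDs
    Pads every sequence to the maximum length (with no max_length passed, this is the tokenizer's model_max_length, 1024 for GPT-2)
    Truncates anything longer than that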

Returns:

    {
      "input_ids": [...],
      "attention_mask": [...]
    }

🔹 Apply tokenization to entire dataset

    tokenized_datasets = dataset.map(tokenize_function, batched=True)

What happens internally?
Hugging Face:

    Iterates over train and test
    Applies tokenize_function to batches of rows
    Adds the returned keys as new columns

Now each row looks like:

      {
        "text": "I love this!",
        "label_text": "positive",
        "input_ids": [...],
        "attention_mask": [...]
      }
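
As a rough mental model of what a batched map does, the sketch below walks over each split in slices, calls the function on each slice, and appends whatever it returns as new columns. toy_map and fake_tokenize are hypothetical names, and the data is held in plain Python dicts. The real Datasets implementation works on Arrow tables and caches its results, so this only mirrors the visible behavior.

```python
def toy_map(splits, function, batch_size=2):
    """Apply a batched function to every split and add its outputs as new columns.

    `splits` maps split names to column dicts, e.g. {"train": {"text": [...]}}.
    """
    result = {}
    for name, columns in splits.items():
        n_rows = len(next(iter(columns.values())))
        new_columns = {key: list(values) for key, values in columns.items()}
        for start in range(0, n_rows, batch_size):
            batch = {
                key: values[start:start + batch_size]
                for key, values in columns.items()
            }
            for key, values in function(batch).items():
                new_columns.setdefault(key, []).extend(values)
        result[name] = new_columns
    return result


def fake_tokenize(examples):
    """Stand-in for tokenize_function: word counts instead of token IDs."""
    return {"n_words": [len(text.split()) for text in examples["text"]]}


splits = {
    "train": {"text": ["I love this!", "my boss is bullying me...", "so sad"]},
    "test": {"text": ["what interview! leave me alone"]},
}
mapped = toy_map(splits, fake_tokenize)

print(mapped["train"]["n_words"])  # [3, 5, 2]
print(mapped["test"]["n_words"])   # [5]
```

The original text column is kept and the new column sits next to it, which matches the row layout described above.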

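🔹 Why use .map()?

Because it:
  Is fast (with batched=True, many rows are tokenized per call)
  Is memory-efficient (results are stored in Arrow format instead of Python lists)
  Works directly on datasets
  Avoids pandas overhead

This is preferred over manually looping.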

Final mental model
  Raw text
    ↓
  Tokenizer
    ↓
  Token IDs + Attention Mask
    ↓
  Model embeddings
    ↓
  Transformer layers
    ↓
  Fine-tuned LLM