<a href="https://colab.research.google.com/github/CrisMcode111/DI_Bootcamp/blob/main/w6_d3_daillychallengePreprocess_%26_fine_tune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

👩‍🏫 👩🏿‍🏫 What You’ll learn
In this daily challenge, you will learn how to preprocess and fine-tune transformer-based models, specifically BERT and XLM-RoBERTa, for text classification tasks. You will gain an understanding of:

- How tokenization works for these models.
- How to properly format input data.
- How to fine-tune transformer models for classification tasks.
- How to perform cross-validation using k-fold splitting.


🛠️ What you will create
By the end of this challenge, you will have a fine-tuned transformer model (BERT or XLM-RoBERTa) capable of classifying text into different categories. Additionally, you will structure the data for training, validate it using cross-validation, and understand how to optimize these models for better performance.



Dataset
You can find the dataset for this exercise here



Task


1. Understanding BERT and XLM-RoBERTa
Objective: Learn how transformer models work and their role in NLP tasks.

Instructions:

Read through the descriptions of BERT and XLM-RoBERTa.

In [2]:
!wget https://github.com/devtlv/Datasets-GEN-AI-Bootcamp/raw/refs/heads/main/Week%206/W6D1%20GenAi%20France/Basics%20of%20BERT%20and%20XLM-RoBERTa%20-%20PyTorch%20-%202.zip

--2025-11-12 13:49:58--  https://github.com/devtlv/Datasets-GEN-AI-Bootcamp/raw/refs/heads/main/Week%206/W6D1%20GenAi%20France/Basics%20of%20BERT%20and%20XLM-RoBERTa%20-%20PyTorch%20-%202.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/devtlv/Datasets-GEN-AI-Bootcamp/refs/heads/main/Week%206/W6D1%20GenAi%20France/Basics%20of%20BERT%20and%20XLM-RoBERTa%20-%20PyTorch%20-%202.zip [following]
--2025-11-12 13:49:58--  https://media.githubusercontent.com/media/devtlv/Datasets-GEN-AI-Bootcamp/refs/heads/main/Week%206/W6D1%20GenAi%20France/Basics%20of%20BERT%20and%20XLM-RoBERTa%20-%20PyTorch%20-%202.zip
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443.

In [3]:
!unzip 'Basics of BERT and XLM-RoBERTa - PyTorch - 2'

Archive:  Basics of BERT and XLM-RoBERTa - PyTorch - 2.zip
  inflating: Basics of BERT and XLM-RoBERTa - PyTorch/sample_submission.csv  
 extracting: Basics of BERT and XLM-RoBERTa - PyTorch/test.csv.zip  
 extracting: Basics of BERT and XLM-RoBERTa - PyTorch/train.csv.zip  


In [4]:
!unzip 'Basics of BERT and XLM-RoBERTa - PyTorch/train.csv.zip' -d dataset/
!unzip 'Basics of BERT and XLM-RoBERTa - PyTorch/test.csv.zip' -d dataset/


Archive:  Basics of BERT and XLM-RoBERTa - PyTorch/train.csv.zip
  inflating: dataset/train.csv       
Archive:  Basics of BERT and XLM-RoBERTa - PyTorch/test.csv.zip
  inflating: dataset/test.csv        


In [5]:
import pandas as pd

train = pd.read_csv('dataset/train.csv')
test = pd.read_csv('dataset/test.csv')
print(train.shape, test.shape)
train.head()


(12120, 6) (5195, 5)


| | id | premise | hypothesis | lang_abv | language | label |
|---|---|---|---|---|---|---|
| 0 | 5130fd2cb5 | and these comments were considered in formulat... | The rules developed in the interim were put to... | en | English | 0 |
| 1 | 5b72532a0b | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... | en | English | 2 |
| 2 | 3931fbe82a | Des petites choses comme celles-là font une di... | J'essayais d'accomplir quelque chose. | fr | French | 0 |
| 3 | 5622f0c60b | you know they can't really defend themselves l... | They can't defend themselves because of their ... | en | English | 0 |
| 4 | 86aaa48b45 | ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด... | เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร | th | Thai | 1 |


# Step 1 – Understanding BERT and XLM-RoBERTa

**BERT** (Bidirectional Encoder Representations from Transformers) is an encoder-only Transformer model pre-trained with two objectives: *Masked Language Modeling* (MLM) and *Next Sentence Prediction* (NSP).  
**XLM-RoBERTa** is a multilingual variant of RoBERTa (a more robustly optimized BERT), pre-trained on text in 100 languages with the MLM objective only, without the NSP task.  
Both use subword tokenization (BERT uses WordPiece, XLM-R uses SentencePiece) and wrap the input in special tokens (`[CLS]` and `[SEP]` for BERT, `<s>` and `</s>` for XLM-R).  
Their tokenizers convert raw text into token IDs and attention masks so the model can process it.
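
To make the Masked Language Modeling objective concrete, here is a toy sketch of the corruption rule described for BERT pre-training: roughly 15% of positions are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The helper `mask_tokens` and the tiny stand-in vocabulary are assumptions made for illustration only; this is not how the `transformers` library implements it.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) using the 80/10/10 masking rule.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

# WordPiece tokens of the example sentence used in this notebook
tokens = ["Raw", "desse", "##rts", "without", "nu", "##ts",
          "are", "del", "##icio", "##us", "!"]

# The sentence's own tokens stand in for a real vocabulary (toy assumption)
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=0)
print(corrupted)
print(labels)
```

During pre-training the model sees `corrupted` and is trained to recover the tokens stored in `labels`; positions with `None` do not contribute to the loss.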


In [6]:
from transformers import BertTokenizer, XLMRobertaTokenizer
from pprint import pprint

text = "Raw desserts without nuts are delicious!"

# BERT tokenizer (WordPiece)
bert_tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert_enc = bert_tok(text)

print("=== BERT tokenizer ===")
print("Tokens:", bert_tok.tokenize(text))
print("Input IDs:", bert_enc["input_ids"])
print("Special tokens: [CLS] ... [SEP]\n")

# XLM-RoBERTa tokenizer (SentencePiece)
xlmr_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
xlmr_enc = xlmr_tok(text)

print("=== XLM-RoBERTa tokenizer ===")
print("Tokens:", xlmr_tok.tokenize(text))
print("Input IDs:", xlmr_enc["input_ids"])
print("Special tokens: <s> ... </s>")


=== BERT tokenizer ===
Tokens: ['Raw', 'desse', '##rts', 'without', 'nu', '##ts', 'are', 'del', '##icio', '##us', '!']
Input IDs: [101, 30712, 23633, 26215, 13663, 11085, 10806, 10301, 10127, 38036, 10251, 106, 102]
Special tokens: [CLS] ... [SEP]



=== XLM-RoBERTa tokenizer ===
Tokens: ['▁Raw', '▁dessert', 's', '▁without', '▁nu', 'ts', '▁are', '▁de', 'li', 'cious', '!']
Input IDs: [0, 109997, 99015, 7, 15490, 315, 933, 621, 8, 150, 60744, 38, 2]
Special tokens: <s> ... </s>


**Conclusion:**  
Both tokenizers split text into subwords, add their own special tokens, and return `input_ids` together with an `attention_mask`.  
For multilingual text (including Romanian), **XLM-RoBERTa** usually performs better than multilingual BERT, so it will be used for fine-tuning in the next steps.


2. Tokenizing Text
Objective: Understand how to tokenize text using pre-trained tokenizers.

Instructions:

* Use the BertTokenizer and XLMRobertaTokenizer to convert sentences into tokenized input.
* Explore the different token types, such as input_ids, attention_mask, and labels.
* Experiment with single-sentence and two-sentence tokenization.

Functions to use:

* tokenizer.encode_plus()
* tokenizer.decode()

# Step 2 – Tokenizing Text

**Goal:** Learn how to tokenize sentences with pre-trained tokenizers.  
We’ll use `BertTokenizer` and `XLMRobertaTokenizer` to produce:
- `input_ids` → numerical IDs for each token  
- `attention_mask` → 1 for real tokens, 0 for padding  
- `token_type_ids` → sentence A/B indicators (only for BERT)  
We’ll also decode the IDs back to text to see the difference between single- and two-sentence inputs.


In [7]:
from transformers import BertTokenizer, XLMRobertaTokenizer
from pprint import pprint

text_a = "Raw desserts without nuts are delicious!"
text_b = "They are also very healthy."

# --- BERT tokenizer (uses token_type_ids) ---
bert_tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Single-sentence tokenization
single = bert_tok.encode_plus(
    text_a,
    add_special_tokens=True,
    return_attention_mask=True,
    return_tensors=None
)

# Two-sentence tokenization
pair = bert_tok.encode_plus(
    text_a,
    text_b,
    add_special_tokens=True,
    return_attention_mask=True,
    return_tensors=None
)

print("=== BERT single sentence ===")
pprint(single)
print("\n=== BERT two sentences ===")
pprint(pair)
print("\nDecoded two-sentence text:")
print(bert_tok.decode(pair["input_ids"]))

# --- XLM-RoBERTa tokenizer (no token_type_ids) ---
xlmr_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

xlmr_single = xlmr_tok.encode_plus(text_a, add_special_tokens=True)
xlmr_pair   = xlmr_tok.encode_plus(text_a, text_b, add_special_tokens=True)

print("\n=== XLM-R single sentence ===")
pprint(xlmr_single)
print("\n=== XLM-R two sentences ===")
pprint(xlmr_pair)
print("\nDecoded two-sentence text:")
print(xlmr_tok.decode(xlmr_pair["input_ids"]))


=== BERT single sentence ===
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101,
               30712,
               23633,
               26215,
               13663,
               11085,
               10806,
               10301,
               10127,
               38036,
               10251,
               106,
               102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

=== BERT two sentences ===
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101,
               30712,
               23633,
               26215,
               13663,
               11085,
               10806,
               10301,
               10127,
               38036,
               10251,
               106,
               102,
               11696,
               10301,
               10379,
               12558,
               89601,
               119,
               102],
 'token_type_ids': [0, 0, 0, 0

**Conclusion:**  
* `encode_plus()` returns a dictionary-like object with everything the model needs as input. Because `return_tensors=None`, the values here are plain Python lists rather than tensors:
  - `input_ids` – numeric token IDs  
  - `attention_mask` – which tokens should be attended to  
  - `token_type_ids` – distinguish sentence A/B (BERT only)  
* `decode()` reconstructs text from token IDs.  

BERT adds `[CLS]` and `[SEP]`; XLM-RoBERTa uses `<s>` and `</s>`.  
Next step: apply tokenization to the real training dataset.
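
The two-sentence layout can be reproduced by hand. The sketch below rebuilds the BERT pair from the token IDs printed in this step and shows where `token_type_ids` switches from 0 to 1. `build_pair` is a hypothetical helper, not part of `transformers`; the XLM-R branch assumes the RoBERTa-style pair layout `<s> A </s></s> B </s>` with `<s>` = 0 and `</s>` = 2 (the IDs visible in the Step 1 output), and its content IDs are placeholders.

```python
def build_pair(ids_a, ids_b, model="bert"):
    """Lay out two already-tokenized sentences the way each model family expects.

    ids_a / ids_b are content token IDs WITHOUT special tokens.
    Hypothetical helper for illustration; real code should call the tokenizer.
    """
    if model == "bert":
        cls_id, sep_id = 101, 102              # [CLS] and [SEP], as in the printed input_ids
        input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
        # Segment 0 covers [CLS] + A + [SEP]; segment 1 covers B + [SEP]
        token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    elif model == "xlmr":
        bos_id, eos_id = 0, 2                  # <s> and </s>
        # RoBERTa-style pairs use a doubled separator between the sentences
        input_ids = [bos_id] + ids_a + [eos_id, eos_id] + ids_b + [eos_id]
        token_type_ids = None                  # XLM-R does not use segment IDs
    else:
        raise ValueError(f"unknown model family: {model}")

    encoding = {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}
    if token_type_ids is not None:
        encoding["token_type_ids"] = token_type_ids
    return encoding

# Content IDs copied from the BERT output above (special tokens removed)
bert_a = [30712, 23633, 26215, 13663, 11085, 10806, 10301, 10127, 38036, 10251, 106]
bert_b = [11696, 10301, 10379, 12558, 89601, 119]

pair = build_pair(bert_a, bert_b, model="bert")
print(pair["input_ids"])
print(pair["token_type_ids"])

# Placeholder content IDs, only to show the XLM-R layout
xlmr_pair = build_pair([11, 12], [21, 22, 23], model="xlmr")
print(xlmr_pair["input_ids"])
```

For the BERT pair, the first 13 positions (`[CLS]`, 11 tokens of sentence A, `[SEP]`) belong to segment 0 and the remaining 7 to segment 1.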


3. Preparing Input Data for the Model
Objective: Format input data correctly for transformer models.

Instructions:

* Ensure that input sentences are padded and possibly truncated to max_length.
* Understand and set special tokens such as <s> and </s>.
* Learn about attention_mask and how it helps the model ignore padding tokens.

Functions to use:

* tokenizer.encode_plus()
* tokenizer.special_tokens_map
* tokenizer.vocab_size

# Step 3 – Preparing Input Data for the Model

**Goal:** Format text correctly before sending it to a Transformer model.

Each input needs:
- `input_ids`: the token IDs of the words/subwords.
- `attention_mask`: tells the model which tokens are real (1) and which are padding (0).
- Special tokens: `<s>` and `</s>` for XLM-R, `[CLS]` and `[SEP]` for BERT.

We must also pad or truncate sequences to a fixed `max_length` so that every sequence in a batch has the same length.
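
Padding, truncation, and the attention mask can be sketched without any library. The example below uses the XLM-R content IDs printed in Step 1 and assumes XLM-R's special-token IDs are `<s>` = 0, `<pad>` = 1, and `</s>` = 2; the pad ID is not shown in this notebook, so check it with `tokenizer.pad_token_id`. `pad_or_truncate` is a hypothetical helper that mimics right-side padding and truncation for a single sentence.

```python
def pad_or_truncate(content_ids, max_length, bos_id=0, eos_id=2, pad_id=1):
    """Wrap content IDs in <s> ... </s>, then truncate or right-pad to max_length.

    Defaults are assumed XLM-R special-token IDs (<s>=0, <pad>=1, </s>=2).
    Hypothetical helper for illustration only.
    """
    room = max_length - 2                     # two slots are reserved for <s> and </s>
    kept = content_ids[:room]                 # truncation drops content, never the special tokens
    input_ids = [bos_id] + kept + [eos_id]
    n_pad = max_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad   # 1 = real token, 0 = padding
    input_ids = input_ids + [pad_id] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# XLM-R content IDs for "Raw desserts without nuts are delicious!" (from Step 1)
content = [109997, 99015, 7, 15490, 315, 933, 621, 8, 150, 60744, 38]

padded = pad_or_truncate(content, max_length=16)     # shorter than 16: padded
truncated = pad_or_truncate(content, max_length=8)   # longer than 8: truncated

print(padded["input_ids"])
print(padded["attention_mask"])
print(truncated["input_ids"])
print(truncated["attention_mask"])
```

With `max_length=16` the 13 real tokens are followed by 3 padding positions whose mask is 0; with `max_length=8` only the first 6 content tokens survive between `<s>` and `</s>`.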


In [None]:
from transformers import BertTokenizer, XLMRobertaTokenizer
from pprint import pprint

# Example text samples
texts = [
    "Raw desserts without nuts are delicious!",
    "They are healthy and full of flavor.",
    "Learning how to prepare them is fun!"
]

# --- Choose tokenizer (try both if you want) ---
xlmr_tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
bert_tok = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Inspect the special tokens
print("=== Special tokens ===")
print("XLM-R:", xlmr_tok.special_tokens_map)
print("BERT:", bert_tok.special_tokens_map)

# Choose one tokenizer for the demonstration (XLM-R, recommended for multilingual text)
tokenizer = xlmr_tok
MAX_LEN = 16

encoded = tokenizer.encode_plus(
    texts[0],
    add_special_tokens=True,
    max_length=MAX_LEN,
    truncation=True,
    padding='max_length',    # pad to MAX_LEN
    return_attention_mask=True,
    return_tensors=None
)

print("\n=== Encoded single text (padded & truncated) ===")
pprint(encoded)
print(f"Vocab size: {tokenizer.vocab_size}")
print(f"Input length: {len(encoded['input_ids'])}")

# Show how attention mask works
pairs = list(zip(encoded["input_ids"], encoded["attention_mask"]))
print("\n(token_id, mask) pairs:")
for tid, mask in pairs:
    print(f"{tid:5d} | {mask}")

# Explanation:
# - `padding='max_length'` pads every sequence to the same length (`MAX_LEN`).
# - `truncation=True` cuts longer sequences to fit `max_length`.
# - `attention_mask` is 1 for real tokens and 0 for padding, so the model ignores the padding.
# - `special_tokens_map` shows which symbols (`<s>`, `</s>`, `[CLS]`, `[SEP]`) each tokenizer uses.
# - `vocab_size` is the number of distinct tokens in the tokenizer's vocabulary.



# Step 4 – Loading and Exploring the Dataset

**Goal:** Load the dataset and explore its structure before training.

We'll:
- Read the CSV files for training and testing.
- Display the first few rows to understand how the data is organized.
- Identify which columns contain the text and the target labels.

Columns in this dataset:
- `premise` and `hypothesis` → the sentence pair to classify.
- `label` → the category (0, 1, or 2).

We'll also check the dataset size and possible missing values.


In [8]:
import pandas as pd

# --- Load the datasets ---
train_df = pd.read_csv("dataset/train.csv")
test_df = pd.read_csv("dataset/test.csv")
sample_df = pd.read_csv("Basics of BERT and XLM-RoBERTa - PyTorch/sample_submission.csv")

# --- Explore the structure ---
print("=== TRAIN DATA ===")
print("Shape:", train_df.shape)
display(train_df.head())

print("\n=== TEST DATA ===")
print("Shape:", test_df.shape)
display(test_df.head())

print("\n=== SAMPLE SUBMISSION ===")
display(sample_df.head())

# Optional: quick data info
print("\nData types:")
print(train_df.dtypes)

print("\nCheck missing values:")
print(train_df.isnull().sum())

# Conclusion:
# - The dataset was successfully loaded and inspected; no column has missing values.
# - The columns needed for training are `premise`, `hypothesis`, and `label`.
# - Next step: tokenize the `premise`/`hypothesis` pairs to prepare them for fine-tuning.



=== TRAIN DATA ===
Shape: (12120, 6)


| | id | premise | hypothesis | lang_abv | language | label |
|---|---|---|---|---|---|---|
| 0 | 5130fd2cb5 | and these comments were considered in formulat... | The rules developed in the interim were put to... | en | English | 0 |
| 1 | 5b72532a0b | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... | en | English | 2 |
| 2 | 3931fbe82a | Des petites choses comme celles-là font une di... | J'essayais d'accomplir quelque chose. | fr | French | 0 |
| 3 | 5622f0c60b | you know they can't really defend themselves l... | They can't defend themselves because of their ... | en | English | 0 |
| 4 | 86aaa48b45 | ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด... | เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร | th | Thai | 1 |



=== TEST DATA ===
Shape: (5195, 5)


| | id | premise | hypothesis | lang_abv | language |
|---|---|---|---|---|---|
| 0 | c6d58c3f69 | بکس، کیسی، راہیل، یسعیاہ، کیلی، کیلی، اور کولم... | کیسی کے لئے کوئی یادگار نہیں ہوگا, کولمین ہائی... | ur | Urdu |
| 1 | cefcc82292 | هذا هو ما تم نصحنا به. | عندما يتم إخبارهم بما يجب عليهم فعله ، فشلت ال... | ar | Arabic |
| 2 | e98005252c | et cela est en grande partie dû au fait que le... | Les mères se droguent. | fr | French |
| 3 | 58518c10ba | 与城市及其他公民及社区组织代表就IMA的艺术发展进行对话&amp | IMA与其他组织合作，因为它们都依靠共享资金。 | zh | Chinese |
| 4 | c32b0d16df | Она все еще была там. | Мы думали, что она ушла, однако, она осталась. | ru | Russian |



=== SAMPLE SUBMISSION ===


| | id | prediction |
|---|---|---|
| 0 | c6d58c3f69 | 1 |
| 1 | cefcc82292 | 1 |
| 2 | e98005252c | 1 |
| 3 | 58518c10ba | 1 |
| 4 | c32b0d16df | 1 |



Data types:
id            object
premise       object
hypothesis    object
lang_abv      object
language      object
label          int64
dtype: object

Check missing values:
id            0
premise       0
hypothesis    0
lang_abv      0
language      0
label         0
dtype: int64


# Step 5 – Creating Cross-Validation Folds

**Goal:** Create several train/validation splits while keeping label proportions balanced.

We'll use **StratifiedKFold** from `sklearn.model_selection` to create 5 folds.  
Each fold will contain a unique validation set and the remaining data for training.

This helps evaluate model performance more reliably, especially on small datasets.
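
The idea behind stratification can be sketched in plain Python: group the sample indices by label, shuffle each group, and deal them round-robin into the folds. This is a simplified illustration, not scikit-learn's `StratifiedKFold` algorithm, so the exact fold sizes can differ slightly from what the library produces. `stratified_folds` is a hypothetical helper, and the label counts are the ones reported by `value_counts()` for this training set.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=5, seed=42):
    """Assign every sample index to one validation fold, balancing labels.

    Simplified sketch of stratification (hypothetical helper), not the
    scikit-learn implementation.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal this label's indices round-robin so each fold gets an equal share
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Label counts from this dataset: 4176 x label 0, 3880 x label 1, 4064 x label 2
labels = [0] * 4176 + [1] * 3880 + [2] * 4064
folds = stratified_folds(labels, n_splits=5, seed=42)

for k, val_idx in enumerate(folds, 1):
    counts = Counter(labels[i] for i in val_idx)
    print(f"Fold {k}: {len(val_idx)} validation samples, "
          f"label counts {dict(sorted(counts.items()))}")
```

Each fold serves once as the validation set while the other four folds form the training set, so every sample is validated exactly once.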


In [9]:
from sklearn.model_selection import StratifiedKFold
import numpy as np
import pandas as pd

# train_df was loaded in Step 4; stratification uses its 'label' column
print("Dataset shape:", train_df.shape)
print(train_df['label'].value_counts())

# --- Initialize StratifiedKFold ---
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Lists to store train/validation indexes
folds = []

for fold, (train_idx, val_idx) in enumerate(kf.split(train_df, train_df['label']), 1):
    print(f"\n=== Fold {fold} ===")
    print(f"Train indices: {len(train_idx)}, Validation indices: {len(val_idx)}")

    train_data = train_df.iloc[train_idx]
    val_data   = train_df.iloc[val_idx]

    folds.append((train_data, val_data))

    # Optional: quick label check
    print("Label distribution (val):")
    print(val_data['label'].value_counts(normalize=True))

# Example: access first fold
train_fold_1, val_fold_1 = folds[0]
print("\nFirst fold samples:")
display(train_fold_1.head())
display(val_fold_1.head())

# Conclusion:
# - 5 folds were created using `StratifiedKFold` with shuffled data.
# - Each fold keeps nearly the same label ratio as the original dataset.
# - We'll use these splits later to train and validate the model across multiple runs.



Dataset shape: (12120, 6)
label
0    4176
2    4064
1    3880
Name: count, dtype: int64

=== Fold 1 ===
Train indices: 9696, Validation indices: 2424
Label distribution (val):
label
0    0.344884
2    0.334983
1    0.320132
Name: proportion, dtype: float64

=== Fold 2 ===
Train indices: 9696, Validation indices: 2424
Label distribution (val):
label
0    0.344472
2    0.335396
1    0.320132
Name: proportion, dtype: float64

=== Fold 3 ===
Train indices: 9696, Validation indices: 2424
Label distribution (val):
label
0    0.344472
2    0.335396
1    0.320132
Name: proportion, dtype: float64

=== Fold 4 ===
Train indices: 9696, Validation indices: 2424
Label distribution (val):
label
0    0.344472
2    0.335396
1    0.320132
Name: proportion, dtype: float64

=== Fold 5 ===
Train indices: 9696, Validation indices: 2424
Label distribution (val):
label
0    0.344472
2    0.335396
1    0.320132
Name: proportion, dtype: float64

First fold samples:


| | id | premise | hypothesis | lang_abv | language | label |
|---|---|---|---|---|---|---|
| 0 | 5130fd2cb5 | and these comments were considered in formulat... | The rules developed in the interim were put to... | en | English | 0 |
| 1 | 5b72532a0b | These are issues that we wrestle with in pract... | Practice groups are not permitted to work on t... | en | English | 2 |
| 2 | 3931fbe82a | Des petites choses comme celles-là font une di... | J'essayais d'accomplir quelque chose. | fr | French | 0 |
| 4 | 86aaa48b45 | ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด... | เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร | th | Thai | 1 |
| 5 | ed7d6a1e62 | Bir çiftlikte birisinin, ağıla kapatılmış bu ö... | Çiftlikte insanlar farklı terimler kullanırlar. | tr | Turkish | 0 |


| | id | premise | hypothesis | lang_abv | language | label |
|---|---|---|---|---|---|---|
| 3 | 5622f0c60b | you know they can't really defend themselves l... | They can't defend themselves because of their ... | en | English | 0 |
| 6 | 5a0f4908a0 | ریاست ہائے متحدہ امریکہ واپس آنے پر، ہج ایف بی... | ہیگ کی تفتیش ایف بی آئی اہلکاروں کی طرف سے کی... | ur | Urdu | 0 |
| 21 | ad5a79456e | Increased saving by current generations would ... | Current generations' increased saving would ex... | en | English | 0 |
| 26 | 6a98a077a5 | It will be COLOSSAL!" | It will be gigantic. | en | English | 0 |
| 27 | 00c0cdf348 | Lạnh hơn và xa hơn bao giờ hết đã phát triển t... | Giọng của Chúa cảm thấy thật xa xôi và lạnh lẽo | vi | Vietnamese | 0 |
