Preprocess & Fine-Tune Transformer-Based Models

1. Understanding BERT And XLM-RoBERTa 
BERT uses WordPiece tokenization, which breaks words down into subwords or word pieces.
- BERT-Base: The standard version with 12 layers (transformer blocks), 768 hidden units, and 12 attention heads.
- BERT-Large: A larger version with 24 layers, 1024 hidden units, and 16 attention heads.
BERT is typically used for tasks like text classification, named entity recognition (NER), question answering (QA), and sentiment analysis.
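
To make the WordPiece idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the general strategy WordPiece tokenizers apply to each word. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical and chosen only for illustration; the real `bert-base-uncased` vocabulary is far larger and its splits may differ.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary.
toy_vocab = {"token", "##ization", "##ize", "play", "##ing", "##ed"}

tokens_a = wordpiece_tokenize("tokenization", toy_vocab)
tokens_b = wordpiece_tokenize("playing", toy_vocab)
tokens_c = wordpiece_tokenize("xyz", toy_vocab)
print(tokens_a, tokens_b, tokens_c)
```

A word that cannot be covered by any sequence of vocabulary pieces falls back to the unknown token, which is why a good subword vocabulary matters.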

Unlike BERT, XLM-RoBERTa uses a SentencePiece tokenizer rather than WordPiece. It still splits words into subword units, but it learns a single vocabulary shared across languages, made up of subwords that are common in the training corpora. This lets one model handle many languages. XLM-RoBERTa is trained on text in around 100 languages.

- XLM-RoBERTa-Base: 12 layers with 768 hidden units and 12 attention heads (similar to BERT-Base but multilingual).
- XLM-RoBERTa-Large: 24 layers, 1024 hidden units, and 16 attention heads (similar to BERT-Large but multilingual).
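XLM-RoBERTa excels in tasks involving multiple languages, such as multilingual sentiment analysis, cross-lingual transfer learning, and cross-lingual natural language inference. As an encoder-only model it does not generate text, so it is not a translation model on its own.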

In [19]:
from transformers import BertTokenizer, XLMRobertaTokenizer

In [20]:
# 2. Tokenizing Text
# Load pre-trained BERT tokenizer 
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')

text = "Hello, how are you doing today?"

tokenized_output_bert = tokenizer_bert.encode_plus(
    text, 
    add_special_tokens=True,  # Adds [CLS] and [SEP] tokens
    padding='max_length',     # Pad the sequences to a fixed length
    max_length=10,            # Maximum sequence length
    return_tensors='pt',      # Return PyTorch tensors
    truncation=True           # Truncate if input exceeds max_length
)

# Output tokenized result
print("Tokenized BERT output:", tokenized_output_bert)

# Decode the tokenized input back to text
decoded_text_bert = tokenizer_bert.decode(tokenized_output_bert['input_ids'][0], skip_special_tokens=True)
print("Decoded BERT text:", decoded_text_bert)

Tokenized BERT output: {'input_ids': tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Decoded BERT text: hello, how are you doing today?


In [21]:
# Load pre-trained XLM-RoBERTa tokenizer (xlm-roberta-base)
tokenizer_xlm = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

# Sample text
text_xlm = "Привет! Как дела? Я тебя люблю!"

# Tokenizing the text using `encode_plus`
tokenized_output_xlm = tokenizer_xlm.encode_plus(
    text_xlm, 
    add_special_tokens=True,  # Adds <s> and </s> tokens
    padding='max_length',     # Pad the sequences to a fixed length
    max_length=15,            # Maximum sequence length
    return_tensors='pt',      # Return PyTorch tensors
    truncation=True           # Truncate if input exceeds max_length
)

# Output tokenized result
print("Tokenized XLM-RoBERTa output:", tokenized_output_xlm)

# Decode the tokenized input back to text
decoded_text_xlm = tokenizer_xlm.decode(tokenized_output_xlm['input_ids'][0], skip_special_tokens=True)
print("Decoded XLM-RoBERTa text:", decoded_text_xlm)

Tokenized XLM-RoBERTa output: {'input_ids': tensor([[    0,  1813, 18454,    38,  5187,  7843,    32,  1509, 21136, 81880,
            38,     2,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}
Decoded XLM-RoBERTa text: Привет! Как дела? Я тебя люблю!


In [22]:
# 3. Preparing Input Data For The Model
# Special tokens
print("BERT Special Tokens:", tokenizer_bert.special_tokens_map)
print("XLM-RoBERTa Special Tokens:", tokenizer_xlm.special_tokens_map)

BERT Special Tokens: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
XLM-RoBERTa Special Tokens: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}


In [23]:
# Vocabulary Size
print("BERT Vocabulary Size:", tokenizer_bert.vocab_size)
print("XLM-RoBERTa Vocabulary Size:", tokenizer_xlm.vocab_size)

BERT Vocabulary Size: 30522
XLM-RoBERTa Vocabulary Size: 250002
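
The vocabulary sizes printed above have a direct cost. As a rough worked example, assuming the token-embedding matrix has shape vocabulary size x hidden size (768 for both base models, as listed in section 1), the number of token-embedding parameters can be estimated as follows. This ignores position embeddings and all other layers.

```python
HIDDEN_SIZE = 768  # hidden units for both base models, as listed in section 1

vocab_sizes = {
    "bert-base-uncased": 30522,   # printed by the cell above
    "xlm-roberta-base": 250002,   # printed by the cell above
}

# One row of HIDDEN_SIZE weights per vocabulary entry.
embedding_params = {name: size * HIDDEN_SIZE for name, size in vocab_sizes.items()}

for name, count in embedding_params.items():
    print(f"{name}: {count:,} token-embedding parameters")
```

Under that assumption the multilingual vocabulary accounts for roughly eight times as many embedding parameters as the English-only one.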


In [25]:
# Attention mask
print("BERT Input IDs:", tokenized_output_bert["input_ids"])
print("BERT Attention Mask:", tokenized_output_bert["attention_mask"])

print("XLM-RoBERTa Input IDs:", tokenized_output_xlm["input_ids"])
print("XLM-RoBERTa Attention Mask:", tokenized_output_xlm["attention_mask"])

BERT Input IDs: tensor([[ 101, 7592, 1010, 2129, 2024, 2017, 2725, 2651, 1029,  102]])
BERT Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
XLM-RoBERTa Input IDs: tensor([[    0,  1813, 18454,    38,  5187,  7843,    32,  1509, 21136, 81880,
            38,     2,     1,     1,     1]])
XLM-RoBERTa Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])
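
The relationship between padding and the attention mask can be sketched in plain Python. The helper `pad_with_mask` is hypothetical and only mimics what `padding='max_length'` produces; the ids are the 12 non-padding XLM-RoBERTa ids printed above, and the pad id 1 is read off the same output.

```python
def pad_with_mask(token_ids, max_length, pad_id):
    """Right-pad token_ids to max_length and build the matching attention mask."""
    n_pad = max_length - len(token_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than max_length; truncate it first")
    padded_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return padded_ids, attention_mask


PAD_ID = 1  # <pad> id for xlm-roberta-base, as seen in the output above
unpadded_ids = [0, 1813, 18454, 38, 5187, 7843, 32, 1509, 21136, 81880, 38, 2]

padded_ids, attention_mask = pad_with_mask(unpadded_ids, max_length=15, pad_id=PAD_ID)
print(padded_ids)
print(attention_mask)
```

The BERT example has no zeros in its mask because the sentence fills all 10 positions exactly, so no padding was added.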


In [26]:
# 4. Loading And Exploring The Dataset
import pandas as pd
train = pd.read_csv('/Users/patash/PSTB/Week_6_LLM/day_1/train.csv')
test = pd.read_csv('/Users/patash/PSTB/Week_6_LLM/day_1/test.csv')

In [27]:
print(train.head())
print(test.head())

           id                                            premise  \
0  5130fd2cb5  and these comments were considered in formulat...   
1  5b72532a0b  These are issues that we wrestle with in pract...   
2  3931fbe82a  Des petites choses comme celles-là font une di...   
3  5622f0c60b  you know they can't really defend themselves l...   
4  86aaa48b45  ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...   

                                          hypothesis lang_abv language  label  
0  The rules developed in the interim were put to...       en  English      0  
1  Practice groups are not permitted to work on t...       en  English      2  
2              J'essayais d'accomplir quelque chose.       fr   French      0  
3  They can't defend themselves because of their ...       en  English      0  
4    เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร       th     Thai      1  
           id                                            premise  \
0  c6d58c3f69  بکس، کیسی، راہیل، یسعیاہ، کی

In [28]:
print(train.shape)
print(test.shape)

(12120, 6)
(5195, 5)


In [29]:
df_train_filtered = train[['premise', 'hypothesis', 'label']].copy()

print(df_train_filtered.head())

                                             premise  \
0  and these comments were considered in formulat...   
1  These are issues that we wrestle with in pract...   
2  Des petites choses comme celles-là font une di...   
3  you know they can't really defend themselves l...   
4  ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...   

                                          hypothesis  label  
0  The rules developed in the interim were put to...      0  
1  Practice groups are not permitted to work on t...      2  
2              J'essayais d'accomplir quelque chose.      0  
3  They can't defend themselves because of their ...      0  
4    เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร      1  


In [31]:
import torch

# Reload the xlm-roberta-base tokenizer
tokenizer_xlm = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

# Tokenize the 'premise' and 'hypothesis' columns separately

df_train_filtered['premise_tokenized'] = df_train_filtered['premise'].apply(
    lambda x: tokenizer_xlm.encode_plus(
        x,
        add_special_tokens=True,
        padding='max_length',
        truncation=True,
        max_length=512,
        return_tensors=None
    )
)

df_train_filtered['hypothesis_tokenized'] = df_train_filtered['hypothesis'].apply(
    lambda x: tokenizer_xlm.encode_plus(
        x,
        add_special_tokens=True,
        padding='max_length',
        truncation=True,
        max_length=512,
        return_tensors=None
    )
)

# Extract input_ids and attention_mask into separate columns
df_train_filtered['premise_input_ids'] = df_train_filtered['premise_tokenized'].apply(lambda x: x['input_ids'])
df_train_filtered['premise_attention_mask'] = df_train_filtered['premise_tokenized'].apply(lambda x: x['attention_mask'])

df_train_filtered['hypothesis_input_ids'] = df_train_filtered['hypothesis_tokenized'].apply(lambda x: x['input_ids'])
df_train_filtered['hypothesis_attention_mask'] = df_train_filtered['hypothesis_tokenized'].apply(lambda x: x['attention_mask'])

# Show the first 5 rows
print(df_train_filtered.head())


                                             premise  \
0  and these comments were considered in formulat...   
1  These are issues that we wrestle with in pract...   
2  Des petites choses comme celles-là font une di...   
3  you know they can't really defend themselves l...   
4  ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...   

                                          hypothesis  label  \
0  The rules developed in the interim were put to...      0   
1  Practice groups are not permitted to work on t...      2   
2              J'essayais d'accomplir quelque chose.      0   
3  They can't defend themselves because of their ...      0   
4    เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร      1   

             premise_tokenized         hypothesis_tokenized  \
0  [input_ids, attention_mask]  [input_ids, attention_mask]   
1  [input_ids, attention_mask]  [input_ids, attention_mask]   
2  [input_ids, attention_mask]  [input_ids, attention_mask]   
3  [input_ids, attention_mask]  
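
The cell above encodes the premise and the hypothesis as two independent sequences. A common alternative for sentence-pair tasks is to encode both texts as one sequence, which Hugging Face tokenizers do when a second text is passed alongside the first. For RoBERTa-style tokenizers such as XLM-RoBERTa the resulting layout is `<s> A </s></s> B </s>`. The sketch below only illustrates that layout: `build_pair` is a hypothetical helper and the subword ids are made up.

```python
BOS_ID, EOS_ID = 0, 2  # <s> and </s> ids seen in the xlm-roberta-base output earlier


def build_pair(premise_ids, hypothesis_ids):
    """Join two already-tokenized texts as <s> A </s></s> B </s>."""
    return [BOS_ID] + premise_ids + [EOS_ID, EOS_ID] + hypothesis_ids + [EOS_ID]


# Hypothetical subword ids, used only to show the layout.
premise_ids = [11, 12, 13]
hypothesis_ids = [21, 22]

pair_ids = build_pair(premise_ids, hypothesis_ids)
print(pair_ids)
```

Whether separate or joint encoding is appropriate depends on the model head used later; a standard sequence-classification head expects the joint form.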

In [32]:
# Cross-validation

import numpy as np
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import TensorDataset

# Convert the token id and mask columns to tensors
premise_input_ids = torch.tensor(df_train_filtered['premise_input_ids'].tolist())
hypothesis_input_ids = torch.tensor(df_train_filtered['hypothesis_input_ids'].tolist())
premise_attention_mask = torch.tensor(df_train_filtered['premise_attention_mask'].tolist())
hypothesis_attention_mask = torch.tensor(df_train_filtered['hypothesis_attention_mask'].tolist())
labels = torch.tensor(df_train_filtered['label'].tolist())

# PyTorch dataset with 5 tensors: premise_input_ids, hypothesis_input_ids, premise_attention_mask, hypothesis_attention_mask, and labels
dataset = TensorDataset(premise_input_ids, hypothesis_input_ids, premise_attention_mask, hypothesis_attention_mask, labels)
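
Because every row was padded to 512 tokens, these tensors are fairly large. A back-of-the-envelope estimate, using the 12120 training rows reported by `train.shape` and assuming `torch.tensor()` stores Python integers as 64-bit values (its default for integer lists):

```python
n_rows = 12120        # rows in train, from train.shape
max_length = 512      # padding length used during tokenization
n_tensors = 4         # two input_ids tensors and two attention_mask tensors
bytes_per_value = 8   # assumption: int64, the default dtype for Python ints

total_bytes = n_rows * max_length * n_tensors * bytes_per_value
total_mib = total_bytes / 2**20
print(f"{total_bytes:,} bytes (about {total_mib:.0f} MiB)")
```

Most of that space holds padding, since the premises and hypotheses shown earlier are far shorter than 512 tokens. A smaller `max_length` or per-batch padding would reduce it.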

In [None]:
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import TensorDataset, DataLoader

# Create a StratifiedKFold splitter that generates 5 folds
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Lists holding the training and validation DataLoaders for each fold
train_dataloaders = []
val_dataloaders = []

# Apply the StratifiedKFold indices and build the DataLoaders
for train_index, val_index in kf.split(premise_input_ids, labels):
    # Build the training and validation subsets of the dataset for this fold
    train_subset = TensorDataset(
        premise_input_ids[train_index],
        hypothesis_input_ids[train_index],
        premise_attention_mask[train_index],
        hypothesis_attention_mask[train_index],
        labels[train_index]
    )
    
    val_subset = TensorDataset(
        premise_input_ids[val_index],
        hypothesis_input_ids[val_index],
        premise_attention_mask[val_index],
        hypothesis_attention_mask[val_index],
        labels[val_index]
    )
    
    # Create DataLoaders for the two subsets
    train_dataloader = DataLoader(train_subset, batch_size=32, shuffle=True)
    val_dataloader = DataLoader(val_subset, batch_size=32, shuffle=False)
    
    # Append the DataLoaders to the lists
    train_dataloaders.append(train_dataloader)
    val_dataloaders.append(val_dataloader)
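
To illustrate what "stratified" means in the split above, here is a simplified standard-library sketch that deals the indices of each class round-robin across folds, so that every fold ends up with a similar label mix. It is a stand-in for illustration only, not scikit-learn's `StratifiedKFold` implementation, and the labels are a hypothetical toy set rather than the real training labels.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=42):
    """Assign each index to a fold so that every fold gets a similar label mix."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the indices of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Hypothetical toy labels: 10 examples each of classes 0, 1 and 2.
toy_labels = [0] * 10 + [1] * 10 + [2] * 10

folds = stratified_folds(toy_labels, n_splits=5)
fold_label_counts = [Counter(toy_labels[i] for i in fold) for fold in folds]
print(fold_label_counts)
```

Each fold serves once as the validation set while the remaining folds form the training set, which is the role `val_index` and `train_index` play in the loop above.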
