## TP5: Pre-training and Fine-tuning a Language Model

### Objectives
1. Understand the basics of pre-training a simple model.  
2. Fine-tune a pre-trained model (BERT) for a text classification task.  
3. Work with real textual data and train a model on it.  

### Part 1: Pre-training (Simple Example)  
#### Introduction  
This part simulates a simplified form of pre-training: a small model is trained to predict masked words in sentences.  

#### 1. Create a Text Dataset  
- Provide a simple corpus (or generate one) containing short sentences.  
- Example corpus:  
```
The cat sleeps on the mat.  
The dog plays in the garden.  
The car is red.  
```  

In [1]:
# Creating a simple corpus
corpus = [
    "The cat sleeps on the mat.",
    "The dog plays in the garden.",
    "The car is red."
]

# Saving the corpus to a text file
with open("corpus.txt", "w", encoding="utf-8") as file:
    for sentence in corpus:
        file.write(sentence + "\n")

print("The corpus has been created and saved in the file 'corpus.txt'.")

The corpus has been created and saved in the file 'corpus.txt'.


#### 2. Preprocess the Data  
- **Tokenization**: Split sentences into words.  
- **Word Masking**: Replace one randomly chosen word in each sentence with `[MASK]`; the model is then trained to predict it.  
    - Example: "The cat sleeps [MASK] the mat."  
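
The two steps above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper names `tokenize` and `mask_one_word`, the regular expression, and the fixed seed are assumptions made for the example. Unlike a plain `str.split()`, the regex separates trailing punctuation from words, so a masked target never carries a period with it (as in `garden.`).

```python
import random
import re

def tokenize(sentence):
    """Split a sentence into word tokens and standalone punctuation marks."""
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence)

def mask_one_word(tokens, rng):
    """Replace one randomly chosen word (never punctuation) with [MASK]."""
    word_positions = [i for i, tok in enumerate(tokens) if tok[0].isalnum()]
    if not word_positions:
        return None  # nothing to mask in an empty or punctuation-only sentence
    position = rng.choice(word_positions)
    masked = list(tokens)  # copy so the original token list is left untouched
    original = masked[position]
    masked[position] = "[MASK]"
    return masked, original

rng = random.Random(0)  # seeded generator (assumption) so the masks are reproducible
tokens = tokenize("The cat sleeps on the mat.")
masked_tokens, target = mask_one_word(tokens, rng)
print(tokens)
print(" ".join(masked_tokens), "|", target)
```

Passing the random generator explicitly keeps the masking reproducible without touching the global `random` state.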


In [2]:
import random

# Loading the corpus
with open("corpus.txt", "r", encoding="utf-8") as file:
    corpus = file.readlines()

# Preprocessing: Tokenization and Masking
preprocessed_data = []
for sentence in corpus:
    # Remove newline characters and tokenize
    tokens = sentence.strip().split()
    
    # Display the tokenization
    print(f"Original sentence: {sentence.strip()}")
    print(f"Tokenization: {tokens}")
    
    # Choose a random word to mask
    if len(tokens) > 1:  # Skip empty and single-word sentences
        mask_index = random.randint(0, len(tokens) - 1)
        original_word = tokens[mask_index]
        tokens[mask_index] = "[MASK]"
        
        # Reconstruct the masked sentence
        masked_sentence = " ".join(tokens)
        preprocessed_data.append((masked_sentence, original_word))  # (masked sentence, original word)
        
        # Display the masked sentence
        print(f"Masked sentence: {masked_sentence} | Masked word: {original_word}")
        print("-" * 50)  # Separator for readability

# Saving the preprocessed data
with open("preprocessed_data.txt", "w", encoding="utf-8") as file:
    for masked, word in preprocessed_data:
        file.write(f"{masked} | {word}\n")

print("\nThe preprocessed data has been saved to the file 'preprocessed_data.txt'.")


Original sentence: The cat sleeps on the mat.
Tokenization: ['The', 'cat', 'sleeps', 'on', 'the', 'mat.']
Masked sentence: The [MASK] sleeps on the mat. | Masked word: cat
--------------------------------------------------
Original sentence: The dog plays in the garden.
Tokenization: ['The', 'dog', 'plays', 'in', 'the', 'garden.']
Masked sentence: The dog plays in the [MASK] | Masked word: garden.
--------------------------------------------------
Original sentence: The car is red.
Tokenization: ['The', 'car', 'is', 'red.']
Masked sentence: The [MASK] is red. | Masked word: car
--------------------------------------------------

The preprocessed data has been saved to the file 'preprocessed_data.txt'.
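
Each saved line has the form `masked sentence | original word`. Before a model can consume these pairs, every token needs an integer index. The sketch below shows one way to build that mapping; the helper name `build_vocab` and the choice to reserve index 0 for `<PAD>` are assumptions for illustration. Sorting the tokens keeps the indices identical from one run to the next, which iterating over a plain `set` does not guarantee.

```python
def build_vocab(lines):
    """Parse 'masked sentence | word' lines and build a deterministic word-to-index map."""
    pairs = []
    tokens = set()
    for line in lines:
        # Split on the last separator only, in case the sentence itself contains " | "
        masked_sentence, word = line.strip().rsplit(" | ", 1)
        pairs.append((masked_sentence, word))
        tokens.update(masked_sentence.split())
        tokens.add(word)
    word_to_idx = {"<PAD>": 0}  # padding index fixed at 0 (assumption)
    for token in sorted(tokens):  # sorted order gives the same indices on every run
        word_to_idx[token] = len(word_to_idx)
    return pairs, word_to_idx

# Lines copied from the preprocessing output shown above
lines = [
    "The [MASK] sleeps on the mat. | cat",
    "The dog plays in the [MASK] | garden.",
    "The [MASK] is red. | car",
]
pairs, word_to_idx = build_vocab(lines)
encoded = [[word_to_idx[tok] for tok in sentence.split()] for sentence, _ in pairs]
print(len(word_to_idx))
print(encoded[2])
```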


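#### 3. Create a Simple Model  
Build a model capable of predicting a masked word in a sentence.  
- **1. Add an embedding layer** (use `nn.Embedding`): Map each word index to a dense numerical vector.  
- **2. Add a dense layer** (use `nn.Linear`): Map the averaged sentence embedding to one score (logit) per vocabulary word.  
- **3. Add a softmax layer** (use `nn.Softmax`): Convert the scores into probabilities to predict the masked word. Note that `nn.CrossEntropyLoss` already applies log-softmax internally and expects raw scores, so feeding it softmax outputs compresses the loss into a narrow range and slows learning.  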

#### Minimal Model Structure:
- **Input**: Indices of the words in the sentence.  
- **Output**: Probabilities for each word in the vocabulary.  
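
To make the input and output shapes concrete, here is a small sketch of such a model. It is one possible variant rather than the reference solution: the class name `BagOfWordsMaskedLM`, the `pad_idx` argument, and the padding-aware averaging are assumptions added for illustration. The forward pass returns raw scores, and softmax is applied only when probabilities are needed.

```python
import torch
import torch.nn as nn

class BagOfWordsMaskedLM(nn.Module):
    """Embedding -> mean over non-padding positions -> linear layer (raw scores)."""

    def __init__(self, vocab_size, embedding_dim, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.fc = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) word indices
        embeddings = self.embedding(x)                    # (batch, seq_len, dim)
        keep = (x != self.pad_idx).unsqueeze(-1).float()  # 1.0 for real tokens, 0.0 for padding
        summed = (embeddings * keep).sum(dim=1)           # (batch, dim)
        counts = keep.sum(dim=1).clamp(min=1.0)           # avoid dividing by zero
        return self.fc(summed / counts)                   # logits: (batch, vocab_size)

torch.manual_seed(0)
model = BagOfWordsMaskedLM(vocab_size=15, embedding_dim=10)
batch = torch.tensor([[3, 1, 4, 5, 6, 7],
                      [3, 1, 8, 9, 0, 0]])  # second sentence padded with index 0
logits = model(batch)
probabilities = torch.softmax(logits, dim=-1)  # probabilities only for inspection
print(logits.shape)
print(probabilities.sum(dim=-1))
```

Excluding padding from the average means a sentence gets the same representation whether or not it was padded to match a longer one in the batch.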

#### 4. Train the Model
- **Task**: Predict masked words from their contexts.  
- **Loss Function**: Cross-Entropy Loss.  
- **Embedding dimension** (`embedding_dim`): 10  
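
Cross-entropy is the negative log of the probability the model assigns to the correct word. PyTorch's `nn.CrossEntropyLoss` applies the softmax itself, so it expects raw scores. The pure-Python sketch below (function names are illustrative and the scores are made up) shows what happens when probabilities are passed instead: the loss cannot approach zero, even for a confident and correct prediction.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Same convention as nn.CrossEntropyLoss: softmax is applied to `scores` internally."""
    return -math.log(softmax(scores)[target])

vocab_size = 15
target = 2
logits = [0.0] * vocab_size
logits[target] = 10.0  # a confident, correct prediction (made-up scores)

loss_on_logits = cross_entropy(logits, target)          # intended usage
loss_on_probs = cross_entropy(softmax(logits), target)  # softmax applied twice
# Smallest loss reachable when the inputs are probabilities (a one-hot vector)
floor = math.log(math.e + vocab_size - 1) - 1
print(f"logits: {loss_on_logits:.4f}  probabilities: {loss_on_probs:.4f}  floor: {floor:.4f}")
```

For a 15-word vocabulary the floor is roughly 1.82, while a uniform guess gives ln 15 ≈ 2.71, so a model that outputs probabilities before the loss can only move within that narrow band.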


In [3]:
! pip install transformers torch



In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Loading preprocessed data
class MaskedDataset(Dataset):
    def __init__(self, file_path):
        self.data = []
        self.vocab = set()
        
        # Loading data
        with open(file_path, "r", encoding="utf-8") as file:
            for line in file:
                masked_sentence, word = line.strip().split(" | ")
                self.data.append((masked_sentence, word))
                
                # Building the vocabulary
                self.vocab.update(masked_sentence.split())
                self.vocab.add(word)
        
        # Adding the <PAD> token
        self.vocab.add("<PAD>")
        
        # Converting the vocabulary to indices (sorted so the mapping is the same on every run)
        self.word_to_idx = {word: idx for idx, word in enumerate(sorted(self.vocab))}
        self.idx_to_word = {idx: word for word, idx in self.word_to_idx.items()}
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        masked_sentence, word = self.data[idx]
        sentence_indices = [self.word_to_idx[token] for token in masked_sentence.split()]
        target_index = self.word_to_idx[word]
        return torch.tensor(sentence_indices, dtype=torch.long), torch.tensor(target_index, dtype=torch.long)
    
    def vocab_size(self):
        return len(self.vocab)

# collate_fn function for padding
def collate_fn(batch):
    # Extract sentences and targets
    sentences, targets = zip(*batch)
    
    # Find the maximum sentence length in the batch
    max_len = max(len(sentence) for sentence in sentences)
    
    # Pad every sentence to max_len with the <PAD> index
    # (relies on the global `dataset` object created below)
    pad_idx = dataset.word_to_idx["<PAD>"]
    padded_sentences = [
        torch.cat([sentence, torch.full((max_len - len(sentence),), pad_idx, dtype=torch.long)])
        for sentence in sentences
    ]
    
    # Convert targets to a tensor
    targets = torch.stack(targets)
    
    return torch.stack(padded_sentences), targets

# Initializing the dataset and DataLoader
file_path = "preprocessed_data.txt"
dataset = MaskedDataset(file_path)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

# Creating the model
class MaskedWordPredictionModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(MaskedWordPredictionModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc = nn.Linear(embedding_dim, vocab_size)
        self.softmax = nn.Softmax(dim=-1)
    
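    def forward(self, x):
        # x: (batch, seq_len) word indices -> (batch, seq_len, embedding_dim)
        embeddings = self.embedding(x)
        # Average embeddings over all positions (this includes [MASK] and any <PAD> tokens)
        sentence_embedding = embeddings.mean(dim=1)
        logits = self.fc(sentence_embedding)
        # NOTE: nn.CrossEntropyLoss applies log-softmax itself and expects logits.
        # Returning probabilities follows the exercise statement but flattens the loss.
        probabilities = self.softmax(logits)
        return probabilities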

# Parameters
embedding_dim = 10
vocab_size = dataset.vocab_size()
model = MaskedWordPredictionModel(vocab_size, embedding_dim)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training
num_epochs = 5
for epoch in range(num_epochs):
    model.train()  # Back to training mode (model.eval() is called at the end of each epoch)
    total_loss = 0
    for sentences, targets in dataloader:
        optimizer.zero_grad()
        
        # Prediction
        outputs = model(sentences)
        
        # Loss calculation
        loss = criterion(outputs, targets)
        total_loss += loss.item()
        
        # Backpropagation
        loss.backward()
        optimizer.step()
    
    # Print average loss for the epoch
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(dataloader):.4f}")
    
    # Check the model on one training example after each epoch (not a held-out test)
    model.eval()
    with torch.no_grad():
        for sentences, targets in dataloader:
            example_sentence = sentences[0]
            example_target = targets[0]
            
            # Prediction for an example
            outputs = model(example_sentence.unsqueeze(0))  # Add a batch dimension
            predicted_index = torch.argmax(outputs, dim=-1).item()
            predicted_word = dataset.idx_to_word[predicted_index]
            original_word = dataset.idx_to_word[example_target.item()]
            
            # Reconstruct the sentence with the predicted word
            example_sentence_tokens = [dataset.idx_to_word[idx.item()] for idx in example_sentence]
            reconstructed_sentence = " ".join(
                token if token != "[MASK]" else predicted_word for token in example_sentence_tokens
            )
            
            print(f"Masked sentence: {' '.join(example_sentence_tokens)}")
            print(f"Predicted word: {predicted_word} (Original word: {original_word})")
            print(f"Reconstructed sentence: {reconstructed_sentence}")
            print("-" * 50)
            break  # Display only the first example


Epoch 1/5, Loss: 2.6819
Masked sentence: The [MASK] sleeps on the mat.
Predicted word: garden. (Original word: cat)
Reconstructed sentence: The garden. sleeps on the mat.
--------------------------------------------------
Epoch 2/5, Loss: 2.6686
Masked sentence: The [MASK] is red. <PAD> <PAD>
Predicted word: car (Original word: car)
Reconstructed sentence: The car is red. <PAD> <PAD>
--------------------------------------------------
Epoch 3/5, Loss: 2.6593
Masked sentence: The dog plays in the [MASK]
Predicted word: garden. (Original word: garden.)
Reconstructed sentence: The dog plays in the garden.
--------------------------------------------------
Epoch 4/5, Loss: 2.6518
Masked sentence: The [MASK] sleeps on the mat.
Predicted word: cat (Original word: cat)
Reconstructed sentence: The cat sleeps on the mat.
--------------------------------------------------
Epoch 5/5, Loss: 2.6300
Masked sentence: The [MASK] sleeps on the mat.
Predicted word: cat (Original word: cat)
Reconstructed 