# Deep Learning with Python

Welcome to the **Deep Learning** course! This course is designed to give you hands-on experience with the foundational concepts and advanced techniques in deep learning. You will explore:

- Artificial Neural Networks and Gradient Descent
- Convolutional Neural Networks (CNNs) for Computer Vision
- Recurrent Neural Networks (RNNs) for Text Prediction
- Diffusion Transformers for Image Generation

Throughout the course, you'll engage in projects to solidify your understanding and gain practical skills in implementing deep learning algorithms.  

Instructor: Dr. Adrien Dorise  
Contact: adrien.dorise@hotmail.com  

---

## Part 3 - Recurrent Neural Networks for text prediction

In this project, you will build a language model using Long Short-Term Memory (LSTM) networks in PyTorch. This exercise will guide you through the essential steps:  

**1. Data Management**  

- **Objective**: Divide the dataset into training, validation, and test sets to ensure unbiased evaluation.
- **Steps**:
   - Loading and preprocessing a text dataset  
   - Preparing the dataset for sequential modelling  
   - Splitting data into training and testing sets  

**2. Tokenization**

- **Objective**: Transform text data into numerical values to be fed to a neural network.
- **Steps**:
   - Converting text into numerical representations using tokenization  
   - Padding sequences for uniform input size  
   - Creating word embeddings for efficient learning  

**3. Building an LSTM Model with PyTorch**  

- **Objective**: Build a language model based on LSTM architecture.
- **Steps**:
   - Defining an LSTM architecture  
   - Initializing model parameters  
   - Training the model with an appropriate loss function and optimizer  

**4. Prediction and Evaluation**  

- **Objective**: Evaluate the model performance on text generation.
- **Steps**:
   - Generating text predictions using the trained model  
   - Evaluating model performance using metrics like perplexity  
   - Fine-tuning hyperparameters for better results  


By completing this exercise, you will gain hands-on experience in implementing LSTMs for text prediction while understanding key challenges like sequence dependency, vanishing gradients, and text generation strategies.

*Credits*: This project is inspired by the *Text generation with LSTM in PyTorch* article on Machine Learning Mastery by Jason Brownlee, PhD.

---

## Dataset


In this exercise, we will take the data from **Alice in Wonderland** by Lewis Carroll.  
The data can be found in the text file *alice_in_wonderland.txt* in the same folder.  

The book was downloaded from this website:
`https://www.gutenberg.org/ebooks/11`

To simplify the prediction, the entire dataset is set to lowercase.


Reference article for this project: https://machinelearningmastery.com/text-generation-with-lstm-in-pytorch/  



In [None]:
# Import the txt file into a Python variable

filename = "alice_in_wonderland.txt"
# The context manager closes the file once it has been read
with open(filename, 'r', encoding='utf-8') as file:
    raw_text = file.read()
raw_text = raw_text.lower()  # Convert to lower case
#print(raw_text)

## Tokenisation

As you may have noticed, neural networks take numerical values as input and predict numerical values as output.
The keyword here is *value*: when we create a language model (for translation, correction…), the input is not a set of values, but text.  

In order to take text as input, we have to translate it into a neural-network-friendly format: scalars.  
This process is called tokenisation.

Three main methods are used in tokenisation:
- Character based tokenisation
- Word based tokenisation
- Subword based tokenisation
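
To make the difference between the first two methods concrete, here is a minimal sketch that tokenises one short sentence by character and by word. It is independent of the `Tokenizer` class used later in this notebook, and the sample sentence and variable names are illustrative assumptions only.

```python
import re

# Illustrative sentence (lowercase, like the dataset used in this project)
sample = "alice was beginning to get very tired of sitting by her sister"

# Character-based: one token per character, spaces included
char_tokens = list(sample)
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(char_tokens)))}

# Word-based: one token per whitespace-separated word
word_tokens = re.findall(r"\S+", sample)
word_vocab = {word: idx for idx, word in enumerate(sorted(set(word_tokens)))}

print(f"characters: {len(char_tokens)} tokens, vocabulary of {len(char_vocab)}")
print(f"words:      {len(word_tokens)} tokens, vocabulary of {len(word_vocab)}")

# Encoding replaces each token by its integer index in the vocabulary
char_ids = [char_vocab[ch] for ch in char_tokens]
word_ids = [word_vocab[word] for word in word_tokens]
```

Character tokens give long sequences, while word tokens give much shorter ones. On a full book the trade-off shows up in the vocabulary: the character vocabulary stays at a few dozen entries, whereas the word vocabulary grows to thousands.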

<img src="../docs/tokenisation.jpg" alt="tokenisation" width="800"/>  


The code snippet below defines a `Tokenizer` class that implements all three tokenisation methods: Character, Word, and Subword (BPE). It builds a vocabulary from the input text and supports encoding and decoding. 

### Methods
- `fit(text)`: Builds the vocabulary from the input text. It is called when the class is initialised.
- `encode(text)`: Converts text into tokenized representation.
- `decode(tokens)`: Converts tokens back into text.

### Your job:
  - Complete the short example given in the cell after the class.
  - Test all three methods and explain the differences in token lengths and total tokens needed to map the dataset.
  - You can glimpse the vocabulary used during tokenisation with the *vocab* variable.


In [None]:
from enum import Enum
import tokenizers as t
import re 

class Tokenisation_method(Enum):
    CHARACTER = 1
    WORD = 2
    SUBWORD = 3

class Tokenizer:
    def __init__(self, input_text, tokenisation_method=Tokenisation_method.CHARACTER):
        self.method = tokenisation_method
        self.input_text = input_text.lower()
        self.vocab = self.fit(self.input_text)
        self.n_vocab = len(self.vocab)

    def fit(self, text):
            """Builds a vocabulary from the given text."""

            if self.method is Tokenisation_method.CHARACTER:
                # Create a vocabulary mapping each unique character to a unique integer.
                # enumerate starts at 1 here, so we subtract 1 to get indices that start at 0
                vocab = {char: idx-1 for idx, char in enumerate(sorted(set(text)), start=1)}
            
            elif self.method is Tokenisation_method.WORD:
                # Create a vocabulary mapping each unique word to a unique integer
                # enumerate starts at 1 here, so we subtract 1 to get indices that start at 0
                
                # Tokenize while preserving '\n' as a separate token
                tokens = re.findall(r'\S+|\n', text)  # Matches words and newline characters
                vocab = {word: idx - 1 for idx, word in enumerate(sorted(set(tokens)), start=1)}
            
            elif self.method is Tokenisation_method.SUBWORD:
                # Subword-based tokenization using Byte Pair Encoding (BPE)
                  
                if isinstance(text, str):
                    text = [text]  # Convert to a list

                # Initialise BPE tokenizer
                self.subword_tokenizer = t.Tokenizer(t.models.BPE(unk_token="<UNK>"))
                self.subword_tokenizer.pre_tokenizer = t.pre_tokenizers.ByteLevel(add_prefix_space=True)
                self.subword_tokenizer.decoder = t.decoders.ByteLevel()

                # Train the tokenizer on provided text
                trainer = t.trainers.BpeTrainer(vocab_size=2000,special_tokens=["<UNK>"])
                self.subword_tokenizer.train_from_iterator(text, trainer=trainer)
                vocab = self.subword_tokenizer.get_vocab()

            return vocab
    

    def encode(self, text):
        """Tokenize an input text based on the vocabulary learned when initialising the tokenizer."""

        if self.method is Tokenisation_method.CHARACTER:
            return [self.vocab[char] for char in text if char in self.vocab]
        
        elif self.method is Tokenisation_method.WORD:
            words = re.findall(r'\S+|\n', text)
            return [self.vocab[w] for w in words if w in self.vocab]
        
        elif self.method is Tokenisation_method.SUBWORD:
            return self.subword_tokenizer.encode(text).ids
            
        
    
    def decode(self, tokens):
        """Convert a list of integer tokens back into text."""

        if self.method is Tokenisation_method.CHARACTER:
            inv_vocab = {idx: char for char, idx in self.vocab.items()}
            return ''.join(inv_vocab[token] for token in tokens if token in inv_vocab)
        
        elif self.method is Tokenisation_method.WORD:
            inv_vocab = {idx: word for word, idx in self.vocab.items()}
            return ' '.join(inv_vocab[token] for token in tokens if token in inv_vocab)
        
        elif self.method is Tokenisation_method.SUBWORD:
            return self.subword_tokenizer.decode(tokens)

In [None]:
#TODO

tokenizer = 
input = 
tokens = 

print(f"Input text: {input}")
print(f"Encoded tokens: {tokens}")
print(f"Decoded text: {tokenizer.decode(tokens)}")
print(f"Vocabulary size: {len(tokenizer.vocab)}")
print(f"Vocabulary: {tokenizer.vocab}")


## Dataset preparation

In order to train an LSTM model for text prediction, the raw tokenised data must be preprocessed. This involves normalising the tokens, converting them into sequences, and preparing them as PyTorch tensors. 
Indeed, unlike non-sequential architectures, each sample fed to an LSTM is a sequence of multiple tokens.

In the code snippet below, the `Transform_Tokens` class is designed to handle these preprocessing steps. It includes methods to scale tokens using MinMax scaling, transform token sequences into training pairs, and convert them into PyTorch tensors.  

The key steps in this process are:  
- **Scaling the tokens:** The MinMaxScaler maps tokens to a range between 0 and 1 for better neural network performance.  
- **Creating input sequences:** The raw token list is converted into overlapping sequences of a fixed length, with the next token as the target.  
- **Converting sequences to tensors:** The sequences are reshaped to match the expected input format for PyTorch models.  
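
The first two steps can be sketched in plain Python on a toy token list. This is a simplified illustration under stated assumptions, not the `Transform_Tokens` class itself: `min_max_scale` and `make_sequences` are hypothetical helper names, and the scaling is written by hand instead of using `MinMaxScaler`.

```python
def min_max_scale(tokens):
    """Map integer tokens to floats in [0, 1]."""
    lowest, highest = min(tokens), max(tokens)
    span = (highest - lowest) or 1  # avoid dividing by zero when all tokens are equal
    return [(tok - lowest) / span for tok in tokens]


def make_sequences(tokens, sequence_length):
    """Build overlapping input windows, each paired with the token that follows it."""
    scaled = min_max_scale(tokens)
    features, targets = [], []
    for i in range(len(tokens) - sequence_length):
        features.append(scaled[i:i + sequence_length])  # scaled inputs
        targets.append(tokens[i + sequence_length])     # target stays an integer class index
    return features, targets


toy_tokens = [0, 4, 2, 8, 6, 4]
features, targets = make_sequences(toy_tokens, sequence_length=3)
for feat, targ in zip(features, targets):
    print(feat, "->", targ)
```

Note that only the inputs are scaled: the target is kept as an unscaled integer because it is used as a class index by the loss function.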

### Your job:
  - Transform the tokens into a valid LSTM input using Transform_Tokens.
  - Analyse the inputs and outputs of the tensor.
  - Print a few samples with their targets as tensor values.



In [None]:
import torch
from sklearn.preprocessing import MinMaxScaler
import numpy as np 
import math

class Transform_Tokens:
    def __init__(self, non_scaled_tokens):
        self.scaler = MinMaxScaler(feature_range=(0,1))
        self.scaler.fit(np.array(non_scaled_tokens).reshape(-1, 1))

    def transform_tokens(self,tokens, sequence_length, with_target=True):
        features, targets = self.token_to_sequence(tokens,sequence_length, with_target)
        features, targets = self.sequence_to_torch(features, targets)
        return features, targets
    
    def scale_tokens(self, unscaled_tokens):
        data = np.array(unscaled_tokens).reshape(-1, 1)
        return self.scaler.transform(data).flatten()

    def unscale_tokens(self, scaled_tokens):
        data = np.asarray(scaled_tokens).reshape(-1, 1)
        unscaled = self.scaler.inverse_transform(data).flatten()
        # Round to the nearest integer: math.ceil would turn 4.0000001 into 5
        unscaled = [int(round(token)) for token in unscaled]
        return unscaled

    def token_to_sequence(self,tokens, sequence_length=100, with_target=True):
        """prepare the dataset of input to output pairs and normalise the features"""
        features = []
        targets = []
        scaled_tokens = self.scale_tokens(tokens)
        num_sequences = len(tokens) - sequence_length
        if not with_target:
            num_sequences += 1
        for i in range(num_sequences):
            seq_in = scaled_tokens[i:i + sequence_length]
            features.append([tok for tok in seq_in])
            if with_target:
                seq_out = tokens[i + sequence_length]
                targets.append(seq_out)
        return features, targets

    def sequence_to_torch(self, sequences, targets,):
        # reshape X to be [batch size, time steps, features]
        seq_length = len(sequences[0])
        sequences = torch.tensor(sequences, dtype=torch.float32).reshape(len(sequences), seq_length, 1)
        targets = torch.tensor(targets)
        return sequences, targets
    


In [None]:
# TODO

sequence_length = 
transform = 

print(f"Input tokens: {tokens}\n")

features, targets = 

print(f"Scaled torch Feature -> Unscaled Torch Target")
for i in range(5):
    print(f"{features[i].flatten()} -> {targets[i]}")

print(f"\nUnscaled Feature -> Unscaled Target")
for i in range(5):
    unscaled_features = transform.unscale_tokens(features[i])
    print(f"{unscaled_features} -> {targets[i]}")

## Model building

Now that the dataset is ready, we can focus on creating the LSTM model as well as the training function.  

Below, the `LanguageModel` class implements an LSTM-based neural network for text prediction. This model takes sequences of tokenized text as input and predicts the next token. As with the CNN, you have to create your own torch Module and implement the different layers of your model, as well as the forward function.  

**Your job:**
- Complete the `LanguageModel` class below to create an LSTM model. Here are some tips for the architecture:
    - **LSTM layers:** The model consists of LSTM layers. LSTMs are effective for handling sequential data because they maintain long-term dependencies.  
    - **Dropout layer:** A dropout layer to help prevent overfitting by randomly dropping connections during training.  
    - **Linear layer:** A fully connected layer to map the LSTM outputs to the vocabulary size, producing logits for token prediction.  
    - **Activation function:** The ReLU activation function introduces non-linearity in the output.  

#### Forward Pass  
- The input passes through the LSTM layers.  
- Only the last output from the sequence is taken to make a prediction.  
- The dropout layer is applied to prevent overfitting.  
- The final output is produced through the linear layer and activated using an activation function (you can start with ReLU).  

*Note*: I want you to focus particularly on the output of the linear layer. 
- What is its output size? 
- What does it mean about the way we are predicting tokens?
- Don't hesitate to ask your professor if you have any doubt: I am here for that!


In [None]:
import torch.nn as nn
import torch.nn.functional as F


 
class LanguageModel(nn.Module):
    def __init__(self, n_vocab, sequence_length):
        super().__init__()
        self.sequence_length = sequence_length
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=2, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)
    
    def forward(self, x):
        x, _ = self.lstm(x)
        # take only the last output
        x = x[:, -1, :]
        # produce output
        x = self.linear(self.dropout(x))
        x = F.relu(x)
        return x

The `fit` function is responsible for training the LSTM model using the Adam optimizer and cross-entropy loss. It processes the dataset in batches and performs training for a specified number of epochs.
In short, this training function takes care of getting predictions out of your model and implements **backpropagation through time**.

#### Training Steps  
1. **Data Preparation:**  
   - The dataset is wrapped in a PyTorch `DataLoader`, which enables batch processing and shuffling.  
   - The loss function used is `CrossEntropyLoss`, which is standard for multi-class classification tasks.  

2. **Training Loop:**  
   - The model is set to training mode.  
   - For each batch:
     - Predictions are made using the forward pass.  
     - The loss between predictions and actual targets is calculated.  
     - Gradients are computed using backpropagation.  
     - The optimizer updates the model’s parameters.  

3. **Validation Step:**  
   - The model is switched to evaluation mode (`model.eval()`).  
   - The loss is computed without updating gradients.  
   - The best model (with the lowest validation loss) is saved.  

4. **Model Checkpointing:**  
   - The best-performing model state is saved as `"language_model.pth"`, ensuring the best version is available for future use.  

Together, these steps train the LSTM and keep the weights that reach the lowest loss on the validation set.  
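
To see what the cross-entropy loss measures, here is a minimal hand-written sketch for a batch of three samples. It follows the standard definition (the negative log of the softmax probability given to the correct class); the logits and targets are made-up values, and `cross_entropy` is a hypothetical helper, not PyTorch's implementation.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    max_logit = max(logits)  # subtracting the max keeps the exponentials stable
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[target]


# Made-up logits for a vocabulary of 3 tokens, and the index of the correct token
batch_logits = [[2.0, 0.5, 0.1], [0.2, 0.2, 3.0], [1.0, 1.0, 1.0]]
batch_targets = [0, 2, 1]

# A "sum" reduction adds up the loss of every sample in the batch...
summed = sum(cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets))
# ...so dividing by the number of samples gives a loss comparable across set sizes
per_sample = summed / len(batch_targets)

print(f"summed loss: {summed:.3f}, per-sample loss: {per_sample:.3f}")
```

The last sample has identical logits for every token, so its loss equals `log(3)`: this is the loss of a model that guesses uniformly over the vocabulary.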

In [None]:
import copy

import numpy as np
import torch.optim as optim
import torch.utils.data as data


def fit(train_set, val_set, n_epochs, batch_size, model, device):
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # reduction="sum" adds up the loss of every sample, so the epoch totals
    # are divided by the number of samples below to get a per-sample loss
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    
    feat_train, targ_train = train_set
    feat_val, targ_val = val_set
    loader_train = data.DataLoader(data.TensorDataset(feat_train, targ_train), shuffle=True, batch_size=batch_size)
    loader_val = data.DataLoader(data.TensorDataset(feat_val, targ_val), shuffle=False, batch_size=batch_size)
    losses = [[], []]  # [training losses, validation losses]
    best_model = None
    best_loss = np.inf
    for epoch in range(n_epochs):
        model.train()
        epoch_loss = 0.0
        for X_batch, y_batch in loader_train:
            y_pred = model(X_batch.to(device))
            loss = loss_fn(y_pred, y_batch.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        losses[0].append(epoch_loss / len(feat_train))
        # Validation
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for X_batch, y_batch in loader_val:
                y_pred = model(X_batch.to(device))
                val_loss += loss_fn(y_pred, y_batch.to(device)).item()
        if val_loss < best_loss:
            best_loss = val_loss
            # Copy the weights: state_dict() returns references to the live
            # tensors, which the next epochs would otherwise overwrite
            best_model = copy.deepcopy(model.state_dict())
        losses[1].append(val_loss / len(feat_val))
        print(f"Epoch {epoch+1}: Train loss: {losses[0][epoch]:.2f} / Val loss: {losses[1][epoch]:.2f}")
    # Note: this relies on the global `tokenizer` variable defined in the notebook
    torch.save([best_model, tokenizer], "language_model.pth")
    return losses


In [None]:
# Add GPU support
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(torch.cuda.is_available())
print(device)


## Training the model

All there is to do now is use all previously created classes and functions to train the language model on Alice in Wonderland!

**Your job**
- Set the parameters
    - sequence length
    - number of epochs
    - batch size
    - tokeniser method
- Initialise the variables
    - Set the input text
    - Initialise the tokeniser
    - Transform the input text into features and targets
    - Initialise the Language model
    - Fit the model
- Plot the result
    - The plotting function is given after.
    - Don't forget that the fit function returns a list of [loss train, loss validation]
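
The project outline mentions perplexity as an evaluation metric. It can be derived from a loss value, assuming that value is the average cross-entropy per predicted token, computed with the natural logarithm. The sketch below uses a hypothetical vocabulary size and is only meant to show the conversion.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the average cross-entropy per token."""
    return math.exp(mean_cross_entropy)


vocab_size = 50  # hypothetical vocabulary size, for illustration only

# A model guessing uniformly over the vocabulary has a loss of log(vocab_size),
# which corresponds to a perplexity equal to the vocabulary size
uniform_loss = math.log(vocab_size)
print(f"Uniform guess: loss {uniform_loss:.2f} -> perplexity {perplexity(uniform_loss):.1f}")

# A lower loss means the model hesitates between fewer tokens on average
print(f"Loss 1.50 -> perplexity {perplexity(1.5):.2f}")
```

A perplexity close to the vocabulary size therefore means the model has learned very little, while a perfect model would reach a perplexity of 1.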

In [None]:
#TODO

# Hyperparameters
sequence_length = 
n_epochs = 
batch_size = 
tokenizer_choice =
input = 
validation_ratio = 

# Initialisation
tokenizer = Tokenizer(input, tokenizer_choice)
tokens = tokenizer.encode(input) 
transform = Transform_Tokens(tokens)  # fit the scaler on the tokens produced by this tokeniser
features, targets = transform.transform_tokens(tokens, sequence_length)
model = LanguageModel(tokenizer.n_vocab, sequence_length).to(device)
train_slice = int(len(features)*(1-validation_ratio))
train_set = (features[:train_slice], targets[:train_slice])
val_set = (features[train_slice:], targets[train_slice:])

# Training
loss = fit(train_set, val_set, n_epochs, batch_size, model, device)


In [None]:
import matplotlib.pyplot as plt 

def plot_losses(train_loss, val_loss):
    plt.plot(train_loss, label='Training loss')
    plt.plot(val_loss, label='Validation loss')
    plt.xlabel('Epoch')
    plt.ylabel("Loss")
    plt.title("RNN training")
    plt.grid(True)
    plt.legend()
    plt.show()

plot_losses(loss[0],loss[1])


## Predict

The last step of this project!  
Now that you have all parts working fine, and a **perfectly** trained neural network, you can start to predict some text with your language model.

The predict function has already been written for you.   
This example takes part of the Alice in Wonderland text as a prompt and asks the model to predict what comes next, feeding each prediction back as input for the following one.  
Of course, you can use your own text as input and experiment!   
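
One thing to experiment with is the generation strategy. Always picking the token with the highest logit is a greedy strategy; a common alternative is to sample from the softmax distribution, with a temperature that controls randomness. The sketch below is written in plain Python with made-up logits, and `softmax`, `greedy` and `sample` are hypothetical helpers rather than part of this project's code.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    highest = max(scaled)  # subtracting the max keeps the exponentials stable
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Always pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature=1.0, rng=random):
    """Draw an index at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up model output for a vocabulary of 3 tokens
rng = random.Random(0)    # seeded generator so the draws are reproducible

print("greedy:", greedy(logits))
print("sampled:", [sample(logits, temperature=0.8, rng=rng) for _ in range(10)])
```

Greedy decoding tends to fall into repeating loops on long generations, whereas sampling introduces variety at the cost of occasional unlikely tokens.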

If the model is not giving satisfactory results: back to the drawing board!  

**Try to get the best language model possible**  

Good luck!


In [None]:
import torch

model_weights, tokenizer = torch.load("language_model.pth", weights_only=False)
model = LanguageModel(tokenizer.n_vocab, sequence_length).to(device)
model.load_state_dict(model_weights)

# Generate a prompt
input = raw_text[6991:10000]
tokens = tokenizer.encode(input)
tokens = tokens[0:model.sequence_length]

def predict(tokens, tokenizer, transform, model):
    model.eval()
    print(tokenizer.method)
    if(tokenizer.method in [Tokenisation_method.CHARACTER, Tokenisation_method.SUBWORD]):
        token_splitter = ''
    else:
        token_splitter = ' '
    print(f"Prompt: [{tokenizer.decode(tokens)}]\n")
    with torch.no_grad():
        for i in range(1000):
            # format input array of int into PyTorch tensor
            features, _ = transform.transform_tokens(tokens, model.sequence_length, with_target=False)
            # generate logits as output from the model
            prediction = model(features.to(device))
            # convert logits into one token (greedy choice: the highest logit)
            index = int(prediction.argmax())
            result = tokenizer.decode([index])
            print(result, end=token_splitter)
            # append the new token to the prompt and drop the oldest one for the next iteration
            tokens.append(index)
            tokens = tokens[1:]

    
predict(tokens ,tokenizer, transform, model)


# BONUS

Now that you have a perfect understanding of RNNs and their limitations, you can try your algorithm on other datasets!
Below are some examples that you can load with the Hugging Face `datasets` package.  
Each function returns a text string similar to this project's Alice in Wonderland input.  

**AG News**  
Description: Contains news articles categorized into four topics: World, Sports, Business, and Science/Technology.

**WikiText-2 & WikiText-103**  
Description: Extracted from Wikipedia, these datasets provide a large corpus for text prediction.

**Amazon Review Polarity**  
Description: Large-scale dataset of Amazon reviews labelled as positive or negative.

**Yelp Review Polarity**  
Description: Yelp reviews are categorised into positive and negative sentiments.


**Your job**  

- Train an LSTM on one of these datasets.
- Try to get the best possible model by varying the hyperparameters.
- Vary the sequence length and measure how it impacts the model.
- Try to get the longest prediction without breaking the model.
- Can the model predict prompts from other datasets?
- Conclude   

*Note*: If the training takes too long, don't hesitate to train on a smaller subset of the whole dataset.
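
One simple way to build such a subset is to keep only the beginning of the text string and cut it at the end of a line, so that no sentence is truncated midway. This is a minimal sketch; `take_subset` is a hypothetical helper and the toy corpus is only there for illustration.

```python
def take_subset(text, max_chars):
    """Return at most max_chars characters, cut at the last complete line."""
    if len(text) <= max_chars:
        return text
    subset = text[:max_chars]
    last_newline = subset.rfind("\n")
    # Fall back to a hard cut when the subset contains no usable line break
    return subset[:last_newline] if last_newline > 0 else subset


corpus = "first line\nsecond line\nthird line\n"
print(repr(take_subset(corpus, 25)))
```

The same function can be applied to the string returned by any of the loading functions before it is passed to the tokeniser.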

In [None]:
from datasets import load_dataset

def load_ag_news():
    """Loads the AG News dataset and returns its text as a single string."""
    dataset = load_dataset("ag_news", split="train")
    return "\n".join([f"{label} {text}" for label, text in zip(dataset["label"], dataset["text"])])

def load_wikitext2():
    """Loads the WikiText-2 dataset and returns its text as a single string."""
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    return "\n".join(dataset["text"])  

def load_wikitext103():
    """Loads the WikiText-103 dataset and returns its text as a single string."""
    dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    return "\n".join(dataset["text"])  

def load_amazon_review():
    """Loads the Amazon Review Polarity dataset and returns its text as a single string."""
    dataset = load_dataset("amazon_polarity", split="train")
    return "\n".join([f"{label} {title} {review}" for label, title, review in zip(dataset["label"], dataset["title"], dataset["content"])])

def load_yelp_review():
    """Loads the Yelp Review Polarity dataset and returns its text as a single string."""
    dataset = load_dataset("yelp_polarity", split="train")
    return "\n".join([f"{label} {text}" for label, text in zip(dataset["label"], dataset["text"])])

In [None]:
# Example Usage

print("\nAG News Sample:")
print(load_ag_news()[:500])

print("\nWikiText-2 Sample:")
print(load_wikitext2()[:500])

print("\nWikiText-103 Sample:")
print(load_wikitext103()[:500])

print("\nAmazon Review Sample:")
print(load_amazon_review()[:500])

print("\nYelp Review Sample:")
print(load_yelp_review()[:500])