# Deep Learning with Python

Welcome to the **Deep Learning** course! This course is designed to give you hands-on experience with the foundational concepts and advanced techniques in deep learning. You will explore:

- Artificial Neural Networks and Gradient Descent
- Convolutional Neural Networks (CNNs) for Computer Vision
- Recurrent Neural Networks (RNNs) for Text Prediction
- Diffusion Transformers for Image Generation

Throughout the course, you'll engage in projects to solidify your understanding and gain practical skills in implementing deep learning algorithms.  

Instructor: Dr. Adrien Dorise  
Contact: adrien.dorise@hotmail.com  

---

## Part 4 - Transformer Model for Text Prediction

In this project, you will build a language model using a Transformer architecture in PyTorch. This exercise will guide you through the essential steps:

**1. Dataset preparation**  

- **Objective**: Convert raw text into numerical tokens suitable for a Transformer model.
- **Steps**:
   - Tokenization of the text
   - Creation of batch sequences
- **Note**:
   - This section uses the same functions as the RNN project.
   - A script containing all of these functions is therefore provided separately.

**2. Building a Transformer Model with PyTorch**  

- **Objective**: Understand the parameters of a Transformer model. The model is already written in the `model.py` file.
- **Steps**:
   - Analyse the given architecture
   - Understand the hyperparameters in relation to the course material
   - Initialise the transformer model  

**3. Training the Transformer Model**  

- **Objective**: Train the Transformer model using a suitable loss function and optimizer.
- **Steps**:
   - Defining a loss function for text prediction  
   - Optimizing the model using gradient descent  
   - Monitoring training progress and adjusting hyperparameters  

**4. Generating Text with the Transformer Model**  

- **Objective**: Use the trained model to generate new text sequences.
- **Steps**:
   - Implementing a prediction function for text generation  
   - Using temperature scaling and top-k sampling to control randomness  
   - Evaluating model output for coherence and fluency  

**5. Comparing with a Pretrained Transformer**  

- **Objective**: Import a GPT-2 model and generate predictions.
- **Steps**:
   - Import a GPT-2 model with its tokenizer
   - Tokenize an input prompt
   - Compare the output with that of your own LLM  

By completing this exercise, you will gain hands-on experience in implementing Transformers for text prediction while understanding key challenges like attention mechanisms, sequence modeling, and text generation strategies.

*Credits*: This project is inspired by research on Transformer-based language models and builds upon concepts introduced in the "Attention Is All You Need" paper by Vaswani et al. (2017).  
The transformer script was inspired by `https://medium.com/towards-data-science/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb`



## Dataset preparation


In this exercise, we will use the text of **Alice in Wonderland** by Lewis Carroll.  
The data can be found in the text file *alice_in_wonderland.txt*, located in the same folder.  

The book was downloaded from this website:
`https://www.gutenberg.org/ebooks/11`

To simplify the prediction, the whole dataset is converted to lowercase.

The dataset is prepared in the same way as in the RNN project. The functions are therefore written in a separate file called `data_preprocessing.py`. 
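
To make the idea of "input shifted by one as target" concrete, here is a minimal sketch on a toy sentence. The helper `make_shifted_sequences`, the toy vocabulary, and the stride of one token are assumptions made for illustration; the actual `Transform_Tokens.transform_tokens` in `data_preprocessing.py` may be organised differently.

```python
import torch


def make_shifted_sequences(tokens, sequence_length):
    """Slice a list of token ids into (feature, target) windows.

    Each target window is the feature window shifted one token to the right.
    Hypothetical helper, not the function from data_preprocessing.py.
    """
    features, targets = [], []
    for start in range(len(tokens) - sequence_length):
        features.append(tokens[start:start + sequence_length])
        targets.append(tokens[start + 1:start + sequence_length + 1])
    return (torch.tensor(features, dtype=torch.long),
            torch.tensor(targets, dtype=torch.long))


# Toy word-level vocabulary built from a short sentence
words = "alice was beginning to get very tired".split()
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}
token_ids = [vocab[word] for word in words]

toy_features, toy_targets = make_shifted_sequences(token_ids, sequence_length=3)
print(toy_features.shape, toy_targets.shape)
for feature, target in zip(toy_features, toy_targets):
    print(f"{feature.tolist()} -> {target.tolist()}")
```

Every target row repeats its feature row from the second token onward and adds one new token at the end, which is what lets the model learn a next-token prediction at every position of the sequence.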

In [None]:
from data_preprocessing import Tokenizer, Transform_Tokens, Tokenisation_method

# Dataset parameters
sequence_length = 10
validation_ratio = 0.1
tokenization_method = Tokenisation_method.WORD


# Import the txt file into a Python variable
filename = "alice_in_wonderland.txt"
with open(filename, 'r', encoding='utf-8') as file:
    raw_text = file.read()
raw_text = raw_text.lower()  # Convert to lower case

# Unlike the RNN, the transformer takes the whole input sequence shifted by one token as its target
tokenizer = Tokenizer(raw_text, tokenization_method)
train_slice = int(len(raw_text)*(1-validation_ratio))
tokens_train = tokenizer.encode(raw_text[:train_slice])
tokens_val = tokenizer.encode(raw_text[train_slice:])
transform = Transform_Tokens(tokens_train)
features_train,targets_train = transform.transform_tokens(tokens_train,sequence_length)
features_val,targets_val = transform.transform_tokens(tokens_val,sequence_length)
train_set = (features_train, targets_train)
val_set = (features_val, targets_val)


print("Unscaled torch feature train -> Unscaled torch target train")
print(f"Feature shape: {features_train.shape}")
print(f"Target shape: {targets_train.shape}")
for i in range(5):
    print(f"{features_train[i]} -> {targets_train[i]}")


## Implementing Multi-Head Attention

Multi-head attention allows the model to focus on different parts of the input sequence simultaneously.
In this project, you will compare a model created from scratch with the pre-trained GPT-2 model.
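
As a rough illustration of what "several heads looking at the sequence at once" means, the sketch below implements scaled dot-product attention and the reshaping that splits an embedding into heads. It is a simplified version written for this notebook, assuming no learned projection layers; the function names `scaled_dot_product_attention` and `split_heads` are illustrative and may not match those used in `model.py`.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V independently for every head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 receive zero attention weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights


def split_heads(x, n_heads):
    """(batch, seq, d_model) -> (batch, n_heads, seq, d_model // n_heads)."""
    batch, seq, d_model = x.shape
    return x.view(batch, seq, n_heads, d_model // n_heads).transpose(1, 2)


torch.manual_seed(0)
toy_x = torch.randn(2, 5, 16)               # 2 sequences, 5 tokens, d_model = 16
q = k = v = split_heads(toy_x, n_heads=4)   # self-attention, projections omitted
causal_mask = torch.tril(torch.ones(5, 5))  # a token sees itself and earlier tokens
attn_out, attn_weights = scaled_dot_product_attention(q, k, v, causal_mask)
print(attn_out.shape, attn_weights.shape)
```

Each of the four heads works on its own 4-dimensional slice of the embedding and produces its own 5 x 5 weight matrix, so different heads can attend to different tokens for the same position.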

<img src="../docs/transformers.jpg" alt="transformers" width="800"/>  

The PyTorch implementation provided with this project contains all the blocks described in the course. Don't hesitate to review the figures in the course if you are lost.
The implementation steps are:

- Implement the attention mechanism
- Create the Linear layer
- Implement positional encoding
- Create the encoder part
- Create the decoder part
- Assemble the whole transformer
    - Embedding vectors are created in the transformer class.

Because the architecture is sizeable, the module was split into a separate file.  
The whole transformer can therefore be found in `model.py`.
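
For the positional encoding step, a minimal sketch is given here, assuming the sinusoidal scheme of "Attention Is All You Need" and an even embedding dimension. The function name `sinusoidal_positional_encoding` is illustrative, and the version in `model.py` may differ in its details.

```python
import math

import torch


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of fixed position vectors (d_model even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    table[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return table


pe_table = sinusoidal_positional_encoding(max_len=10, d_model=8)
toy_embeddings = torch.zeros(2, 10, 8)            # stand-in for token embeddings
encoded = toy_embeddings + pe_table.unsqueeze(0)  # broadcast over the batch
print(encoded.shape)
```

The table is added to the token embeddings so that two identical words at different positions no longer produce identical inputs to the attention layers.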

Below is the training script that performs backpropagation on a transformer using PyTorch.

**Your Job:**
- Understand how the transformer is implemented in PyTorch
- Understand the training function
- Initialise the model, by adjusting the hyperparameters accordingly
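
To help read the loss computation in the training script, the sketch below shows why the model output is flattened before being passed to `nn.CrossEntropyLoss`. The tensors are random stand-ins and all `toy_` names are made up for this example; only the reshaping pattern mirrors the training function.

```python
import torch
import torch.nn as nn

toy_batch, toy_seq, toy_vocab = 2, 4, 7
torch.manual_seed(0)
toy_logits = torch.randn(toy_batch, toy_seq, toy_vocab)           # stand-in for the model output
toy_next_ids = torch.randint(0, toy_vocab, (toy_batch, toy_seq))  # stand-in for next-token ids

# CrossEntropyLoss expects (N, n_classes) logits and (N,) class indices,
# so batch and sequence dimensions are merged into a single axis
toy_loss_fn = nn.CrossEntropyLoss(reduction="mean")
flat_loss = toy_loss_fn(toy_logits.view(-1, toy_vocab), toy_next_ids.view(-1))

# Same value by hand: mean negative log-probability of the correct token
log_probs = torch.log_softmax(toy_logits, dim=-1)
manual_loss = -log_probs.gather(-1, toy_next_ids.unsqueeze(-1)).mean()
print(flat_loss.item(), manual_loss.item())
```

Both numbers agree, which shows that the loss is simply the average penalty over every token position in the batch.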



In [None]:
import copy

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data


def fit(train_set, val_set, n_epochs, batch_size, vocab_size, model, device):
    optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
    loss_fn = nn.CrossEntropyLoss(reduction="mean")

    feat_train, targ_train = train_set
    feat_val, targ_val = val_set
    loader_train = data.DataLoader(data.TensorDataset(feat_train, targ_train), shuffle=False, batch_size=batch_size)
    loader_val = data.DataLoader(data.TensorDataset(feat_val, targ_val), shuffle=False, batch_size=batch_size)
    losses = [[], []]  # losses[0]: training loss, losses[1]: validation loss
    best_model = None
    best_loss = np.inf
    for epoch in range(n_epochs):
        model.train()
        epoch_loss = 0.0
        for X_batch, y_batch in loader_train:
            # The decoder receives the target without its last token...
            y_pred = model(X_batch.to(device), y_batch[:, :-1].to(device))
            # ...and is trained to predict the target without its first token
            loss = loss_fn(y_pred.contiguous().view(-1, vocab_size), y_batch[:, 1:].contiguous().view(-1).to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # loss_fn already averages within a batch, so average over the number of batches
        losses[0].append(epoch_loss / len(loader_train))
        # Validation
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for X_batch, y_batch in loader_val:
                y_pred = model(X_batch.to(device), y_batch[:, :-1].to(device))
                val_loss += loss_fn(y_pred.contiguous().view(-1, vocab_size), y_batch[:, 1:].contiguous().view(-1).to(device)).item()
        val_loss /= len(loader_val)
        if val_loss < best_loss:
            best_loss = val_loss
            # Deep copy: state_dict() returns references that later updates would overwrite
            best_model = copy.deepcopy(model.state_dict())
        losses[1].append(val_loss)
        print(f"Epoch {epoch+1}: Train loss: {losses[0][epoch]:.4f} / Val loss: {losses[1][epoch]:.4f}")
    # `tokenizer` is the global variable created in the dataset preparation cell
    torch.save([best_model, tokenizer], "llm.pth")
    return losses



In [None]:
from model import Transformer

#TODO: Initialise the transformer model

# Model hyperparameters
vocab_size = 
embedding_dim =
n_attention_heads = 
n_layers = 
linear_dim = 
dropout = 

transformer = Transformer(vocab_size, vocab_size, embedding_dim, n_attention_heads, n_layers, linear_dim, sequence_length, dropout)


## Training the model

Now that your dataset is ready and your transformer is initialised, all that is left to do is train your own custom LLM!  
The code snippets below start the training of the model.  
Note that depending on your input data, it can take a while, especially if you are not using a GPU.  

Don't hesitate to use only a subset of the whole text as input if training takes too long.

**Your job**:
- Train the LLM
- Plot the loss curves
- Make a prediction on an input prompt
- Improve the model!

In [None]:
# Add GPU support
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(torch.cuda.is_available())
print(device)
transformer = transformer.to(device)

In [None]:
#TODO

# Training parameters
batch_size = 
n_epochs = 

# Training
loss = fit(train_set, val_set, n_epochs, batch_size, vocab_size, transformer, device)


In [None]:
import matplotlib.pyplot as plt 

def plot_losses(train_loss, val_loss):
    plt.plot(train_loss, label='Training loss')
    plt.plot(val_loss, label='Validation loss')
    plt.xlabel('Epoch')
    plt.ylabel("Loss")
    plt.title("LLM training")
    plt.grid(True)
    plt.legend()
    plt.show()

plot_losses(loss[0],loss[1])



In [None]:
def predict(tokens, tokenizer, model):
    model.eval()
    device = next(model.parameters()).device  # run on whichever device holds the model
    print(tokenizer.method)
    if tokenizer.method in [Tokenisation_method.CHARACTER, Tokenisation_method.SUBWORD]:
        token_splitter = ''
    else:
        token_splitter = ' '
    print(f"Prompt: [{tokenizer.decode([int(tok) for tok in tokens[0]])}]\n")
    with torch.no_grad():
        for _ in range(500):
            # Greedy decoding: keep the most likely token at every position
            prediction = model(tokens.to(device), tokens.to(device))
            prediction = prediction.view(-1, tokenizer.n_vocab).argmax(1)
            result = tokenizer.decode([int(prediction[-1])])
            print(result, end=token_splitter)

            # Append the new token to the prompt and drop the oldest one for the next iteration
            tokens = torch.cat((tokens, prediction[-1:].unsqueeze(0).cpu()), 1)
            tokens = tokens[:, 1:]


predict(features_train[0:1, :], tokenizer, transformer)
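
The prediction function above always keeps the most likely token (greedy decoding). The project outline also mentions temperature scaling and top-k sampling; a minimal sketch of both is given here. The helper `sample_next_token` and its default values are assumptions made for illustration, not part of the provided scripts.

```python
import torch


def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token id from a 1-D tensor of logits.

    temperature < 1 sharpens the distribution, temperature > 1 flattens it.
    top_k keeps only the k most likely tokens before sampling.
    """
    scaled = logits / temperature
    if top_k is not None:
        top_values, top_indices = torch.topk(scaled, k=min(top_k, scaled.size(-1)))
        probs = torch.softmax(top_values, dim=-1)
        choice = torch.multinomial(probs, num_samples=1)
        return int(top_indices[choice])
    probs = torch.softmax(scaled, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))


torch.manual_seed(0)
toy_logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
samples = [sample_next_token(toy_logits, temperature=0.8, top_k=2) for _ in range(20)]
print(samples)
```

With `top_k=2` only the two highest-scoring tokens can ever be drawn, and `top_k=1` falls back to greedy decoding. One way to try this in `predict` would be to sample from the logits of the last position instead of calling `argmax`.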


## Compare with pre-trained LLM

By now, you should have a reasonably well-tuned model. It might not be perfect, but don't forget that your model is trained on only a very small amount of data.  
For comparison, GPT-3 has **175 billion parameters** and was trained on approximately **410 billion** tokens. Training it from scratch on a single GPU has been estimated to take about 355 years.  

The following code snippet demonstrates a GPT-2 implementation.  
Feel free to experiment with it, and compare it to your own model.
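
To put model sizes into perspective, the number of trainable parameters of any PyTorch module can be counted with a short helper. The sketch below uses a tiny stand-in network; `count_parameters` and `toy_model` are illustrative names, and the same call could be applied to your own transformer or to a loaded GPT-2 model.

```python
import torch.nn as nn


def count_parameters(model):
    """Number of trainable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Tiny stand-in model: an embedding table followed by a linear output layer
toy_model = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))
print(f"{count_parameters(toy_model):,} trainable parameters")
```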

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

#TODO

prompt = 
n_generated_tokens = 

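# Distinct names avoid overwriting the `tokenizer` created for your own model
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = gpt2_tokenizer(prompt, return_tensors="pt").input_ids

# max_new_tokens counts only the generated tokens, whereas max_length would also count the prompt
gen_tokens = gpt2_model.generate(input_ids, do_sample=True, temperature=0.9, max_new_tokens=n_generated_tokens, pad_token_id=gpt2_tokenizer.eos_token_id)

print(f"Prompt: {prompt}\n")
gen_text = gpt2_tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)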


# Bonus

This is the last bonus section of this course!  
In this one, you will have to compare the LSTM model and the transformer model.  
Find a dataset you like, and try to train both models on it. Be careful about training time though.  

Here are some comparison ideas:
- Training time
- Sequence length influence
- Tokenizer influence
- Correctness of answers
- Versatility (is it able to answer never-seen prompts?)
- Robustness to overfitting


# The END!

Congratulations!  
You have now completed this course about artificial intelligence!  
You should now have a good understanding of the basic principles of artificial intelligence, from its beginnings in the 1950s with the perceptron up to today with the transformer.  

This is a solid foundation on which to build. You are now well prepared to tackle new challenges in Deep Learning!