# Introduction to the Ongoing Example

In this notebook you will get to know the example machine learning task we will consider for most of the exercises throughout the course: We will finetune the GPT-neo language model by EleutherAI on the Stanford IMDb movie review data set to obtain a model specialised in generating movie reviews.

Since both the model and the data set are available from huggingface.co, we will use the libraries provided by HuggingFace, which offer a slightly higher-level abstraction for training with PyTorch.

This notebook does not yet perform any training but demonstrates loading the model and allows you to perform inference, i.e., generating some text with it. It also loads the training data set for you to explore.

We begin by loading the required Python modules. Before doing so, we need to set an environment variable that points to a shared cache directory. `transformers` uses this directory when loading the model, so it does not have to download the same model repeatedly:

In [None]:
import os
os.environ["HF_HOME"] = "/scratch/project_465001363/hf-cache"
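
The order matters here: to the best of our knowledge, the HuggingFace libraries read `HF_HOME` when they are first imported, so the variable has to be set beforehand. A minimal sketch of a guard follows; the helper name `configure_hf_cache` is hypothetical and not part of any library.

```python
import os
import sys
import warnings


def configure_hf_cache(cache_dir):
    """Point HF_HOME at cache_dir and warn if it is probably too late.

    Hypothetical helper, for illustration only.
    """
    # If huggingface_hub is already imported, it has most likely read HF_HOME already.
    if "huggingface_hub" in sys.modules:
        warnings.warn(
            "huggingface_hub is already imported; restart the kernel "
            "for a changed HF_HOME to take effect."
        )
    os.environ["HF_HOME"] = cache_dir
    return os.environ["HF_HOME"]


# Usage (the path is the one used in this notebook):
# configure_hf_cache("/scratch/project_465001363/hf-cache")
```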

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

Next, we determine the device on which to run the model. Even though LUMI uses AMD MI250X GPUs, PyTorch still uses `cuda` to mean "GPU".
The following should print "Using device cuda", followed by the name of the device.
If this is not the case, then we have made a mistake in allocating resources for the job or in loading the proper software environment.

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(f"Using device {device}")
if device.type == 'cuda':
    print(f"Device name is {torch.cuda.get_device_name(device)}")

## Meet the Pre-Trained Base Model

Now we can load the actual model. We use the 1.3 billion parameter variant of the GPT-neo model, which takes about 5.4 GiB of VRAM in its native 32-bit float form. A single Graphics Compute Die (GCD, i.e., a GPU) on LUMI has 64 GiB of VRAM, so we do not need to worry about our memory footprint at this point. We also set up the tokenizer that the model was trained with.

In [None]:
pretrained_model = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(pretrained_model)
model.to(device)
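
As a rough cross-check of the memory figure above, the space taken by the weights alone is the number of parameters times the bytes per parameter. The sketch below uses the nominal 1.3 billion parameter count, so it is only an estimate. The memory actually in use on the GPU will be somewhat higher, since it also includes non-parameter buffers and runtime overhead. The function name is made up for illustration.

```python
def weights_memory_gib(num_params, bytes_per_param=4):
    """Estimate the memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


nominal_params = 1.3e9  # nominal size of the model variant, not an exact count

print(f"32-bit floats: {weights_memory_gib(nominal_params, 4):.1f} GiB")
print(f"16-bit floats: {weights_memory_gib(nominal_params, 2):.1f} GiB")
```

Once the model is loaded, `sum(p.numel() for p in model.parameters())` gives the exact parameter count to plug in instead of the nominal one.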

With the tokenizer and model set up and loaded to the GPU, we can now use the model to generate some text. Since we ultimately want to generate movie reviews (after finetuning), let's see how well the GPT-neo model does in generating reviews prior to finetuning.

In [None]:
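with torch.no_grad():  # inference only, so no gradients are needed
    prompt = "The movie 'How to run ML on LUMI - A documentation' was great because"
    # Tokenize the prompt and move the resulting tensors to the device
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    # Sample four continuations; max_length counts the prompt tokens as well
    outputs = model.generate(**inputs, do_sample=True, max_length=80, num_return_sequences=4)
    # Convert the generated token IDs back to text
    decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)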

    print('Sample generated reviews:')
    for i, txt in enumerate(decoded_outputs):
        print("#######################")
        print(f"{i+1}: {txt}")

These probably do not all look like movie reviews. Some may start off somewhat promisingly but then drift into something that looks more like a blog post or similar.
In the next exercises we will train the model on the IMDb data set to make it generate better movie reviews.

At this point, you can experiment with the text generation if you wish, for example by changing the input prompt. Text generation strategies are discussed here: https://huggingface.co/docs/transformers/generation_strategies

In particular, these parameters of `model.generate` might be interesting:

  - `max_new_tokens`: the maximum number of tokens to generate,
  - `num_beams`: activate beam search by setting this to a value > 1,
  - `do_sample`: activate multinomial sampling if set to `True`,
  - `num_return_sequences`: the number of candidate sentences to return (available only for beam search and sampling).
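
To make the options above concrete, here is a sketch of a few keyword-argument presets that could be passed to `model.generate`. The preset names and the specific values are arbitrary choices for experimentation, not recommendations from the course material.

```python
# Hypothetical presets for experimenting with decoding strategies.
generation_presets = {
    # Greedy decoding: always pick the most likely next token.
    "greedy": dict(max_new_tokens=60, do_sample=False, num_beams=1),
    # Beam search: track 4 candidate sequences and return the 4 highest-scoring ones.
    "beam": dict(max_new_tokens=60, do_sample=False, num_beams=4, num_return_sequences=4),
    # Multinomial sampling: draw each token at random from the model's distribution.
    "sampling": dict(max_new_tokens=60, do_sample=True, num_return_sequences=4),
}

for name, kwargs in generation_presets.items():
    # Without sampling, we cannot ask for more sequences than there are beams.
    if not kwargs["do_sample"]:
        assert kwargs.get("num_return_sequences", 1) <= kwargs["num_beams"]
    print(name, kwargs)

# Usage with the objects defined earlier in this notebook:
# outputs = model.generate(**inputs, **generation_presets["beam"])
```

Note that `max_new_tokens` limits only the generated continuation, whereas `max_length` also counts the tokens of the prompt.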

For a more detailed description of how to perform generation with different decoding methods / search strategies with the `transformers` module, you may want to read this blog post: https://huggingface.co/blog/how-to-generate

## Meet the Training Data

Finally, let us have a look at the training data. The Stanford IMDb movie review data set was primarily set up for sentiment analysis tasks and consists of 100'000 movie reviews, 50'000 of which are annotated with a sentiment label while the remainder are unlabelled ("unsupervised"). Of the labelled reviews, 25'000 are designated for testing.

The `datasets` module makes it easy to load data sets from huggingface.co. For our purposes we use both the labelled and the unlabelled training splits (`train` and `unsupervised`), which together contain 75'000 reviews.

Since the data set is relatively small (only a couple hundred MB), we can keep it entirely in memory and do not have to worry about file system I/O.

In [None]:
train_dataset = load_dataset("imdb", split="train+unsupervised", trust_remote_code=False, keep_in_memory=True)

Let's have a look at an example from the training data:

In [None]:
train_dataset[200]

We can see that each element has the review text as well as a sentiment label. We will ignore the label in the following exercises, as we are only interested in finetuning the model to generate texts that look like IMDb movie reviews.
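
To explore the data a little further, one might tally the labels and the review lengths. The sketch below works on any iterable of records with `text` and `label` fields, so it is shown here with a few made-up records; the helper name `summarise_reviews` is hypothetical. It assumes that reviews from the `unsupervised` split carry the placeholder label -1, which is worth verifying on the loaded data.

```python
from collections import Counter


def summarise_reviews(records):
    """Count the labels and compute the mean review length in characters."""
    label_counts = Counter()
    total_chars = 0
    num_records = 0
    for record in records:
        label_counts[record["label"]] += 1
        total_chars += len(record["text"])
        num_records += 1
    mean_chars = total_chars / num_records if num_records else 0.0
    return dict(label_counts), mean_chars


# Made-up records standing in for the real data set.
toy_records = [
    {"text": "Great film.", "label": 1},
    {"text": "Dull plot.", "label": 0},
    {"text": "No label here.", "label": -1},
]
print(summarise_reviews(toy_records))

# Usage with the real data (iterating over all reviews takes a moment):
# print(summarise_reviews(train_dataset))
```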