# IMDB movie review text generation

In this notebook, we'll fine-tune a GPT-2-like model to generate movie reviews from a prompt.

Partly based on this tutorial: https://github.com/omidiu/GPT-2-Fine-Tuning/blob/main/main.ipynb

First, the needed imports. 

In [4]:
%matplotlib inline

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchtext import datasets
import torchtext.transforms as T
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import os
from pprint import pprint

from datasets import load_dataset
from transformers import AutoTokenizer

print('Using PyTorch version:', torch.__version__)
if torch.cuda.is_available():
    print('Using GPU, device name:', torch.cuda.get_device_name(0))
    device = torch.device('cuda')
else:
    print('No GPU found, using CPU instead.') 
    device = torch.device('cpu')

Using PyTorch version: 2.1.2+rocm5.6
Using GPU, device name: AMD Instinct MI250X


## IMDB data set

Next we'll load the IMDB data set, this time using the [Hugging Face datasets library](https://huggingface.co/docs/datasets/index).

The dataset contains 100,000 movie reviews from the Internet Movie Database: 25,000 labeled reviews for training, 25,000 labeled reviews for testing, and 50,000 reviews without labels (unsupervised).

Since text generation does not need the labels, let's use the last 5,000 test reviews for evaluation and the remaining 95,000 reviews for training.

In [2]:
train_dataset = load_dataset("imdb", split="train+unsupervised+test[:20000]")
eval_dataset = load_dataset("imdb", split="test[20000:]")
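As a quick sanity check on the split arithmetic, the sketch below reproduces the sizes with plain Python. It assumes the split sizes quoted above and that the slice syntax in the split strings follows ordinary Python slicing; no data is downloaded.

```python
# Sizes of the three IMDB splits, as stated in the text above.
split_sizes = {"train": 25_000, "test": 25_000, "unsupervised": 50_000}

# "test[:20000]" keeps the first 20,000 test reviews and "test[20000:]" the
# rest; slicing a range of row indices behaves the same way.
test_rows = range(split_sizes["test"])
test_for_training = test_rows[:20_000]
test_for_eval = test_rows[20_000:]

n_train = split_sizes["train"] + split_sizes["unsupervised"] + len(test_for_training)
n_eval = len(test_for_eval)
print(n_train, n_eval)  # 95000 5000

# The two slices of the test split share no rows.
print(set(test_for_training).isdisjoint(test_for_eval))  # True
```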

Let's look at one sample from the dataset.

In [12]:
for b in train_dataset:
    pprint(b)
    break

{'label': 0,
 'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the '
         'controversy that surrounded it when it was first released in 1967. I '
         'also heard that at first it was seized by U.S. customs if it ever '
         'tried to enter this country, therefore being a fan of films '
         'considered "controversial" I really had to see this for myself.<br '
         '/><br />The plot is centered around a young Swedish drama student '
         'named Lena who wants to learn everything she can about life. In '
         'particular she wants to focus her attentions to making some sort of '
         'documentary on what the average Swede thought about certain '
         'political issues such as the Vietnam War and race issues in the '
         'United States. In between asking politicians and ordinary denizens '
         'of Stockholm about their opinions on politics, she has sex with her '
         'drama teacher, classmates, and married men.<br

We'll use the [distilgpt2 model](https://huggingface.co/distilgpt2). Let's start by loading the matching tokenizer.

In [6]:
pretrained_model = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
special_tokens = tokenizer.special_tokens_map
print(special_tokens)

{'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}


As can be seen above, the tokenizer has a special token that marks the end of text. Let's append it to each review before tokenizing.

In [7]:
def apply_transform(x):
    return tokenizer(x['text'] + special_tokens['eos_token'], truncation=True)

train_dataset_tok = train_dataset.map(apply_transform, remove_columns=['text', 'label'])
eval_dataset_tok = eval_dataset.map(apply_transform, remove_columns=['text', 'label'])

Map:   0%|          | 0/95000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]
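One caveat, sketched below with a toy tokenizer: because the end-of-text marker is appended to the raw text before truncation, any review longer than the truncation limit loses its marker. `toy_tokenize` is a hypothetical stand-in that splits on whitespace; it is not the Hugging Face tokenizer, and the limit of 8 tokens is chosen only for illustration.

```python
EOS = "<|endoftext|>"

def toy_tokenize(text, max_length):
    """Hypothetical stand-in for a tokenizer call with truncation=True:
    split on whitespace and keep at most max_length tokens."""
    tokens = text.replace(EOS, " " + EOS).split()
    return tokens[:max_length]

short_review = toy_tokenize("a fine film" + EOS, max_length=8)
long_review = toy_tokenize("word " * 20 + EOS, max_length=8)

print(short_review[-1] == EOS)  # True: the marker survives
print(long_review[-1] == EOS)   # False: the marker was truncated away
```

For this notebook the effect is small, since only the longest reviews are affected and the texts are concatenated into blocks afterwards anyway.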

Let's look at one sample from the training set.

In [13]:
for b in train_dataset_tok:
    pprint(b, compact=True)
    print('Length of input_ids:', len(b['input_ids']))
    break

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1,

Next, we'll group the tokenized text into fixed-length blocks for efficient processing.
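To make the idea concrete, here is a minimal sketch on toy data. The helper name `group_into_blocks` and the toy token ids are made up for illustration; the logic follows the same concatenate, trim, and split steps as the real function used in this notebook.

```python
def group_into_blocks(batch, block_size):
    # Concatenate the token lists of all examples, separately for each key.
    flat = {k: [tok for seq in seqs for tok in seq] for k, seqs in batch.items()}
    # Keep only as many tokens as fit into complete blocks.
    usable = (len(flat["input_ids"]) // block_size) * block_size
    # Cut each flat list into consecutive blocks of block_size tokens.
    blocks = {
        k: [v[i:i + block_size] for i in range(0, usable, block_size)]
        for k, v in flat.items()
    }
    # For causal language modelling, the labels are copies of the inputs.
    blocks["labels"] = [list(b) for b in blocks["input_ids"]]
    return blocks

# Three toy "reviews" with 3 + 4 + 3 = 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that block boundaries ignore review boundaries, and the two leftover tokens (9 and 10) are dropped because they do not fill a block.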




In [14]:
max_block_length = 128

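def divide_tokenized_text(tokenized_text_dict, block_size):
    # Concatenate all examples in the batch into one long list per key
    # (input_ids and attention_mask).
    concatenated_examples = {k: sum(tokenized_text_dict[k], []) for k in tokenized_text_dict.keys()}
    total_length = len(concatenated_examples[list(tokenized_text_dict.keys())[0]])
    # Drop the remainder at the end of the batch so that every block is full.
    total_length = (total_length // block_size) * block_size

    # Split each concatenated list into consecutive blocks of block_size tokens.
    result = {
        k: [t[i: i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }

    # For causal language modelling the labels are the inputs themselves;
    # the model shifts them by one position internally.
    result['labels'] = result['input_ids'].copy()
    return result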


train_dataset_batched = train_dataset_tok.map(
    lambda tokenized_text_dict: divide_tokenized_text(tokenized_text_dict, max_block_length),
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/95000 [00:00<?, ? examples/s]

In [17]:
eval_dataset_batched = eval_dataset_tok.map(
    lambda tokenized_text_dict: divide_tokenized_text(tokenized_text_dict, max_block_length),
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/5000 [00:00<?, ? examples/s]
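The `labels` column created by `divide_tokenized_text` is a plain copy of `input_ids`. That is enough because causal language models in the transformers library shift the labels by one position when computing the loss, so each position is trained to predict the token that follows it. The sketch below spells out the resulting (context, target) pairs for a four-token block; it only illustrates the shift and is not the library's implementation.

```python
# First four token ids of a block; labels are an unshifted copy.
block = [40, 26399, 314, 3001]
labels = list(block)

# At position i the model sees block[:i] and is trained to predict labels[i].
pairs = [(block[:i], labels[i]) for i in range(1, len(block))]
for context, target in pairs:
    print(context, '->', target)
# [40] -> 26399
# [40, 26399] -> 314
# [40, 26399, 314] -> 3001
```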

Again, let's look at what this produced. Every sample should now contain exactly `max_block_length` (128) tokens.

In [15]:
for b in train_dataset_batched:
    pprint(b, compact=True)
    print('Length of input_ids:', len(b['input_ids']))
    break

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [40, 26399, 314, 3001, 327, 47269, 20958, 12, 56, 23304, 3913,
               422, 616, 2008, 3650, 780, 286, 477, 262, 10386, 326, 11191, 340,
               618, 340, 373, 717, 2716, 287, 15904, 13, 314, 635, 2982, 326,
               379, 717, 340, 373, 12000, 416, 471, 13, 50, 13, 17112, 611, 340,
               1683, 3088, 284, 3802, 428, 1499, 11, 4361, 852, 257, 4336, 286,
               7328, 3177, 366, 3642, 46927, 1, 314, 1107, 550, 284, 766, 428,

In [None]:
#from transformers import DataCollatorForLanguageModeling
#tokenizer.pad_token = tokenizer.eos_token
#data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In [20]:
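from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained(pretrained_model)

# GPT-2 has no padding token, so register one with the tokenizer. All blocks
# have the same length, so it is never actually fed to the model; if it were,
# the embeddings would first need model.resize_token_embeddings(len(tokenizer)).
tokenizer.add_special_tokens({'pad_token': '[PAD]'})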

1

In [22]:
training_args = TrainingArguments(
    output_dir="gpt-imdb-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset_batched,
    eval_dataset=eval_dataset_batched,
    #data_collator=data_collator,
)

trainer.train()


Epoch,Training Loss,Validation Loss


RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 19776 vs 19668

In [None]:
trainer.save_state()

In [None]:
prompt = "This movie was great"


In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="gpt-imdb-model/checkpoint-500")
generator(prompt)