# GPT-2 Movie Dialogue Fine-Tuning: Training Summary

This notebook outlines the process of fine-tuning a GPT-2 model on a custom movie dialogue dataset for causal language modeling. The key steps involved are as follows:

1. **Model and Tokenizer Setup**:
   - Loaded a GPT-2 checkpoint from the local `gpt2-finetuned-movie-dialog` directory and set the tokenizer's padding token to its `eos_token`, since GPT-2 defines no padding token by default.

2. **Data Preparation**:
   - Each input and response sequence from the movie dialogue dataset was concatenated, with an EOS separator, into a single sequence. For causal language modeling, the labels are identical to the input IDs.
   - The dataset was split into training (95%) and validation (5%) sets using Hugging Face's `datasets` library.

3. **Training Process**:
   - Training ran for 10 epochs with a per-device batch size of 64. Each epoch was a separate one-epoch run of the Hugging Face `Trainer` class.
   - Evaluation and logging ran every 100 steps, so both training and validation loss were tracked.
   - Weight decay (0.01) was applied as a regularizer to limit overfitting. The logged validation loss nevertheless rose from about 7.3 to about 8.1 over the ten epochs while training loss fell.

4. **Model Saving and Testing**:
   - After each epoch, the fine-tuned model and tokenizer were saved.
   - The model was tested using a text generation pipeline to produce chatbot-like movie dialogues.

5. **Exporting the Model**:
   - Finally, the trained model was archived and saved to Google Drive for future use.

This notebook demonstrates the full process of fine-tuning a GPT-2 model for generating conversational responses in the context of movie dialogues.
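
The data format from step 2 can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper names `build_example` and `next_token_pairs` and the toy token IDs are hypothetical, and the explicit shift only mimics what a causal language model does internally when the labels equal the input IDs.

```python
EOS_ID = 50256  # GPT-2's end-of-text token ID, as seen in the printed example

def build_example(input_ids, response_ids, eos_id=EOS_ID):
    """Join prompt and response with an EOS separator; labels mirror the inputs."""
    combined = list(input_ids) + [eos_id] + list(response_ids)
    return {"input_ids": combined, "labels": list(combined)}

def next_token_pairs(example):
    """Pair each token with the token that follows it (its prediction target)."""
    ids, labels = example["input_ids"], example["labels"]
    return list(zip(ids[:-1], labels[1:]))

# Toy IDs (hypothetical): a two-token prompt and a two-token response
example = build_example([11, 12], [21, 22])
print(example["input_ids"])
print(next_token_pairs(example))
```

Each pair reads as "given this token (and everything before it), predict the next one", which is why identical `input_ids` and `labels` are sufficient for this objective.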

In [None]:
!pip install datasets

In [42]:
import json
from datasets import Dataset
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
import random

# Load dataset
with open("preprocessed_data.json", "r") as f:
    data = json.load(f)

# Load the tokenizer and model from the local fine-tuned checkpoint directory
tokenizer = AutoTokenizer.from_pretrained("./gpt2-finetuned-movie-dialog")
model = AutoModelForCausalLM.from_pretrained("./gpt2-finetuned-movie-dialog")

# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token  # Use eos_token as pad_token for GPT-2

# Prepare the dataset in the correct format
formatted_data = []
for conversation in data:
    input_ids = conversation[0]
    response_ids = conversation[1]

    # Concatenate input and response, separated by the EOS token
    combined_ids = input_ids + [tokenizer.eos_token_id] + response_ids

    formatted_data.append({
        "input_ids": combined_ids,
        "labels": combined_ids  # Same as input_ids; the model shifts labels by one position internally
    })
# Randomly subsample 20,100 examples (random.sample raises ValueError if fewer are available)
formatted_data = random.sample(formatted_data, 20100)

# Convert to a Hugging Face Dataset
dataset = Dataset.from_list(formatted_data)

# Split the dataset into training and validation sets
dataset_split = dataset.train_test_split(test_size=0.05)  # 95% train, 5% validation

# Separate the train and validation datasets
train_dataset = dataset_split["train"]
val_dataset = dataset_split["test"]

# Define the data collator for causal LM: it pads each batch and rebuilds labels
# from input_ids, setting pad-token positions to -100 so the loss ignores them
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Not using masked language modeling (MLM) since it's causal LM
)

# Verify the format of the dataset
print(train_dataset[0])

{'input_ids': [39532, 1820, 256, 6475, 271, 25147, 25669, 1263, 640, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 10919, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256], 'labels': [39532, 1820, 256, 6475, 271, 25147, 25669, 1263, 640, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 10919, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]}
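
Because the padding token shares its ID with `eos_token` (50256), the label masking that `DataCollatorForLanguageModeling` applies with `mlm=False` affects every 50256 in a sequence, including the separator between input and response. The sketch below imitates that masking in plain Python on a shortened version of the printed example. `mask_pad_labels` is a hypothetical helper, not a library function.

```python
PAD_ID = 50256       # pad_token was set to eos_token, so both share this ID
IGNORE_INDEX = -100  # label value that the loss skips

def mask_pad_labels(input_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing every pad-token ID with ignore_index."""
    return [ignore_index if tok == pad_id else tok for tok in input_ids]

# Shortened version of the printed example: prompt, padding, one response token, padding
input_ids = [39532, 1820, 256, 50256, 50256, 10919, 50256, 50256]
labels = mask_pad_labels(input_ids)
print(labels)

# Count the positions that still carry a real token ID as their label
kept = sum(label != IGNORE_INDEX for label in labels)
print(kept, "of", len(labels), "positions keep a real label")
```

One consequence worth checking on real data is that the model is never trained to emit the EOS token, which may affect where generation stops.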


In [43]:
import shutil
from google.colab import drive
from transformers import pipeline

drive.mount('/content/drive')

for epoch in range(10):
  # Build fresh TrainingArguments and a fresh Trainer for each one-epoch run;
  # the optimizer state and learning-rate schedule therefore restart every epoch
  training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=1,
    weight_decay=0.01,
    eval_strategy="steps",  # Evaluate at a fixed step interval
    eval_steps=100,  # Evaluate every 100 steps
    logging_steps=100,  # Log the training loss every 100 steps
    load_best_model_at_end=True,  # Reload the best saved checkpoint when training ends
  )

  trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
  )

  # Start training
  trainer.train()

  model.save_pretrained("./gpt2-finetuned-movie-dialog")
  tokenizer.save_pretrained("./gpt2-finetuned-movie-dialog")

  print(f"Epoch: {epoch} completed")

  # Load fine-tuned model
  generator = pipeline("text-generation", model="./gpt2-finetuned-movie-dialog", tokenizer=tokenizer)

  # Test the model
  response = generator(
      "You are a friendly chatbot specializing in movie dialog. \nUser: What's your favorite movie? \nChatBot: ",
      max_length=100,
      num_return_sequences=1,
      min_new_tokens=5,
      no_repeat_ngram_size=2,  # Avoid repetition
  )
  print(response[0]['generated_text'])

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Step,Training Loss,Validation Loss
100,0.3097,7.61192
200,0.68,7.28461


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch: 0 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot:  Favorite movie Favorite bot favorite bot favorites Favorite movies with nick doofus favorite movies Favorite animal movies Favorites favorite sport moviesFavorite food moviesFavoritesfavorite book Favorite music Movies Favorite tv shows Favorite dance Rena favorite music Favorite painting fave song Favorite scene favorite scene with your best friend favorite food scene Favorite moment favorite idea favorite moment with kristin Favorite foodscene favorite


Step,Training Loss,Validation Loss
100,0.2474,7.685878
200,0.541,7.529398


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 1 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot:  Favorite movie Favorite bot favorite music Favorite food Favorite animal Favorite city Favorite actor Favorite actress Favorite musician Favorite bakery Favorite national geographic television station Favorite science fiction novel Favorite political cartoon Favorite spoken word Favorite joke Favorite playbook Favorite dreamy still painting Favorite quote favorite playbooks Favorite poetry Favorite narrative clip favorite idea of his favorite director Favorite song and the bot will answer with a movie from your


Step,Training Loss,Validation Loss
100,0.2034,7.760163
200,0.4353,7.72851


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 2 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot:  Favorite movie Favorite food Favorite drink Favorite song Favorite artistii Favorite book i Favorite musici favorite travelbumperi favored favorite cookingi favorites hobbiesifavorite political statementi favor civil rightsand feminist theories over Nazarene sol why dont you answer my questionand vote for your least favorite film Favorite ice cream brand Favorite bakery what do you prefer your birthday gift Favorite political


Step,Training Loss,Validation Loss
100,0.1773,7.867425
200,0.3643,7.840002


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 3 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot: ive seen information_technology angstrom million times with people of all ages and every age group can learn something from you just like your bot can just think out angstrofic from your answers when you type out movie information volition aid you to have a better time astatine the party and evening if you dont mind the inaccuracy iodine accept your request to continue


Step,Training Loss,Validation Loss
100,0.1554,7.930505
200,0.3112,7.905749


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 4 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot: ive seen information_technology angstrom million times with people of all ages and every walks ang� Strait of Gibraltar manner past with cup holders in both hands and anglers with no hair on theirshins and cups of coffee all over the table youre just ang room right summoner and information isnt excessively loud youve get angroom by yourself iodine dont think thats


Step,Training Loss,Validation Loss
100,0.1371,7.990967
200,0.2706,7.972692


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 5 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot: ive get to change out of my raggedy old uni tshirt and get nikes for old_age im gonna need some style maam youre the guy for me im try to brand new hair looks good with these new details look at this look astatine this tux this isnt look dinky the kind of girl World_Health_Organ


Step,Training Loss,Validation Loss
100,0.118,8.021007
200,0.2418,8.015536


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 6 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot: ive get to change out of my raggedy old uni tshirt and get nikes for dear life so im gonna need a better image when im done um my lookso whatever it is youre angstrom chat bot youd do better on image no sooner make a movie than four old_age after that theres a conversation to carry out about the movie


Step,Training Loss,Validation Loss
100,0.1018,8.106676
200,0.2186,8.098091


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 7 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot: ive get to change out of my uniform and im not into uniform for user reasons id rather not have to but i digress avant-garde jabba the idea of an intelligent life wholly governed by fear i love the fear of the dark in the movies the sense of community that comes with it the dread in general the light side of which i absolutely love a


Step,Training Loss,Validation Loss
100,0.0889,8.11289
200,0.2016,8.127602


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 8 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot: ive get to change out of my uniform im sick of being lectured like im a dork for not knowing when to leave the room and im ready to kick your arse in the RookieGREETing from the comfort of your computer youd beryllium my best friend if you didnt have to sit through those plebian crap to get there and privation


Step,Training Loss,Validation Loss
100,0.0775,8.162188
200,0.1867,8.136312


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Epoch: 9 completed
You are a friendly chatbot specializing in movie dialog. 
User: What's your favorite movie? 
ChatBot: ive get to change out of my r coat and uggottysu uoughayu for uhmmy favoriteu is indylfrightenenewhey ill watch it with you in a while i got a batch of movies to put on the docket you gotta show me something else once in awhile ill pay you off with a movie once a month
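
The logs above show training loss falling while validation loss rises across the ten runs, which is the usual signature of overfitting. A minimal sketch for quantifying the trend follows. It uses the step-200 validation losses copied from the tables above and assumes the reported loss is mean cross-entropy in nats per token, so that perplexity is its exponential.

```python
import math

# Validation loss at step 200 of each one-epoch run, copied from the logs above
val_loss = [7.28461, 7.529398, 7.72851, 7.840002, 7.905749,
            7.972692, 8.015536, 8.098091, 8.127602, 8.136312]

# Perplexity is the exponential of the mean cross-entropy loss
perplexity = [math.exp(loss) for loss in val_loss]

# The run with the lowest validation loss is the best stopping point in this log
best_run = min(range(len(val_loss)), key=val_loss.__getitem__)

for run, (loss, ppl) in enumerate(zip(val_loss, perplexity)):
    marker = " <- lowest" if run == best_run else ""
    print(f"run {run}: val_loss={loss:.3f} perplexity={ppl:,.0f}{marker}")
```

On these numbers the lowest validation loss belongs to the first run, which suggests that the additional epochs mostly fit the training subset more closely.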


In [44]:
import shutil
from google.colab import drive
drive.mount('/content/drive')
# Zip the fine-tuned model directory, then copy the archive to Google Drive
shutil.make_archive('gpt2-finetuned-movie-dialog', 'zip', './gpt2-finetuned-movie-dialog')
!cp ./gpt2-finetuned-movie-dialog.zip /content/drive/MyDrive/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
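
Before relying on the copied archive, it can be worth checking that the zip opens cleanly and contains the expected files. A minimal sketch follows, using a temporary directory with placeholder files so that it runs anywhere. The file names and the helper `archive_and_verify` are illustrative assumptions, not the notebook's actual checkpoint contents.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def archive_and_verify(model_dir, archive_base):
    """Zip model_dir, then confirm the archive is readable and list its members."""
    archive_path = shutil.make_archive(str(archive_base), "zip", root_dir=str(model_dir))
    with zipfile.ZipFile(archive_path) as zf:
        corrupt = zf.testzip()  # None when every member passes its CRC check
        if corrupt is not None:
            raise RuntimeError(f"Corrupt member in archive: {corrupt}")
        names = sorted(zf.namelist())
    return archive_path, names

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    model_dir.mkdir()
    # Placeholder files standing in for a saved model and tokenizer
    (model_dir / "config.json").write_text("{}")
    (model_dir / "weights.bin").write_bytes(b"\x00" * 16)

    archive_path, names = archive_and_verify(model_dir, Path(tmp) / "model_archive")
    print(Path(archive_path).name, names)
```

The archive is written outside the directory being zipped, which avoids the archive trying to include itself.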
