# NPC Dialogue Generator - Project README

### **Overview**

This project is an NPC dialogue generator: given character information, it produces four dialogue lines (a greeting, a parting line, an idle line, and a quest hook).

The core of the project involves fine-tuning the google/gemma-2b-it model using LoRA (Low-Rank Adaptation) on a custom, generated dataset. This fine-tuning specializes the model for the specific task of generating a 4-key JSON object (greeting, parting, idle, quest_hook) that aligns with a character's input persona.
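
To give a rough sense of why LoRA keeps fine-tuning cheap, the sketch below counts the trainable parameters that a rank-`r` adapter adds to a single weight matrix. The rank `r=16` matches the notebook's `LoraConfig`; the 2048 x 2048 layer size and the helper name `lora_param_count` are illustrative assumptions, not values read from the model.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


# Illustrative layer size (an assumption, not read from the Gemma config).
d_out, d_in, r = 2048, 2048, 16

full = d_out * d_in                       # parameters in the frozen weight matrix
added = lora_param_count(d_out, d_in, r)  # parameters the adapter actually trains

print(f"frozen: {full}, trainable: {added}, ratio: {added / full:.2%}")
```

For the real model, `finetune_model.print_trainable_parameters()` in Cell 4 reports the actual totals.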

Because this project is part of the product prototype track, no base research code was taken from external repositories.

### **Dataset**
The datasets used in this prototype were generated using Google Gemini.

GitHub link: https://github.com/czq4smkpyv-source/NPC-Dialogue-Generator-Datasets.git

- Training Dataset: training_characters.json
- Testing Dataset: characters.json

### **Dependencies**

The project was developed in Google Colab and requires the following Python libraries:

- transformers
- torch
- accelerate
- bitsandbytes
- datasets
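- huggingface_hub
- peft
- matplotlib

The notebook also imports `json`, `random`, and `re`, which are part of Python's standard library and need no installation.

Most of the third-party libraries are installed by the `!pip install` commands in the first code cell of the notebook; `huggingface_hub` and `matplotlib` are not listed there and are expected to be present in Colab already.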
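
As a quick sanity check before running the rest of the notebook, a sketch like the following can report which packages are not importable in the current environment. The helper `missing_packages` and the `REQUIRED` list are assumptions written for this README; they are not part of the notebook.

```python
import importlib.util

# Import names of the third-party packages the notebook uses.
REQUIRED = ["transformers", "torch", "accelerate", "bitsandbytes",
            "datasets", "huggingface_hub", "peft", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```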

### **Running Project**
1. Open the `NPC_Dialogue_Project.ipynb` notebook in Google Colab.
2. Add your Hugging Face access token as a Colab Secret named `Hugging_Face_Acess`.
3. Go to the Google Gemma 2B-IT model page (https://huggingface.co/google/gemma-2b-it) and ensure that you have agreed to the license terms.
4. Upload the training and testing datasets to Colab (see the GitHub link above).
5. (Optional) To generate dialogue for your own characters instead of the provided test set, save them in a `characters.json` file that follows the same format as the test dataset, with the entries listed under a top-level `characters` key. Each character entry uses these fields:
    - "name": ...
    - "job": ...
    - "location": ...
    - "world_type": ...
    - "personality_trait": ...
6. Run all the cells in order.
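
For step 5, a custom character file could be produced with a short script like the one below. The top-level `characters` key is what `format_character_sheets()` reads, and the field names come from the list above; the character itself and its values are made up for illustration.

```python
import json

# Hypothetical character; only the field names come from the steps above.
custom_characters = {
    "characters": [
        {
            "name": "Mira",
            "job": "Blacksmith",
            "location": "Harbor Town",
            "world_type": "Fantasy",
            "personality_trait": "Gruff",
        }
    ]
}

with open("characters.json", "w") as f:
    json.dump(custom_characters, f, indent=2)

# Reload the file to confirm it is valid JSON with the expected structure.
with open("characters.json") as f:
    loaded = json.load(f)

print(len(loaded["characters"]), "character(s) written")
```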

### **Student-Written Code**

**Data Preprocessing Pipeline (Cell 3)**

The data preprocessing section was student-written with the assistance of AI.

Helper functions:
- format_character_sheets(): This function was completely student-written. It loads a JSON file and returns its list of character dictionaries. It is used for both the training data and the testing data.

- format_direct_mapping_example(): This function was written with the help of AI. It formats the training data, separating the character information input from the dialogue lines for loss calculations.

- tokenize_text(): This function was mostly written by AI, with some edits and corrections. It tokenizes the formatted text so that it can be sent to the model.

Main Body Script:
- The main body uses the helper functions and the Hugging Face `datasets` methods to produce a tokenized dataset that can be sent to the model. This was student-written with AI assistance.
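
In the notebook, `DataCollatorForLanguageModeling` builds `labels` from `input_ids`, so the prompt tokens contribute to the loss as well as the dialogue tokens. If the loss should cover only the dialogue portion, one common approach is to set the prompt positions in `labels` to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; `mask_prompt_labels` and the token IDs are hypothetical and are not part of the notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids to labels, hiding the first `prompt_length` positions from the loss."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Made-up token IDs: 4 prompt tokens followed by 3 dialogue tokens.
example_ids = [101, 7, 8, 9, 55, 56, 57]
labels = mask_prompt_labels(example_ids, prompt_length=4)
print(labels)
```

Using this in practice would also require knowing where the prompt ends in each tokenized example, for instance by tokenizing the prompt part on its own.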

**PyTorch Training Loop (Cell 5)**

Cell 5 is student-written code. It contains the standard PyTorch training loop (zero gradients, forward pass, backward pass, optimizer step), which updates the model. The preprocessed data is prepared for training with imported PyTorch and Transformers utilities (`DataLoader` and `DataCollatorForLanguageModeling`). The cell also plots the average loss per epoch.

**Inference Pipeline Logic (Cell 6)**

Helper functions:
- generate_dialogue(): This function is student-written. It defines the core inference pipeline: how the system prompt and character details are combined, how the model is called, and how the raw text output is cleaned and parsed.
- formatting_character_input(): This function is student-written. It formats a character dictionary into a single string to send to the model.

Main Body Script:
- The loop at the bottom of the cell that iterates through the test characters and calls the generation function is student-written.
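
The cleaning and parsing steps can be sketched as a single helper. The fence stripping and the trailing-comma regex mirror what Cell 6 does; the helper name `parse_dialogue`, the strict check on the four keys, and the sample text are assumptions added for illustration.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable
EXPECTED_KEYS = {"greeting", "parting", "idle", "quest_hook"}


def parse_dialogue(raw_text):
    """Return the parsed dialogue dict, or None if the text is not the expected JSON."""
    # Keep only what follows the last opening JSON fence and drop any closing fence.
    cleaned = raw_text.split(FENCE + "json")[-1].replace(FENCE, "").strip()
    # Remove trailing commas before a closing brace or bracket.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    # Extra check (not in the notebook): require exactly the four expected keys.
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        return None
    return data


# Made-up model output with a trailing comma, for illustration only.
sample = (
    FENCE + "json\n"
    '{"greeting": "Hi.", "parting": "Bye.", "idle": "Hmm.", "quest_hook": "Help me.",}\n'
    + FENCE
)
print(parse_dialogue(sample))
```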

### **Adapted from Prior Code**
This category includes standard code required to use the Hugging Face and PyTorch libraries. The structure is adapted from official documentation, while the specific parameters and their integration are part of the student design.

This code is adapted from standard Hugging Face documentation for loading a 4-bit quantized model.
- Library Installation & Imports / Model Loading (Cell 2)
    - BitsAndBytesConfig(...)
    - AutoModelForCausalLM.from_pretrained(...)

This code is adapted from standard Hugging Face PEFT documentation (Cell 4).
- LoRA Configuration


In [None]:
# Installing libraries and Model
!pip install -q transformers torch accelerate datasets peft
!pip install -U bitsandbytes -q

import json
import re
import random
import torch
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from datasets import load_dataset, Dataset
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from google.colab import userdata

try:
    hf_token = userdata.get('Hugging_Face_Acess')
    login(token=hf_token)
    print("Authenticated with Hugging Face.")
except Exception as e:
    print(e)
    print("Authentication failed or secret not found.")

# Define model and device variables
model_id = "google/gemma-2b-it"
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"\nLoading model: {model_id} onto device: {device}")

# Configure for 4-BIT
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16)

# Load the Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load the Base Model (with 4-bit config)
model = AutoModelForCausalLM.from_pretrained( model_id, quantization_config=quantization_config, device_map="auto")

print("\nSetup complete! Base model is loaded in 4-bit and ready.")

In [None]:
# Data preprocessing
print("Loading pre-generated training data from custom character file")

# Define the path to your custom training JSON file
training_file_path = "training_characters.json"

# Function to load the training data from the JSON file
def format_character_sheets(file):
    try:
        with open(file, 'r') as f:
            # Load the entire JSON structure into a dictionary
            character_sheets = json.load(f)
        return character_sheets['characters']
    except FileNotFoundError:
        print(f"Error: The file at '{file}' was not found.")
        return []
    except json.JSONDecodeError:
        print(f"Error: The file at '{file}' is not valid JSON.")
        return []
    except KeyError:
        print(f"Error: The JSON file at '{file}' is missing the required 'characters' key.")
        return []

# Formatting Function for Training
def format_direct_mapping_example(character_obj):
    trait_info = ""
    for key, value in character_obj.items():
        # Exclude the target dialogues from input prompt
        if key != "dialogue_lines":
            formatted_key = key.replace('_', ' ').title()
            trait_info += f"{formatted_key}: {value}\n"
    trait_info = trait_info.strip()
    if not trait_info: return None

    # Format Target Output Dialogue
    target_dialogues = character_obj.get("dialogue_lines")
    if not target_dialogues or not isinstance(target_dialogues, dict): return None

    processed_dialogues = {}
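    # Keys expected in "dialogue_lines". Note the plural "quest_hooks" here, while the
    # inference prompt in Cell 6 asks for "quest_hook". A list keeps the key order stable.
    required_keys = ["greeting", "parting", "idle", "quest_hooks"]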
    for key in required_keys:
        if key not in target_dialogues: return None
        dialogue_options = target_dialogues[key]
        if isinstance(dialogue_options, list) and dialogue_options:
            processed_dialogues[key] = random.choice(dialogue_options)
        elif isinstance(dialogue_options, str):
            processed_dialogues[key] = dialogue_options
        else: return None

    try:
        # Create the target JSON string
        assistant_content = json.dumps(processed_dialogues, indent=None)
        json.loads(assistant_content)
    except (TypeError, json.JSONDecodeError): return None

    # Combine Input and Output into a single text block
    combined_text = f"Character Traits:\n{trait_info}\n\nGenerated Dialogue JSON:\n{assistant_content}"
    return {"text": combined_text}

# Tokenization function
def tokenize_text(example):
    try:
        if 'tokenizer' not in globals(): raise NameError("Tokenizer not defined.")
        # Add EOS token to signal end of generation during training
        text_with_eos = example["text"] + tokenizer.eos_token
        tokenized_output = tokenizer(
            text_with_eos,
            truncation=True,
            max_length=512,
            return_tensors=None
        )
        if not tokenized_output or not tokenized_output.get("input_ids"): return None
        return tokenized_output
    except Exception as e:
        print(f"Error during tokenization for an example: {e}")
        return None

# Load the data
character_data_list = format_character_sheets(training_file_path)

# Initialize tokenized_dataset to None in case loading fails
tokenized_dataset = None

print(f"Loaded {len(character_data_list)} characters from {training_file_path}")

# Apply formatting function
print("Formatting data into combined text examples")
formatted_examples = [format_direct_mapping_example(char) for char in character_data_list]
formatted_examples = [ex for ex in formatted_examples if ex is not None]

if not formatted_examples:
    print("No valid examples remained after formatting. Cannot proceed.")
else:
    print(f"Formatted {len(formatted_examples)} valid training examples")

    # Convert to Hugging Face Dataset
    try:
        hf_dataset = Dataset.from_dict({"text": [ex["text"] for ex in formatted_examples]})
        print("Converted formatted data into Hugging Face Dataset")
    except Exception as e:
        print(f"Unexpected error converting data to Dataset object: {e}")
        hf_dataset = None

    if hf_dataset:
        # Tokenize the dataset
        print("Tokenizing the dataset")
        tokenized_dataset = hf_dataset.map(tokenize_text, remove_columns=['text'])
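        # Drop any example whose tokenization produced no tokens
        # (the previous `x is not None` check was always true for dataset rows)
        tokenized_dataset = tokenized_dataset.filter(lambda x: len(x["input_ids"]) > 0)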

        print(f"Tokenized {len(tokenized_dataset)} valid training examples")
    else:
        print("\nHalting: Could not create Dataset object from formatted data.")



In [None]:
# LoRA Config

print("Applying LoRA configuration to base model")

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)

# Prepare model for k-bit training
model_for_training = prepare_model_for_kbit_training(model)
finetune_model = get_peft_model(model_for_training, lora_config)

# Print trainable parameters to see the efficiency of LoRA
finetune_model.print_trainable_parameters()

print("Model configured with LoRA.")



In [None]:
# Training Loop
print("Setting up optimizer and DataLoader")

# Set Up Optimizer
optimizer = torch.optim.AdamW(finetune_model.parameters(), lr=2e-4) # Learning rate common for LoRA

# Set Up DataLoader
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

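# Stop early if preprocessing did not produce any tokenized examples
if tokenized_dataset is None or len(tokenized_dataset) == 0:
    raise ValueError("tokenized_dataset is empty. Run the preprocessing cell (Cell 3) first.")

train_dataloader = DataLoader(
    tokenized_dataset,
    batch_size=4,
    shuffle=True,
    collate_fn=data_collator
)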

# The Training Loop
num_epochs = 4

# Set model to train
finetune_model.train()

loss_plot_data = []

print("Starting fine-tuning")
for epoch in range(num_epochs):
    total_loss = 0
    num_batches = len(train_dataloader)
    for i, batch in enumerate(train_dataloader):

        # Zero gradients
        optimizer.zero_grad()

        # Explicitly move batch data to the correct device
        batch = {k: v.to(device) for k, v in batch.items() if hasattr(v, 'to')}

        # Forward pass: compute model output and loss
        outputs = finetune_model(**batch)
        loss = outputs.loss

        # Check if loss is valid
        if loss is None or not torch.isfinite(loss):
            print(f"Invalid loss at batch {i+1}.")
            optimizer.zero_grad()
            continue

        total_loss += loss.item()

        # Backward pass
        loss.backward()

        torch.nn.utils.clip_grad_norm_(finetune_model.parameters(), max_norm=1.0)

        # Optimizer step
        optimizer.step()

        # Print progress periodically
        if (i + 1) % 10 == 0:
            print(f"Epoch: {epoch + 1}, Batch: {i+1}/{num_batches}, Loss: {loss.item():.4f}")

    # Calculate and print average loss for the epoch
    if num_batches > 0:
        avg_loss = total_loss / num_batches
        loss_plot_data.append(avg_loss)
        print(f"Epoch: {epoch + 1}, Average Loss: {avg_loss:.4f} ")
    else:
        print(f"Epoch: {epoch + 1}, No batches processed")


print("Training complete!")

# Save the trained LoRA adapters and the tokenizer to output_dir
output_dir = "gemma-npc-finetuned-adapters"
finetune_model.eval()
finetune_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)


if 'loss_plot_data' in locals() and len(loss_plot_data) > 0:
    epochs = range(1, len(loss_plot_data) + 1)

    # Create the plot
    plt.figure(figsize=(10, 6))
    plt.plot(epochs, loss_plot_data, 'b-o', label='Training Loss')

    # Add labels and a title
    plt.title('Training Loss per Epoch')
    plt.xlabel('Epoch')
    plt.ylabel('Average Loss')
    plt.xticks(epochs)
    plt.legend()
    plt.grid(True)

    plt.show()
else:
    print("No loss history found. Please run the training cell (Cell 5) first.")




In [None]:
# Outputs Character Dialogue

# Formats a single character dictionary into a readable string
def formatting_character_input(character):
    information = ""
    for key, value in character.items():
        information += f"{key.replace('_', ' ').title()}: {value}\n"
    return information.strip()

# Generates Dialogue
def generate_dialogue(instructions, character_details, model, tokenizer):
    # Combines prompt instructions with the character details
    prompt = instructions + "\nCharacter Information:\n" + character_details + "\n\nJSON OUTPUT:\n```json\n"

    # Passes the prompt to the model and receives the model response
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=250,
        temperature=0.72,
        do_sample=True
    )

    # Decodes the model response (the decoded text still contains the prompt)
    generated_dialogue = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keeps only the text after the last ```json marker and strips any closing fence
    json_response = generated_dialogue.split("```json")[-1].replace("```", "").strip()
    return json_response

# System Instructions
SYSTEM_INSTRUCTION = """You are a dialogue writer for a video game.
Your task is to generate four lines of dialogue unique to the character: a greeting,
a parting line, an idle comment, and a quest hook related to the character's job.
Respond ONLY with a valid JSON object containing a "greeting", "parting", "idle", and "quest_hook" key."""

print("Running Test Characters")
character_file = 'characters.json'
all_characters = format_character_sheets(character_file)

if all_characters:
    print(f"Successfully loaded {len(all_characters)} character sheets.\n")
    for char in all_characters:
        details = formatting_character_input(char)

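        # Generates the dialogue with the LoRA-adapted model from Cell 4
        final_json = generate_dialogue(SYSTEM_INSTRUCTION, details, finetune_model, tokenizer)
        # Removes trailing commas before a closing brace or bracket
        final_json = re.sub(r',\s*([}\]])', r'\1', final_json)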

        print(f"Character Sheet for {char['name']}")
        try:
            data = json.loads(final_json)
            for key, value in data.items():
                print(f"  {key.capitalize()}: {value}")
        except json.JSONDecodeError:
            print("ERROR: Model did not return valid JSON.")
            print(f"Raw output: {final_json}")
        # Blank lines between characters
        print("\n")
else:
    print("No character sheets were loaded.")

