# 🎬🎯 Sitcom Script Evaluation

This notebook evaluates pilot episodes of a sitcom generated using two different approaches: a baseline model and a ReAct-based multi-agent system. Each episode consists of 20 scenes, divided into 5 blocks of 4 scenes. The evaluations are performed by a language model (LLM), which scores each block using a structured rubric covering **Coherence**, **Relevance**, **Interestingness**, **Humor**, and **Overall Quality**. For context, the LLM is given both the current scene block and the preceding block's scene descriptions. Results are compiled and compared across episodes to assess which generation method produces stronger sitcom writing.

## Mounting Drive and Appending to the System Path

In [1]:
# Import necessary modules
from google.colab import drive
import sys

# Mount Google Drive to access files stored in your Google Drive
drive.mount('/content/drive')

# NOTE: Update the paths below to match the location of your project files in Google Drive.
# Replace with your own directory if different.

# Add the main 'utils' directory to Python's module search path
# This allows you to import custom utility modules from this folder
sys.path.append("/content/drive/MyDrive/Spring 2025/Gen AI with LLM/Project/utils")

# Add the 'agents' subdirectory to the module search path
# Useful if you have agent-specific Python scripts you want to import
sys.path.append("/content/drive/MyDrive/Spring 2025/Gen AI with LLM/Project/utils/agents")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Importing Libraries and Packages

In [2]:
# Import utility functions for processing and evaluating sitcom scripts
from text_utils import (
    get_episode_concept,
    get_scene_description_block,
    extract_and_partition_scenes
)

from eval_utils import (
    evaluate_scene_block,
    evaluate_episode_blocks,
    extract_evaluation_scores
)

In [3]:
# Import core libraries for API access, environment interaction, and data analysis
from openai import OpenAI
from google.colab import userdata
import os
import numpy as np
import pandas as pd

### 🔑 Initializing the OpenAI API Client

In [4]:
api_key = userdata.get('OPENAI_API_KEY')
client = OpenAI(api_key=api_key)

### Reading Title, Episode Outline, and Baseline and ReAct Scripts

Below, we read in the title, episode outline, and baseline and ReAct scripts for the five episodes.

In [5]:
# Set base directory
base_path = "/content/drive/MyDrive/Spring 2025/Gen AI with LLM/Project/pilot episode data"

# Load and assign variables dynamically
for i in range(1, 6):
    folder = f"pilot episode {i}"
    folder_path = os.path.join(base_path, folder)

    try:
        # Sitcom title
        with open(os.path.join(folder_path, f"sitcom_title_{i}.txt"), "r") as f:
            globals()[f"sitcom_title_{i}"] = f.read().strip()

        # Episode outline
        with open(os.path.join(folder_path, f"episode_outline_{i}.txt"), "r") as f:
            globals()[f"episode_outline_{i}"] = f.read()

        # Baseline pilot script
        with open(os.path.join(folder_path, f"baseline_pilot_ep_scrpt_{i}.txt"), "r") as f:
            globals()[f"baseline_pilot_ep_scrpt_{i}"] = f.read()

        # ReAct pilot script
        with open(os.path.join(folder_path, f"react_pilot_ep_scrpt_{i}.txt"), "r") as f:
            globals()[f"react_pilot_ep_scrpt_{i}"] = f.read()

    except FileNotFoundError as e:
        print(f"⚠️ Missing file in episode {i}: {e}")

### Displaying First 200 Characters for Episode Title, Outline, Baseline and ReAct Scripts

In [6]:
for i in range(1, 6):
    print(f"\n{'='*30} EPISODE {i} {'='*30}\n")

    # Sitcom title
    print(f"Sitcom Title {i}:\n{globals()[f'sitcom_title_{i}']}\n")

    # Episode outline preview
    print(f"Episode Outline {i} (first 200 chars):\n{globals()[f'episode_outline_{i}'][:200]}\n")

    # Baseline script preview
    print(f"{'-'*10} BASELINE SCRIPT (first 200 chars) {'-'*10}")
    print(globals()[f'baseline_pilot_ep_scrpt_{i}'][:200], "\n")

    # ReAct script preview
    print(f"{'-'*10} REACT SCRIPT (first 200 chars) {'-'*10}")
    print(globals()[f'react_pilot_ep_scrpt_{i}'][:200], "\n")



Sitcom Title 1:
Unlocked Potential

Episode Outline 1 (first 200 chars):
Episode Concept: Frankie's estranged daughter, Isabella, moves to NYC and unexpectedly becomes the manager of his locksmith shop; their quirky dynamic is tested when Frankie's parole officer, Gary, st

---------- BASELINE SCRIPT (first 200 chars) ----------
### Scene 1 ###
# Scene 1

INT. FRANKIE'S LOCKSMITH SHOP - MORNING

FRANKIE, a slightly grizzled but spry locksmith with a twinkle in his eye, flips the sign on his shop door from "CLOSED" to "OPEN."
 

---------- REACT SCRIPT (first 200 chars) ----------
### Scene 1 ###
# Scene 1

INT. FRANKIE'S LOCKSMITH SHOP - DAY

FRANKIE, a middle-aged man with a big personality and a bigger toolbox, flips the sign on the door from “CLOSED” to “OPEN.”

FRANKIE:
(c 



Sitcom Title 2:
Key Changes

Episode Outline 2 (first 200 chars):
Episode Concept: In the pilot episode, "A Fresh Start", ex-con Jimmy takes over his late uncle's locksmith business and his estrange

### Extracting Episode Concepts
Below, we extract the episode concept for each of the five episodes.
Since both the Baseline and ReAct versions share the same concept, a single concept per episode is extracted. These concepts will be used as context during scene block evaluation.

In [7]:
# Extract episode concepts for episodes 1 through 5
for i in range(1, 6):
    outline = globals()[f"episode_outline_{i}"]
    concept = get_episode_concept(outline)
    globals()[f"episode_concept_{i}"] = concept

# Print to verify
for i in range(1, 6):
    print(f"\nEpisode {i} Concept:\n{globals()[f'episode_concept_{i}']}")


Episode 1 Concept:
Frankie's estranged daughter, Isabella, moves to NYC and unexpectedly becomes the manager of his locksmith shop; their quirky dynamic is tested when Frankie's parole officer, Gary, stirs up trouble.

Episode 2 Concept:
In the pilot episode, "A Fresh Start", ex-con Jimmy takes over his late uncle's locksmith business and his estranged daughter, Lisa, shows up to help, while their quirky parole officer, Stanley, introduces himself with a Broadway-inspired entrance.

Episode 3 Concept:
** Jimmy's estranged daughter, Sophie, unexpectedly shows up at his locksmith shop, the same day he receives a mysterious package from his past. Meanwhile, his parole officer, Earl, is dealing with his own issues, leading to a day full of chaos, laughter, and heart.

Episode 4 Concept:
Jimmy, a former convict, opens a locksmith shop in NYC and is unexpectedly joined by his business school dropout daughter, Samantha. Their first day of running the business together is filled with challeng

### Scene Description Blocks

A potential limitation of evaluating each block of 4 scenes in isolation is that the language model lacks prior narrative context beyond the episode concept. To improve coherence evaluations, we provide the model with the scene descriptions of the previous 4 scenes as additional context, starting from Block 2. Block 1 (Scenes 1–4) has no preceding scenes, so it is evaluated with the episode concept alone.

Below, we extract 5 blocks of 4 scene descriptions for each of the 5 episode outlines, resulting in a total of 25 blocks.

In [8]:
# Shared scene descriptions for Episodes 1–5 (used by both Baseline and ReAct)
scene_description_blocks = {}

for i in range(1, 6):  # Episodes 1 to 5
    blocks = get_scene_description_block(globals()[f"episode_outline_{i}"], block_size=4)
    scene_description_blocks[i] = {
        f"scene_block_{j+1}": block
        for j, block in enumerate(blocks)
    }
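
As a rough illustration of the chunking and context pairing described in this section, the sketch below splits 20 placeholder scene descriptions into blocks of 4 and pairs each block with its predecessor. The helpers `partition` and `with_previous_context` are hypothetical names introduced here; the actual `get_scene_description_block` in `text_utils` is not shown in this notebook and may work differently.

```python
def partition(items, block_size=4):
    """Split a list into consecutive blocks of block_size items."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def with_previous_context(blocks):
    """Pair each block with the block before it (None for the first block)."""
    return [(blocks[k - 1] if k > 0 else None, blocks[k]) for k in range(len(blocks))]


# Placeholder descriptions standing in for the 20 outline entries of one episode
descriptions = [f"Scene {n} description" for n in range(1, 21)]
blocks = partition(descriptions, block_size=4)
pairs = with_previous_context(blocks)

for previous, current in pairs:
    context = "none (first block)" if previous is None else previous[0] + " ..."
    print(f"{current[0]} ... | previous context: {context}")
```

Keeping the "previous" element as `None` for the first block makes the missing context explicit, so the evaluation step can decide how to phrase the prompt for Scenes 1 to 4.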


### Extracting Scenes
We extract 5 blocks of 4 scenes for each episode.
This process is performed for both the Baseline and ReAct versions across 5 episodes, resulting in a total of 50 scene blocks.
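
The script previews earlier in the notebook show each scene introduced by a `### Scene N ###` marker line. Assuming that marker format holds throughout, one possible way to split a script and group it into blocks is sketched below. `split_scenes` and `partition_scenes` are hypothetical helpers, not the implementation of `extract_and_partition_scenes`.

```python
import re


def split_scenes(script):
    """Return {scene_number: scene_text} using '### Scene N ###' marker lines."""
    parts = re.split(r"^### Scene (\d+) ###[ \t]*$", script, flags=re.MULTILINE)
    # With one capturing group, parts = [preamble, "1", text1, "2", text2, ...]
    return {int(num): text.strip() for num, text in zip(parts[1::2], parts[2::2])}


def partition_scenes(scenes, block_size=4):
    """Group scenes (in numeric order) into named blocks of block_size scenes."""
    numbers = sorted(scenes)
    return {
        f"scene_block_{j + 1}": "\n\n".join(scenes[n] for n in numbers[i:i + block_size])
        for j, i in enumerate(range(0, len(numbers), block_size))
    }


# Toy script with 8 scenes, standing in for a real 20-scene pilot script
toy_script = "\n".join(
    f"### Scene {n} ###\nINT. SHOP - DAY\nLine for scene {n}." for n in range(1, 9)
)
scenes = split_scenes(toy_script)
blocks = partition_scenes(scenes, block_size=4)
print(list(blocks))
```

Returning a dictionary keyed by `scene_block_<n>` matches how the block dictionaries are indexed in the evaluation cells.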

In [9]:
# Store all baseline and ReAct scene blocks by episode number
baseline_script_blocks = {}
react_script_blocks = {}

for i in range(1, 6):
    # ----- Baseline -----
    baseline_script = globals()[f"baseline_pilot_ep_scrpt_{i}"]
    baseline_blocks = extract_and_partition_scenes(baseline_script, block_size=4)

    baseline_script_blocks[i] = {
        f"scene_block_{j+1}": block for j, block in enumerate(baseline_blocks.values())
    }
    for key, block in baseline_script_blocks[i].items():
        globals()[f"{key}_ep{i}_baseline"] = block

    # ----- ReAct -----
    react_script = globals().get(f"react_pilot_ep_scrpt_{i}")
    if react_script:
        react_blocks = extract_and_partition_scenes(react_script, block_size=4)

        react_script_blocks[i] = {
            f"scene_block_{j+1}": block for j, block in enumerate(react_blocks.values())
        }
        for key, block in react_script_blocks[i].items():
            globals()[f"{key}_ep{i}_react"] = block
    else:
        print(f"⚠️ ReAct script for episode {i} not found")

## ReAct Evaluation

### Episode 1

In [10]:
# Evaluate ReAct Episode 1 using scene descriptions and generated script blocks
react_evaluations_ep1 = evaluate_episode_blocks(
    client=client,
    episode_num=1,
    script_blocks=react_script_blocks[1],
    description_blocks=scene_description_blocks[1],
    episode_concept=episode_concept_1
)
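
The internals of `evaluate_episode_blocks` live in `eval_utils` and are not shown here. As a hedged illustration of how a rubric prompt of this kind could be assembled, the sketch below builds a prompt from the episode concept, the current script block, and (when available) the previous block's scene descriptions. The function name, the wording of the prompt, and the 1 to 10 scale are all assumptions made for this example.

```python
CRITERIA = ["Coherence", "Relevance", "Interestingness", "Humor", "Overall Quality"]


def build_evaluation_prompt(episode_concept, script_block, previous_descriptions=None):
    """Assemble a rubric prompt for one scene block (hypothetical wording)."""
    lines = [
        "You are evaluating four consecutive scenes from a sitcom pilot.",
        f"Episode concept: {episode_concept}",
    ]
    # Only blocks 2 onward have descriptions of preceding scenes to include
    if previous_descriptions:
        lines.append("Descriptions of the preceding four scenes:")
        lines.append(previous_descriptions)
    lines.append("Scenes to evaluate:")
    lines.append(script_block)
    # Assumed scale and output format, mirroring the printed evaluations
    lines.append("Score each criterion from 1 to 10 as an integer, one per line:")
    lines.append("- <Criterion>: <score> – <justification>")
    lines.append("Criteria: " + ", ".join(CRITERIA))
    return "\n".join(lines)


first_prompt = build_evaluation_prompt("A locksmith and his daughter.", "Scene text 1-4")
later_prompt = build_evaluation_prompt(
    "A locksmith and his daughter.", "Scene text 5-8", "Descriptions of scenes 1-4"
)
print(later_prompt)
```

Asking explicitly for integer scores in a fixed line format makes the responses easier to parse into a table afterwards.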

In [11]:
# Access and print all scene block evaluations for Episode 1 (ReAct version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (ReAct) {'=' * 20}\n")
    print(react_evaluations_ep1[f"scene_block_{i}"])



- Coherence: 8 – The scenes flow logically from one to the next, with a clear progression of events. However, there is some confusion in Scene 3 where Mr. Bianchi is introduced, but it's unclear if this is Frankie's alias or a different character. This inconsistency slightly disrupts the coherence.
- Relevance: 9 – The scenes support the episode concept well, introducing the main characters and setting up their dynamics. The scenes also hint at the character arcs, with Isabella's adaptability and Frankie's laid-back attitude. However, the subplot with Frankie's parole officer, Gary, is not introduced in these scenes.
- Interestingness: 7 – The scenes are engaging and set up an interesting dynamic between Frankie and Isabella. However, the narrative could be more dynamic with the introduction of more conflict or unexpected events.
- Humor: 8 – The comedy is well-timed and character-driven, with humor arising from the characters' interactions and personalities. However, the hum

In [12]:
# Extract scores into a DataFrame for ReAct Episode 1
react_scores_ep1_df = extract_evaluation_scores(
    react_evaluations_ep1,
    episode_number=1,
    version_label="ReAct"
)

display(react_scores_ep1_df)

Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,1,ReAct,scene_block_1,Coherence,8,The scenes flow logically from one to the next...
1,1,ReAct,scene_block_1,Relevance,9,"The scenes support the episode concept well, i..."
2,1,ReAct,scene_block_1,Interestingness,7,The scenes are engaging and set up an interest...
3,1,ReAct,scene_block_1,Humor,8,"The comedy is well-timed and character-driven,..."
4,1,ReAct,scene_block_1,Overall Quality,8,"The structure of the scenes is solid, and the ..."
5,1,ReAct,scene_block_2,Coherence,6,"The scenes are generally coherent, but there a..."
6,1,ReAct,scene_block_2,Relevance,7,The scenes are relevant to the episode concept...
7,1,ReAct,scene_block_2,Interestingness,8,The scenes are engaging and full of character-...
8,1,ReAct,scene_block_2,Humor,8,"The humor is well-timed and character-driven, ..."
9,1,ReAct,scene_block_2,Overall Quality,7,The scenes are well-structured and the tone is...


### Episode 2

In [13]:
# Evaluate ReAct Episode 2 using scene descriptions and generated script blocks
react_evaluations_ep2 = evaluate_episode_blocks(
    client=client,
    episode_num=2,
    script_blocks=react_script_blocks[2],
    description_blocks=scene_description_blocks[2],
    episode_concept=episode_concept_2,
)

In [14]:
# Access and print all scene block evaluations for Episode 2 (ReAct version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (ReAct) {'=' * 20}\n")
    print(react_evaluations_ep2[f"scene_block_{i}"])



- Coherence: 8 – The scenes flow logically from one to the next, maintaining a consistent narrative. However, the introduction of Lisa in Scene 3 could have been better established, as her relationship to Jimmy is not immediately clear.
- Relevance: 9 – The scenes support the episode concept well, showing Jimmy's struggles and triumphs as a locksmith. The character arcs are also well established, with Jimmy's relationship with Lisa and his interactions with his customers and neighbors providing depth to his character.
- Interestingness: 7 – The scenes are engaging and provide a good mix of character development and humor. However, the narrative could be more dynamic, with more unexpected twists or conflicts to keep the audience engaged.
- Humor: 8 – The comedy is well-timed and character-driven, with Jimmy's lock-related puns and interactions with his customers providing much of the humor. The humor could be more varied, however, as the reliance on puns and slapstick might be

In [15]:
# Extract scores into a DataFrame for ReAct Episode 2
react_scores_ep2_df = extract_evaluation_scores(
    react_evaluations_ep2,
    episode_number=2,
    version_label="ReAct"
)

display(react_scores_ep2_df)

Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,2,ReAct,scene_block_1,Coherence,8,The scenes flow logically from one to the next...
1,2,ReAct,scene_block_1,Relevance,9,"The scenes support the episode concept well, s..."
2,2,ReAct,scene_block_1,Interestingness,7,The scenes are engaging and provide a good mix...
3,2,ReAct,scene_block_1,Humor,8,"The comedy is well-timed and character-driven,..."
4,2,ReAct,scene_block_1,Overall Quality,8,"The structure, tone, and genre fit of the scen..."
5,2,ReAct,scene_block_2,Coherence,8,The scenes flow logically from one to the next...
6,2,ReAct,scene_block_2,Relevance,9,"The scenes support the episode concept, charac..."
7,2,ReAct,scene_block_2,Interestingness,7,The scenes are engaging and narratively dynami...
8,2,ReAct,scene_block_2,Humor,7,"The comedy is well-timed and character-driven,..."
9,2,ReAct,scene_block_2,Overall Quality,8,"The structure, tone, and genre fit are all sol..."


### Episode 3

In [16]:
# Evaluate ReAct Episode 3 using scene descriptions and generated script blocks
react_evaluations_ep3 = evaluate_episode_blocks(
    client=client,
    episode_num=3,
    script_blocks=react_script_blocks[3],
    description_blocks=scene_description_blocks[3],
    episode_concept=episode_concept_3,
)

In [17]:
# Access and print all scene block evaluations for Episode 3 (ReAct version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (ReAct) {'=' * 20}\n")
    print(react_evaluations_ep3[f"scene_block_{i}"])



- Coherence: 8 – The scenes flow logically from one to the next, maintaining a consistent tone and setting. However, the introduction of Sophie, Jimmy's estranged daughter, is missing from these scenes, which is a key part of the episode's concept.
- Relevance: 7 – The scenes support the episode concept of Jimmy's life as a locksmith and his interactions with his friends and neighbors. However, the character arcs and prior developments are not clearly established in these scenes. The mysterious package from Jimmy's past is introduced, but its significance is not fully explored.
- Interestingness: 8 – The scenes are original and engaging, with a variety of comedic situations and character interactions. The use of a locksmith shop as the main setting is unique and provides opportunities for interesting narrative dynamics. However, the scenes could benefit from more dramatic tension or conflict to heighten the stakes.
- Humor: 9 – The comedy is well-timed and character-driven, w

In [18]:
# Extract scores into a DataFrame for ReAct Episode 3
react_scores_ep3_df = extract_evaluation_scores(
    react_evaluations_ep3,
    episode_number=3,
    version_label="ReAct"
)

display(react_scores_ep3_df)

Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,3,ReAct,scene_block_1,Coherence,8,The scenes flow logically from one to the next...
1,3,ReAct,scene_block_1,Relevance,7,The scenes support the episode concept of Jimm...
2,3,ReAct,scene_block_1,Interestingness,8,"The scenes are original and engaging, with a v..."
3,3,ReAct,scene_block_1,Humor,9,"The comedy is well-timed and character-driven,..."
4,3,ReAct,scene_block_1,Overall Quality,8,The scenes are well-structured and the tone is...
5,3,ReAct,scene_block_2,Coherence,9,The scenes flow logically from one to the next...
6,3,ReAct,scene_block_2,Relevance,10,Each scene supports the episode concept and ch...
7,3,ReAct,scene_block_2,Interestingness,8,"The scenes are engaging and original, with a g..."
8,3,ReAct,scene_block_2,Humor,9,"The comedy is well-timed and character-driven,..."
9,3,ReAct,scene_block_2,Overall Quality,9,"The structure, tone, and genre fit are all exc..."


### Episode 4

In [19]:
# Evaluate ReAct Episode 4 using scene descriptions and generated script blocks
react_evaluations_ep4 = evaluate_episode_blocks(
    client=client,
    episode_num=4,
    script_blocks=react_script_blocks[4],
    description_blocks=scene_description_blocks[4],
    episode_concept=episode_concept_4,
)

In [20]:
# Access and print all scene block evaluations for Episode 4 (ReAct version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (ReAct) {'=' * 20}\n")
    print(react_evaluations_ep4[f"scene_block_{i}"])



- Coherence: 7 – The scenes generally flow logically, but there are some inconsistencies. For example, in Scene 2, the setting is described as "Jimmy's Novelty Shop," but in all other scenes, it's a locksmith shop. Also, in Scene 1, Lucy is introduced as a new employee, but she doesn't appear in the following scenes.
- Relevance: 8 – The scenes support the episode concept and character arcs well. We see Jimmy's transition from a solo locksmith to a business partner with his daughter, Samantha. Samantha's eco-friendly ideas and Jimmy's old-school approach create a nice dynamic. However, Lucy's introduction in Scene 1 doesn't seem to have any follow-up.
- Interestingness: 8 – The scenes are engaging and narratively dynamic. The introduction of Samantha's eco-friendly ideas and Terry's eccentric character add interesting elements to the story. However, the scenes could benefit from more conflict or unexpected turns.
- Humor: 7 – The humor is character-driven and appropriately va

In [21]:
# Extract scores into a DataFrame for ReAct Episode 4
react_scores_ep4_df = extract_evaluation_scores(
    react_evaluations_ep4,
    episode_number=4,
    version_label="ReAct"
)

display(react_scores_ep4_df)

⚠️ Skipped unmatched line in scene_block_1: - Overall Quality: 7.5 – The scenes are well-structured and the tone is consistent, fitting the sitcom genre well. The writing is polished, but there are some inconsistencies and missed opportunities for more original humor and interesting plot developments.


Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,4,ReAct,scene_block_1,Coherence,7,"The scenes generally flow logically, but there..."
1,4,ReAct,scene_block_1,Relevance,8,The scenes support the episode concept and cha...
2,4,ReAct,scene_block_1,Interestingness,8,The scenes are engaging and narratively dynami...
3,4,ReAct,scene_block_1,Humor,7,The humor is character-driven and appropriatel...
4,4,ReAct,scene_block_2,Coherence,9,The scenes flow logically from one to the next...
5,4,ReAct,scene_block_2,Relevance,10,Each scene supports the episode concept and bu...
6,4,ReAct,scene_block_2,Interestingness,8,"The scenes are engaging and original, with a g..."
7,4,ReAct,scene_block_2,Humor,8,"The comedy is well-timed and character-driven,..."
8,4,ReAct,scene_block_2,Overall Quality,9,"The scenes are well-structured, with a good ba..."
9,4,ReAct,scene_block_3,Coherence,8,The scenes flow logically from one to the next...
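
The "Skipped unmatched line" warning above is raised for a line whose score is `7.5`, which suggests (though the parser in `eval_utils` is not shown) that the matching pattern accepts only integer scores. A minimal sketch of a pattern that tolerates decimal scores follows. The regular expression and the `parse_score_line` helper are assumptions for illustration, not the code behind `extract_evaluation_scores`.

```python
import re

# Accepts "8" as well as "7.5"; the separator may be an en dash or a hyphen.
SCORE_LINE = re.compile(
    r"^\s*-\s*(?P<criterion>[A-Za-z ]+?):\s*"
    r"(?P<score>\d+(?:\.\d+)?)\s*[–-]\s*(?P<justification>.+)$"
)


def parse_score_line(line):
    """Parse '- Criterion: score – justification'; return None if it does not match."""
    match = SCORE_LINE.match(line)
    if match is None:
        return None
    return {
        "criterion": match.group("criterion").strip(),
        "score": float(match.group("score")),
        "justification": match.group("justification").strip(),
    }


decimal_row = parse_score_line(
    "- Overall Quality: 7.5 – The scenes are well-structured and the tone is consistent."
)
integer_row = parse_score_line("- Humor: 8 - The comedy is well-timed.")
print(decimal_row["criterion"], decimal_row["score"])
```

Storing scores as floats keeps rows such as `Overall Quality` in the table instead of dropping them, at the cost of mixing integer and fractional values in one column. An alternative is to instruct the evaluator to return integers only and retry on a mismatch.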


### Episode 5

In [22]:
# Evaluate ReAct Episode 5 using scene descriptions and generated script blocks
react_evaluations_ep5 = evaluate_episode_blocks(
    client=client,
    episode_num=5,
    script_blocks=react_script_blocks[5],
    description_blocks=scene_description_blocks[5],
    episode_concept=episode_concept_5
)

In [23]:
# Access and print all scene block evaluations for Episode 5 (ReAct version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (ReAct) {'=' * 20}\n")
    print(react_evaluations_ep5[f"scene_block_{i}"])



- Coherence: 9 – The scenes flow logically from one to the next, maintaining a consistent narrative. The characters' actions and reactions are believable and consistent with their established personalities. The only minor issue is the sudden introduction of the parrot in the last scene, which could have been foreshadowed earlier.

- Relevance: 10 – The scenes are highly relevant to the episode concept, character arcs, and prior developments. They establish Frankie's new life as a locksmith, his relationship with Earl and Len, and his struggle to adapt to his new circumstances. They also introduce the subplot of Frankie's struggle to master locksmithing, which could be a recurring theme in future episodes.

- Interestingness: 8 – The scenes are engaging and narratively dynamic, with a good mix of dialogue, action, and character development. However, they could be more original, as the concept of an ex-convict adjusting to life on the outside is a common trope in sitcoms.

- Hu

In [24]:
# Extract scores into a DataFrame for ReAct Episode 5
react_scores_ep5_df = extract_evaluation_scores(
    react_evaluations_ep5,
    episode_number=5,
    version_label="ReAct"
)

display(react_scores_ep5_df)

Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,5,ReAct,scene_block_1,Coherence,9,The scenes flow logically from one to the next...
1,5,ReAct,scene_block_1,Relevance,10,The scenes are highly relevant to the episode ...
2,5,ReAct,scene_block_1,Interestingness,8,The scenes are engaging and narratively dynami...
3,5,ReAct,scene_block_1,Humor,7,"The comedy is well-timed and character-driven,..."
4,5,ReAct,scene_block_1,Overall Quality,8,The scenes are well-structured and well-writte...
5,5,ReAct,scene_block_2,Coherence,7,"The scenes generally flow logically, but there..."
6,5,ReAct,scene_block_2,Relevance,8,The scenes support the episode concept and cha...
7,5,ReAct,scene_block_2,Interestingness,9,"The scenes are original and engaging, with a g..."
8,5,ReAct,scene_block_2,Humor,8,"The comedy is well-timed and character-driven,..."
9,5,ReAct,scene_block_2,Overall Quality,8,The scenes are well-structured and well-writte...
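
To compare episodes or methods, the per-episode score tables need to be stacked and averaged. The sketch below shows one way to do that with pandas, using toy rows that mimic the column layout of the tables above. The toy values and the variable names `all_scores` and `mean_by_episode` are illustrative only and do not represent the notebook's results.

```python
import pandas as pd

# Toy rows with the same columns as the score tables (values are placeholders)
toy_ep1 = pd.DataFrame({
    "episode": [1, 1],
    "version": ["ReAct", "ReAct"],
    "scene_block": ["scene_block_1", "scene_block_2"],
    "criterion": ["Coherence", "Coherence"],
    "score": [8, 6],
})
toy_ep2 = pd.DataFrame({
    "episode": [2, 2],
    "version": ["ReAct", "ReAct"],
    "scene_block": ["scene_block_1", "scene_block_2"],
    "criterion": ["Coherence", "Coherence"],
    "score": [8, 8],
})

# Stack the per-episode tables, then average each criterion within each episode
all_scores = pd.concat([toy_ep1, toy_ep2], ignore_index=True)
mean_by_episode = (
    all_scores.groupby(["episode", "criterion"])["score"].mean().unstack("criterion")
)
print(mean_by_episode)
```

In the notebook itself, the list passed to `pd.concat` would hold the real per-episode DataFrames (for example `react_scores_ep1_df` through `react_scores_ep5_df`), and adding `"version"` to the grouping keys would allow a Baseline versus ReAct comparison once both sets exist.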


## Baseline Evaluation

### Episode 1

In [25]:
# Evaluate Baseline Episode 1 using scene descriptions and generated script blocks
baseline_evaluations_ep1 = evaluate_episode_blocks(
    client=client,
    episode_num=1,
    script_blocks=baseline_script_blocks[1],
    description_blocks=scene_description_blocks[1],
    episode_concept=episode_concept_1
)

In [26]:
# Access and print all scene block evaluations for Episode 1 (Baseline version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (Baseline) {'=' * 20}\n")
    print(baseline_evaluations_ep1[f"scene_block_{i}"])




- Coherence: 9 – The scenes flow logically from one to the next, maintaining a consistent narrative. The introduction of characters and their interactions are well-paced and coherent. The only minor issue is the abrupt introduction of Isabella in Scene 3, which could have been hinted at earlier.

- Relevance: 10 – The scenes are highly relevant to the episode concept. They introduce the main characters, Frankie and Isabella, and set up their dynamic. The scenes also establish the setting of the locksmith shop and the challenges that come with it.

- Interestingness: 8 – The scenes are engaging and narratively dynamic. The introduction of Isabella and her subsequent interaction with Frankie adds an interesting twist to the story. However, the scenes could benefit from more unexpected or surprising elements to keep the audience on their toes.

- Humor: 7 – The humor is character-driven and well-timed, with a good mix of situational and dialogue-based comedy. However, some of

In [27]:
# Extract scores into a DataFrame for Baseline Episode 1
baseline_scores_ep1_df = extract_evaluation_scores(
    baseline_evaluations_ep1,
    episode_number=1,
    version_label="Baseline"
)

display(baseline_scores_ep1_df)

Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,1,Baseline,scene_block_1,Coherence,9,The scenes flow logically from one to the next...
1,1,Baseline,scene_block_1,Relevance,10,The scenes are highly relevant to the episode ...
2,1,Baseline,scene_block_1,Interestingness,8,The scenes are engaging and narratively dynami...
3,1,Baseline,scene_block_1,Humor,7,"The humor is character-driven and well-timed, ..."
4,1,Baseline,scene_block_1,Overall Quality,8,The scenes are well-structured and the tone is...
5,1,Baseline,scene_block_2,Coherence,9,The scenes flow logically from one to the next...
6,1,Baseline,scene_block_2,Relevance,10,The scenes are highly relevant to the episode ...
7,1,Baseline,scene_block_2,Interestingness,8,The scenes are engaging and narratively dynami...
8,1,Baseline,scene_block_2,Humor,8,"The comedy is well-timed and character-driven,..."
9,1,Baseline,scene_block_2,Overall Quality,8,"The scenes are well-structured, with a clear p..."


### Episode 2

In [28]:
# Evaluate Baseline Episode 2 using scene descriptions and generated script blocks
baseline_evaluations_ep2 = evaluate_episode_blocks(
    client=client,
    episode_num=2,
    script_blocks=baseline_script_blocks[2],
    description_blocks=scene_description_blocks[2],
    episode_concept=episode_concept_2
)

In [29]:
# Access and print all scene block evaluations for Episode 2 (Baseline version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (Baseline) {'=' * 20}\n")
    print(baseline_evaluations_ep2[f"scene_block_{i}"])



- Coherence: 9 – The scenes flow logically from one to the next, maintaining a consistent narrative. The introduction of Lisa is a bit abrupt, but it's a necessary surprise for the plot.
- Relevance: 8 – The scenes support the episode concept and character arcs well. Jimmy's struggle with the keys and his interaction with Mrs. O'Leary establish his character and the setting. Lisa's introduction and her offer to help modernize the shop introduce a new dynamic. However, the parole officer Stanley is missing from these scenes.
- Interestingness: 7 – The scenes are engaging and narratively dynamic, with a good mix of humor and emotional moments. The surprise introduction of Lisa adds an interesting twist. However, the scenes could benefit from more conflict or unexpected developments.
- Humor: 7 – The humor is character-driven and well-timed, with a good mix of physical comedy and witty dialogue. However, some of the jokes are a bit predictable and the laugh track can feel forced

In [30]:
# Extract scores into a DataFrame for Baseline Episode 2
baseline_scores_ep2_df = extract_evaluation_scores(
    baseline_evaluations_ep2,
    episode_number=2,
    version_label="Baseline"
)

display(baseline_scores_ep2_df)

Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,2,Baseline,scene_block_1,Coherence,9,The scenes flow logically from one to the next...
1,2,Baseline,scene_block_1,Relevance,8,The scenes support the episode concept and cha...
2,2,Baseline,scene_block_1,Interestingness,7,The scenes are engaging and narratively dynami...
3,2,Baseline,scene_block_1,Humor,7,"The humor is character-driven and well-timed, ..."
4,2,Baseline,scene_block_1,Overall Quality,8,"The structure, tone, and genre fit are all sol..."
5,2,Baseline,scene_block_2,Coherence,9,The scenes flow logically from one to the next...
6,2,Baseline,scene_block_2,Relevance,10,"Each scene supports the episode concept, chara..."
7,2,Baseline,scene_block_2,Interestingness,8,The scenes are engaging and narratively dynami...
8,2,Baseline,scene_block_2,Humor,8,"The comedy is well-timed and character-driven,..."
9,2,Baseline,scene_block_2,Overall Quality,9,"The structure, tone, and genre fit are all str..."


### Episode 3

In [31]:
# Evaluate Baseline Episode 3 using scene descriptions and generated script blocks
baseline_evaluations_ep3 = evaluate_episode_blocks(
    client=client,
    episode_num=3,
    script_blocks=baseline_script_blocks[3],
    description_blocks=scene_description_blocks[3],
    episode_concept=episode_concept_3
)

In [32]:
# Access and print all scene block evaluations for Episode 3 (Baseline version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (Baseline) {'=' * 20}\n")
    print(baseline_evaluations_ep3[f"scene_block_{i}"])



- Coherence: 9 – The scenes flow logically from one to the next, maintaining a consistent narrative. The only minor issue is the sudden introduction of Earl, the parole officer, which could have been hinted at earlier for a smoother transition.
- Relevance: 8 – The scenes support the episode concept and character arcs well, revealing Jimmy's past and his relationship with Earl. However, the estranged daughter, Sophie, mentioned in the concept is not present in these scenes, which is a significant omission.
- Interestingness: 8 – The scenes are engaging and narratively dynamic, with the mystery of the giant key and Jimmy's past as a criminal lock-picker. However, the scenes could benefit from more conflict or tension to heighten the drama and interest.
- Humor: 7 – The humor is character-driven and appropriately varied, with Jimmy's quirky interactions with his keys and Earl's unexpected interest in lock-picking. However, some of the jokes feel a bit forced and the laugh track

In [33]:
# Extract scores into a DataFrame for Baseline Episode 3
baseline_scores_ep3_df = extract_evaluation_scores(
    baseline_evaluations_ep3,
    episode_number=3,
    version_label="Baseline"
)

display(baseline_scores_ep3_df)

Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,3,Baseline,scene_block_1,Coherence,9,The scenes flow logically from one to the next...
1,3,Baseline,scene_block_1,Relevance,8,The scenes support the episode concept and cha...
2,3,Baseline,scene_block_1,Interestingness,8,The scenes are engaging and narratively dynami...
3,3,Baseline,scene_block_1,Humor,7,The humor is character-driven and appropriatel...
4,3,Baseline,scene_block_1,Overall Quality,8,"The structure, tone, and genre fit are solid, ..."
5,3,Baseline,scene_block_2,Coherence,9,The scenes flow logically from one to the next...
6,3,Baseline,scene_block_2,Relevance,10,The scenes are highly relevant to the episode ...
7,3,Baseline,scene_block_2,Interestingness,8,The scenes are engaging and narratively dynami...
8,3,Baseline,scene_block_2,Humor,7,The humor is character-driven and appropriatel...
9,3,Baseline,scene_block_2,Overall Quality,8,"The structure, tone, and genre fit are all sol..."


### Episode 4

In [34]:
# Evaluate Baseline Episode 4 using scene descriptions and generated script blocks
baseline_evaluations_ep4 = evaluate_episode_blocks(
    client=client,
    episode_num=4,
    script_blocks=baseline_script_blocks[4],
    description_blocks=scene_description_blocks[4],
    episode_concept=episode_concept_4
)

In [35]:
# Access and print all scene block evaluations for Episode 4 (Baseline version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (Baseline) {'=' * 20}\n")
    print(baseline_evaluations_ep4[f"scene_block_{i}"])



- Coherence: 9 – The scenes flow logically from one to the next, maintaining a consistent narrative. The characters' actions and dialogue are consistent with their established personalities and the situation they're in. The only minor issue is the repetition of the key drop, which could be seen as a bit redundant.
- Relevance: 10 – The scenes are highly relevant to the episode concept, character arcs, and prior developments. They establish the main characters, their relationships, and the setting effectively. They also set up the main conflict of the episode: Samantha's decision to drop out of business school and join her father's locksmith shop.
- Interestingness: 8 – The scenes are engaging and narratively dynamic, with a good mix of character development, conflict, and humor. However, they could benefit from a bit more originality, as the "business school dropout" and "ex-con starting a new life" tropes are fairly common.
- Humor: 7 – The comedy is well-timed and character

In [36]:
# Extract scores into a DataFrame for Baseline Episode 4
baseline_scores_ep4_df = extract_evaluation_scores(
    baseline_evaluations_ep4,
    episode_number=4,
    version_label="Baseline"
)

display(baseline_scores_ep4_df)

⚠️ Skipped unmatched line in scene_block_5: - Overall Quality: 8.5 – The scenes are well-structured, with a consistent tone and a good fit for the sitcom genre. The writing is polished, with clear, concise dialogue and effective use of stage directions. However, the scenes could benefit from more originality and subtlety in the humor.


Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,4,Baseline,scene_block_1,Coherence,9,The scenes flow logically from one to the next...
1,4,Baseline,scene_block_1,Relevance,10,The scenes are highly relevant to the episode ...
2,4,Baseline,scene_block_1,Interestingness,8,The scenes are engaging and narratively dynami...
3,4,Baseline,scene_block_1,Humor,7,"The comedy is well-timed and character-driven,..."
4,4,Baseline,scene_block_1,Overall Quality,8,"The structure, tone, and genre fit are all sol..."
5,4,Baseline,scene_block_2,Coherence,9,The scenes flow logically and maintain interna...
6,4,Baseline,scene_block_2,Relevance,10,"The scenes support the episode concept, charac..."
7,4,Baseline,scene_block_2,Interestingness,8,"The scenes are original and engaging, with a g..."
8,4,Baseline,scene_block_2,Humor,8,"The comedy is well-timed and character-driven,..."
9,4,Baseline,scene_block_2,Overall Quality,9,"The structure, tone, and genre fit are excelle..."
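
The warning printed for `scene_block_5` above suggests that the score parser rejected the line because its score (`8.5`) is a decimal rather than an integer. The internals of `extract_evaluation_scores` are not shown in this notebook, so the following is only a sketch of how a line parser could tolerate decimal scores. The names `SCORE_LINE_RE` and `parse_score_line` are hypothetical and are not part of `eval_utils`.

```python
import re

# Hypothetical pattern for lines shaped like "- Criterion: score – justification".
# Assumption: the separator is a hyphen, en dash, or em dash; the score may be
# an integer ("8") or a decimal ("8.5").
SCORE_LINE_RE = re.compile(
    r"^\s*-\s*(?P<criterion>[A-Za-z ]+?):\s*"
    r"(?P<score>\d+(?:\.\d+)?)\s*[-\u2013\u2014]\s*"
    r"(?P<justification>.+)$"
)


def parse_score_line(line):
    """Return a dict for a rubric line, or None if the line does not match."""
    match = SCORE_LINE_RE.match(line)
    if match is None:
        return None
    return {
        "criterion": match.group("criterion").strip(),
        "score": float(match.group("score")),  # float keeps 8.5 intact
        "justification": match.group("justification").strip(),
    }


# Example: a decimal-scored line in the same shape as the skipped one
print(parse_score_line("- Overall Quality: 8.5 \u2013 The scenes are well-structured."))
```

Storing scores as floats keeps half-point ratings in the averages instead of silently dropping the row. Whether half points should be allowed at all is a rubric decision; an alternative is to tighten the evaluation prompt so that the model returns integers only.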


### Episode 5

In [37]:
# Evaluate Baseline Episode 5 using scene descriptions and generated script blocks
baseline_evaluations_ep5 = evaluate_episode_blocks(
    client=client,
    episode_num=5,
    script_blocks=baseline_script_blocks[5],
    description_blocks=scene_description_blocks[5],
    episode_concept=episode_concept_5
)

In [38]:
# Access and print all scene block evaluations for Episode 5 (Baseline version)
for i in range(1, 6):
    print(f"\n{'=' * 20} Scene Block {i} Evaluation (Baseline) {'=' * 20}\n")
    print(baseline_evaluations_ep5[f"scene_block_{i}"])



- Coherence: 9 – The scenes flow logically from one to the next, maintaining a consistent narrative. Frankie's journey from prison to his new life as a locksmith is clear and easy to follow. The only minor issue is the sudden appearance of Len, the parole officer, which could have been foreshadowed for smoother transition.

- Relevance: 8 – The scenes support the episode concept and character arcs well. We see Frankie's transition from prison to freedom, his introduction to his new life as a locksmith, and his first interactions with his parole officer. However, the estranged daughter, Ellie, is not introduced or mentioned in these scenes, which is a missed opportunity for character development.

- Interestingness: 7 – The scenes are engaging and narratively dynamic, with Frankie's journey from prison to locksmith shop providing a compelling story. However, the scenes could benefit from more originality and unexpected twists to keep the audience guessing.

- Humor: 7 – The

In [39]:
# Extract scores into a DataFrame for Baseline Episode 5
baseline_scores_ep5_df = extract_evaluation_scores(
    baseline_evaluations_ep5,
    episode_number=5,
    version_label="Baseline"
)

display(baseline_scores_ep5_df)

Unnamed: 0,episode,version,scene_block,criterion,score,justification
0,5,Baseline,scene_block_1,Coherence,9,The scenes flow logically from one to the next...
1,5,Baseline,scene_block_1,Relevance,8,The scenes support the episode concept and cha...
2,5,Baseline,scene_block_1,Interestingness,7,The scenes are engaging and narratively dynami...
3,5,Baseline,scene_block_1,Humor,7,"The humor is character-driven and well-timed, ..."
4,5,Baseline,scene_block_1,Overall Quality,8,"The structure is solid, the tone is consistent..."
5,5,Baseline,scene_block_2,Coherence,9,The scenes flow logically from one to the next...
6,5,Baseline,scene_block_2,Relevance,10,The scenes are highly relevant to the episode ...
7,5,Baseline,scene_block_2,Interestingness,8,The scenes are engaging and narratively dynami...
8,5,Baseline,scene_block_2,Humor,7,"The humor is character-driven and well-timed, ..."
9,5,Baseline,scene_block_2,Overall Quality,8,"The scenes are well-structured, the tone is co..."


## Average Scores

Below, we average the scores for each criterion over every scored scene block in all five episodes, grouped by version (ReAct vs. Baseline). The resulting table compares the two script-generation strategies on coherence, relevance, interestingness, humor, and overall quality.
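
Because the extraction step can skip lines it cannot match (as the warning for Baseline Episode 4 shows), the per-criterion means may rest on slightly different numbers of scores. One way to make this visible is to report the count next to the mean. The sketch below uses toy rows that are not results from this notebook; `toy_eval_df` and `summary_df` are hypothetical names.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT scores from this notebook.
toy_eval_df = pd.DataFrame({
    "version": ["Baseline", "Baseline", "Baseline",
                "ReAct", "ReAct", "ReAct", "ReAct"],
    "criterion": ["Humor", "Humor", "Overall Quality",
                  "Humor", "Humor", "Overall Quality", "Overall Quality"],
    "score": [7, 8, 8, 8, 7, 9, 8],
})

# Report the mean together with the number of scores behind it,
# so that a skipped row shows up as a smaller count.
summary_df = (
    toy_eval_df
    .groupby(["version", "criterion"])["score"]
    .agg(["mean", "count"])
    .round(2)
    .unstack("version")
)

print(summary_df)
```

In the toy data, Baseline has only one "Overall Quality" score while ReAct has two, which is the kind of imbalance a skipped line would produce in `full_eval_df`.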

In [42]:
# Combine all ReAct episode score DataFrames
react_all_df = pd.concat([
    react_scores_ep1_df,
    react_scores_ep2_df,
    react_scores_ep3_df,
    react_scores_ep4_df,
    react_scores_ep5_df
], ignore_index=True)

# Combine all Baseline episode score DataFrames
baseline_all_df = pd.concat([
    baseline_scores_ep1_df,
    baseline_scores_ep2_df,
    baseline_scores_ep3_df,
    baseline_scores_ep4_df,
    baseline_scores_ep5_df
], ignore_index=True)

# Combine both into a single full evaluation DataFrame
full_eval_df = pd.concat([react_all_df, baseline_all_df], ignore_index=True)

# Group by version and criterion, then compute average score
average_scores_df = (
    full_eval_df
    .groupby(["version", "criterion"])["score"]
    .mean()
    .round(2)
    .reset_index()
    .pivot(index="criterion", columns="version", values="score")
    .reset_index()
)

# Display the final comparison table
display(average_scores_df)


version,criterion,Baseline,ReAct
0,Coherence,8.96,7.76
1,Humor,7.68,7.92
2,Interestingness,7.88,7.56
3,Overall Quality,8.46,7.88
4,Relevance,9.6,8.28
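
To make the direction and size of each gap explicit, the table above can be extended with a difference column. The sketch below rebuilds the table from the displayed values rather than from `full_eval_df`, so that it runs on its own; the column name `delta` is a hypothetical choice.

```python
import pandas as pd

# Values copied from the comparison table displayed above.
average_scores_df = pd.DataFrame({
    "criterion": ["Coherence", "Humor", "Interestingness",
                  "Overall Quality", "Relevance"],
    "Baseline": [8.96, 7.68, 7.88, 8.46, 9.60],
    "ReAct": [7.76, 7.92, 7.56, 7.88, 8.28],
})

# Positive delta: ReAct scored higher; negative delta: Baseline scored higher.
average_scores_df["delta"] = (
    average_scores_df["ReAct"] - average_scores_df["Baseline"]
).round(2)

# Sort so that the largest Baseline advantage appears first
print(average_scores_df.sort_values("delta").to_string(index=False))
```

On these averages, Humor is the only criterion where ReAct scores higher (+0.24); the largest Baseline advantages are in Relevance (-1.32) and Coherence (-1.20).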
