## File Set-Up
- Adds the repository root directory to `sys.path` to enable imports from the `src` package
- Determines the repository root by checking if the current directory is named "notebooks" (if so, uses parent; otherwise uses current directory)
- Imports standard path constants from `src.paths` module (PROJECT_ROOT, RAW_DIR, PROCESSED_DIR, REFERENCE_DIR, FIGURES_DIR)
- Defines notebook-specific input file paths for transcript summaries and comment analysis CSVs
- Defines notebook-specific output file paths for theme dictionary JSON, checkpoint JSON, and thematic analysis CSV
- Prints confirmation messages showing resolved paths for verification

## Configuration / Setup
- **Repository structure assumption**: notebook is either in a `notebooks/` subdirectory or at the repository root
- **Path constants** (imported from `src.paths`):
    - `PROJECT_ROOT`: repository root directory
    - `RAW_DIR`: raw data directory
    - `PROCESSED_DIR`: processed data directory
    - `REFERENCE_DIR`: reference data directory
    - `FIGURES_DIR`: figures output directory
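
The `src/paths.py` module is not shown in this notebook. As a rough orientation, it might resemble the sketch below. Only the five constant names come from the import in the setup cell; the `build_paths` helper and the sub-directory names (`data/raw`, `data/processed`, `data/reference`, `reports/figures`) are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict


def build_paths(project_root: Path) -> Dict[str, Path]:
    """Derive the standard directories from a repository root.

    The sub-directory names are hypothetical; adjust them to the real layout.
    """
    data_dir = project_root / "data"
    return {
        "PROJECT_ROOT": project_root,
        "RAW_DIR": data_dir / "raw",
        "PROCESSED_DIR": data_dir / "processed",
        "REFERENCE_DIR": data_dir / "reference",
        "FIGURES_DIR": project_root / "reports" / "figures",
    }


# A real src/paths.py would more likely derive the root from the module
# location (for example Path(__file__).resolve().parents[1]); a fixed example
# path is used here so the sketch runs anywhere.
paths = build_paths(Path("/tmp/example_repo"))
print(paths["PROCESSED_DIR"])
```

Deriving every directory from one root keeps the notebook portable: only the root has to resolve correctly for all inputs and outputs to follow.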

## Inputs
- **`PROCESSED_DIR / "GPT_5_Mini_Transcripts_SummaryV2.csv"`**: transcript summary data (input)
- **`PROCESSED_DIR / "GPT_5_Mini_Comment_Analysis.csv"`**: comment analysis data (input)
- **`src.paths` module**: the file `src/paths.py`, inside a `src/` package at the repository root, must exist to provide the path constants

## Outputs
- **`REFERENCE_DIR / "Theme_Dictionary.json"`**: target path for the dynamic theme dictionary (defined but not created in this cell; the classification cell creates it if missing and extends it during processing)
- **`PROCESSED_DIR / "Theme_Checkpoint.json"`**: target path for processing checkpoint (defined but not created in this cell)
- **`PROCESSED_DIR / "Thematic_Analysis_Output.csv"`**: target path for thematic analysis results (defined but not created in this cell)
- **Console output**: prints paths for verification

## Notes / Assumptions
- Requires a `src/` package at the repository root with a `paths.py` module exporting the standard directory constants
- Path resolution logic assumes either: (1) notebook is in `<repo>/notebooks/` or (2) notebook is at `<repo>/`
- This is a pure configuration cell; it performs no data processing or file I/O beyond path setup
- All subsequent cells in the notebook depend on these path variable definitions
- **Theme Dictionary Behavior**: The dictionary is managed dynamically by the classification cell: created empty if missing, otherwise loaded and extended during processing
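
The two supported layouts, and one that is not handled, can be checked with a small pure function. This is a sketch of the same rule the setup cell applies; the function name `resolve_repo_root` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_repo_root(nb_dir: PurePosixPath) -> PurePosixPath:
    """Return the parent if the directory is named 'notebooks', else the directory itself."""
    return nb_dir.parent if nb_dir.name == "notebooks" else nb_dir


# Case 1: notebook lives in <repo>/notebooks/
print(resolve_repo_root(PurePosixPath("/work/repo/notebooks")))  # /work/repo
# Case 2: notebook lives at <repo>/
print(resolve_repo_root(PurePosixPath("/work/repo")))  # /work/repo
# Not handled: a nested folder resolves to itself, not to the repository root
print(resolve_repo_root(PurePosixPath("/work/repo/notebooks/drafts")))  # /work/repo/notebooks/drafts
```

In the third case the `src` package would not be on `sys.path`, so the `from src.paths import ...` line would likely fail unless the package is importable some other way.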

In [None]:
# === Setup: Ensure repo root is on sys.path for imports ===
import sys
from pathlib import Path

nb_dir = Path.cwd().resolve()
repo_root = nb_dir.parent if nb_dir.name == "notebooks" else nb_dir
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

# === Import shared paths ===
from src.paths import PROJECT_ROOT, RAW_DIR, PROCESSED_DIR, REFERENCE_DIR, FIGURES_DIR

# === Notebook-specific file paths ===
# Input files
TRANSCRIPT_SUMMARY_CSV = PROCESSED_DIR / "GPT_5_Mini_Transcripts_SummaryV2.csv"
COMMENT_ANALYSIS_CSV = PROCESSED_DIR / "GPT_5_Mini_Comment_Analysis.csv"

# Output files
THEME_DICT_PATH = REFERENCE_DIR / "Theme_Dictionary.json"
CHECKPOINT_PATH = PROCESSED_DIR / "Theme_Checkpoint.json"
THEMATIC_OUTPUT_CSV = PROCESSED_DIR / "Thematic_Analysis_Output.csv"

print(f"Project root: {PROJECT_ROOT}")
print(f"Transcript summary input: {TRANSCRIPT_SUMMARY_CSV}")
print(f"Theme dictionary output: {THEME_DICT_PATH}")
print(f"Thematic analysis output: {THEMATIC_OUTPUT_CSV}")

## 01_Create Theme Dictionary for Comments using Transcript Summary
- Loads environment variables and initializes OpenAI client with `gpt-5-mini` model
- Reads transcript summaries and comment analysis CSVs from `PROCESSED_DIR`
- **Dynamically manages theme dictionary**: creates new file if missing, or loads existing themes to reuse
- Groups comments by VideoID and processes them in dynamically sized batches (base size of 5, 10, or 20 comments depending on theme count, halved further when the estimated prompt is too large)
- For each batch, constructs a prompt instructing the model to:
  - **Reuse existing themes** if the model has ≥65% certainty the theme applies
  - **Create new themes** only when no existing theme fits with sufficient confidence
  - Avoid being too specific when creating themes to encourage reuse
  - Target staying under 100 total themes by preferring reuse
- Sends batch to OpenAI API with retry logic (up to 6 attempts for connection/timeout errors, with linearly increasing waits of 4, 8, 12, 16, and 20 seconds between attempts)
- Marks comments as processed in checkpoint **before** API call to prevent re-processing on failure
- Parses JSON response containing theme classifications and any new themes created
- **Updates theme dictionary** with any new themes returned by the model
- Appends classification results to output CSV with columns: VideoID, CommentID, ThemesJSON
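
The response handling described above can be illustrated with a made-up reply. The sample JSON, theme names, and the helper `find_percentage_mismatches` are hypothetical; only the key names (`new_themes`, `classifications`, `themes`, `percentage`, `comment_key`) follow the prompt's output format.

```python
import json
from typing import Any, Dict, List

# Hypothetical model reply in the shape requested by the prompt.
sample_response = json.dumps({
    "new_themes": [
        {"name": "Wiring Safety", "description": "Concerns about safe wiring practice"}
    ],
    "classifications": [
        {
            "video_id": "vid001",
            "comment_key": "vid001::0",
            "comment_index": 0,
            "comment_id": "c-abc",
            "themes": [
                {"theme": "Wiring Safety", "percentage": 70},
                {"theme": "Tool Recommendations", "percentage": 30},
            ],
        },
        {
            "video_id": "vid001",
            "comment_key": "vid001::1",
            "comment_index": 1,
            "comment_id": None,
            "themes": [{"theme": "Wiring Safety", "percentage": 90}],
        },
    ],
})


def find_percentage_mismatches(raw: str) -> List[str]:
    """Return the comment keys whose theme percentages do not sum to 100."""
    data: Dict[str, Any] = json.loads(raw)
    if "classifications" not in data:
        raise ValueError("response is missing 'classifications'")
    bad = []
    for c in data["classifications"]:
        total = sum(t.get("percentage", 0) for t in c.get("themes", []))
        if total != 100:
            bad.append(c.get("comment_key"))
    return bad


mismatches = find_percentage_mismatches(sample_response)
print(mismatches)  # ['vid001::1']
```

The notebook only prints a warning for such rows and still writes them to the CSV, so downstream analysis may want a similar check.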

## Configuration / Setup
- **`OPENAI_API_KEY`**: must be set in `.env` file for API access
- **`MODEL_NAME`**: set to `"gpt-5-mini"`
- **`BATCH_SIZE`**: base default of 20, adjusted by theme count and re-checked before every batch (5 for >60 themes, 10 for 41-60 themes, 10 for <10 themes)
- **`THEME_REUSE_THRESHOLD`**: 65% certainty required to reuse an existing theme
- **`MAX_THEMES_TARGET`**: soft limit of 100 themes to encourage reuse
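- **Token estimation limit**: batches are dynamically resized to keep the estimated prompt body (theme list, summary, and comments; the system instructions are not counted) under ~6000 tokens, using a rough 4-characters-per-token estimate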
- **Retry settings**: up to 6 attempts per batch; the wait before the next attempt is 4 seconds × the attempt number
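
As a compact restatement of these settings, the sketch below reproduces the start-of-run batch-size thresholds and the retry wait schedule. The function names `base_batch_size` and `retry_waits` are hypothetical helpers, not part of the notebook.

```python
from typing import List


def base_batch_size(theme_count: int, default: int = 20) -> int:
    """Pick the base batch size from the number of themes in the dictionary."""
    if theme_count > 60:
        return 5
    if theme_count > 40:
        return 10
    if theme_count < 10:
        # Smaller batches while the dictionary is still being built up
        return 10
    return default


def retry_waits(max_retries: int = 6, backoff_base: int = 4) -> List[int]:
    """Seconds waited after each failed attempt; the final failure re-raises instead of waiting."""
    return [backoff_base * attempt for attempt in range(1, max_retries)]


print([base_batch_size(n) for n in (0, 10, 40, 41, 60, 61)])  # [10, 20, 20, 10, 10, 5]
print(retry_waits())  # [4, 8, 12, 16, 20]
```

With these defaults a batch that never succeeds spends 60 seconds waiting in total, on top of the per-request timeout.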

## Inputs
- **`PROCESSED_DIR / "GPT_5_Mini_Transcripts_SummaryV2.csv"`**: contains VideoID and Summary columns for video context
- **`PROCESSED_DIR / "GPT_5_Mini_Comment_Analysis.csv"`**: contains VideoID, CommentID, and Topic columns for comments to classify (`CommentIndex` is not read from the file; it is generated from the row order at load time)
- **`REFERENCE_DIR / "Theme_Dictionary.json"`**: dynamic theme dictionary (created if missing, updated with new themes)
- **`PROCESSED_DIR / "Theme_Checkpoint.json"`**: tracks which comments have been processed (loaded if exists)
- **OpenAI Chat Completions API**: called with JSON response format for each batch
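
How the checkpoint filters out already-handled comments can be sketched as follows. The checkpoint contents and the `(VideoID, CommentIndex)` pairs are invented sample data; the `processed_comment_keys` field and the `VideoID::CommentIndex` key format match the notebook code.

```python
import json
from typing import List, Set, Tuple

# Hypothetical checkpoint file contents.
checkpoint_json = '{"processed_comment_keys": ["vid001::0", "vid001::2"]}'
processed: Set[str] = set(json.loads(checkpoint_json).get("processed_comment_keys", []))

# Hypothetical (VideoID, CommentIndex) pairs from the comment table.
rows: List[Tuple[str, int]] = [("vid001", 0), ("vid001", 1), ("vid001", 2), ("vid002", 3)]

# Keep only rows whose key is not yet in the checkpoint.
pending = [(v, i) for v, i in rows if f"{v}::{i}" not in processed]
print(pending)  # [('vid001', 1), ('vid002', 3)]
```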

## Outputs
- **`PROCESSED_DIR / "Thematic_Analysis_Output.csv"`**: CSV with columns VideoID, CommentID, ThemesJSON (where ThemesJSON is a JSON string mapping theme names to percentages)
- **`PROCESSED_DIR / "Theme_Checkpoint.json"`**: saved before each batch's API call with the processed comment keys (`"VideoID::CommentIndex"`), and saved again on interrupt or error
- **`REFERENCE_DIR / "Theme_Dictionary.json"`**: dynamically updated with new themes as they are created during classification
- **Console output**: progress messages, batch token estimates, new themes created, warnings for percentage mismatches
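
Because `ThemesJSON` is a JSON string inside a CSV cell, downstream code has to decode it per row. A minimal sketch using only the standard library is shown below; the two sample rows and theme names are made up for illustration.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical excerpt of Thematic_Analysis_Output.csv (inner quotes are doubled by the csv writer).
sample_csv = io.StringIO(
    'VideoID,CommentID,ThemesJSON\n'
    'vid001,c-abc,"{""Wiring Safety"": 70, ""Tool Recommendations"": 30}"\n'
    'vid001,c-def,"{""Wiring Safety"": 100}"\n'
)

# Sum the percentage weight assigned to each theme across all comments.
theme_weight = Counter()
for row in csv.DictReader(sample_csv):
    for theme, pct in json.loads(row["ThemesJSON"]).items():
        theme_weight[theme] += pct

print(dict(theme_weight))  # {'Wiring Safety': 170, 'Tool Recommendations': 30}
```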

## Notes / Assumptions
- Requires `OPENAI_API_KEY` in environment and `.env` file loaded via `python-dotenv`
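- Assumes `COMMENT_ANALYSIS_CSV` has columns: `VideoID`, `CommentID`, `Topic` (singular, not plural); `CommentIndex` is derived from the DataFrame row index after loading, so checkpoint keys stay valid only while the CSV row order is unchanged
- Assumes `TRANSCRIPT_SUMMARY_CSV` has columns: `VideoID`, `Summary`; comments whose `VideoID` has no summary row are skipped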
- Theme dictionary structure: `{"themes": [{"name": "...", "description": "..."}, ...], "count": int}`
- **Theme reuse logic**: Model must have ≥65% certainty to use an existing theme; otherwise creates a new one
- **Theme creation guidance**: New themes should be broad enough for reuse, not overly specific
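- Checkpoint behavior: comments are marked processed **before** the API call to prevent duplicate calls on retry/crash; as a consequence, a batch whose call ultimately fails is skipped on the next run and gets no row in the output CSV unless its keys are removed from the checkpoint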
- Video summary is truncated to first 1000 characters to reduce token usage
- Batch size dynamically halves if estimated tokens exceed 6000 (minimum batch size = 1)
- Expects model to return JSON with `"classifications"` array and optional `"new_themes"` array
- All file I/O uses `pathlib.Path` objects defined in setup cell
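
The token-based batch shrinking can be sketched in isolation. The helper `fit_batch_size` and the sample comments are hypothetical; the 4-characters-per-token estimate, the 6000-token limit, and the halving rule mirror the notebook code.

```python
import json
from typing import Any, Dict, List


def approx_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token."""
    return len(text) // 4


def fit_batch_size(
    fixed_text: str,
    comments: List[Dict[str, Any]],
    start_size: int,
    limit: int = 6000,
) -> int:
    """Halve the batch size until the estimated prompt body fits, stopping at 1."""
    size = min(start_size, len(comments))
    while size > 1:
        body = fixed_text + json.dumps(comments[:size], separators=(",", ":"))
        if approx_tokens(body) <= limit:
            break
        size = max(1, size // 2)
    return size


# fixed_text stands in for the theme list plus the truncated video summary.
fixed_text = "x" * 2000
long_comments = [{"comment_key": f"v::{i}", "topics": "x" * 4000} for i in range(20)]
short_comments = [{"comment_key": f"v::{i}", "topics": "short"} for i in range(20)]

print(fit_batch_size(fixed_text, long_comments, start_size=20))   # 5
print(fit_batch_size(fixed_text, short_comments, start_size=20))  # 20
```

Halving converges in a few steps, but a single very long comment can still exceed the limit, since a batch of one is always accepted.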

In [None]:
import os
import json
import csv
from pathlib import Path
from typing import Dict, Any, List, Set

import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI
from openai import APIConnectionError, APITimeoutError
import httpx
import time


# =========================
# CONFIG (using portable paths from configuration cell)
# =========================

MODEL_NAME = "gpt-5-mini"
BATCH_SIZE = 20  # base default; dynamic per theme count
THEME_REUSE_THRESHOLD = 65  # minimum certainty % to reuse an existing theme
MAX_THEMES_TARGET = 100  # soft limit to encourage theme reuse


# =========================
# UTILITIES
# =========================

def load_theme_dictionary(path: Path) -> Dict[str, Any]:
    """
    Load existing theme dictionary or create an empty one if it doesn't exist.
    Returns dict with 'themes' list and 'count' integer.
    """
    if path.exists():
        with path.open("r", encoding="utf-8") as f:
            data = json.load(f)
            # Ensure count is present
            if "count" not in data:
                data["count"] = len(data.get("themes", []))
            print(f"Loaded existing theme dictionary with {data['count']} themes")
            return data
    
    # Create new empty theme dictionary
    theme_dict = {"themes": [], "count": 0}
    with path.open("w", encoding="utf-8") as f:
        json.dump(theme_dict, f, indent=2)
    print("Created new empty theme dictionary")
    return theme_dict


def save_theme_dictionary(path: Path, theme_dict: Dict[str, Any]) -> None:
    """Save the theme dictionary to disk."""
    theme_dict["count"] = len(theme_dict.get("themes", []))
    with path.open("w", encoding="utf-8") as f:
        json.dump(theme_dict, f, indent=2)


def add_new_themes(theme_dict: Dict[str, Any], new_themes: List[Dict[str, str]], path: Path) -> None:
    """
    Add new themes to the dictionary if they don't already exist.
    Saves to disk after adding.
    """
    existing_names = {t["name"].lower() for t in theme_dict.get("themes", [])}
    added_count = 0
    
    for theme in new_themes:
        theme_name = theme.get("name", "").strip()
        if not theme_name:
            continue
        if theme_name.lower() not in existing_names:
            theme_dict["themes"].append({
                "name": theme_name,
                "description": theme.get("description", "")
            })
            existing_names.add(theme_name.lower())
            added_count += 1
            print(f"  + New theme added: '{theme_name}'")
    
    if added_count > 0:
        save_theme_dictionary(path, theme_dict)
        print(f"  Total themes now: {len(theme_dict['themes'])}")


def load_checkpoint(path: Path) -> Set[str]:
    """Return the set of processed comment keys, or an empty set if no checkpoint exists."""
    if not path.exists():
        return set()
    with path.open("r", encoding="utf-8") as f:
        data = json.load(f)
    return set(data.get("processed_comment_keys", []))


def save_checkpoint(path: Path, processed_keys: Set[str]) -> None:
    """Write the processed comment keys to disk as a sorted JSON list."""
    data = {"processed_comment_keys": sorted(processed_keys)}
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)


def init_output_csv(path: Path) -> None:
    """Output only VideoID, CommentID, ThemeName→% JSON"""
    if path.exists():
        return
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["VideoID", "CommentID", "ThemesJSON"])


def append_classifications_to_csv(
    path: Path,
    classifications: List[Dict[str, Any]],
) -> None:
    """
    Writes rows:
    - VideoID
    - CommentID
    - ThemesJSON (mapping theme_name → percentage)
    """
    with path.open("a", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)

        for c in classifications:
            theme_map = {}
            for t in c.get("themes", []):
                theme_name = t["theme"]
                pct = t["percentage"]
                theme_map[theme_name] = pct

            writer.writerow([
                c.get("video_id"),
                c.get("comment_id"),
                json.dumps(theme_map)
            ])

        # Force flush to disk so long runs don't lose buffered rows
        f.flush()
        os.fsync(f.fileno())


# =========================
# TOKEN APPROX + MESSAGE BUILD
# =========================

def approx_tokens(s: str) -> int:
    # Rough ~4 chars per token rule of thumb
    return len(s) // 4


def build_messages_for_batch(
    theme_payload: Dict[str, Any],
    video_id: Any,
    summary: str,
    comments_payload: List[Dict[str, Any]],
    theme_reuse_threshold: int,
    max_themes_target: int
) -> List[Dict[str, str]]:
    """
    Build messages for dynamic theme classification.
    theme_payload: {"themes": [{"name": "...", "description": "..."}, ...], "count": int}
    comments_payload: list of dicts with "comment_key", "comment_index", "comment_id", "topics"
    """

    current_theme_count = theme_payload.get("count", 0)
    themes_remaining = max(0, max_themes_target - current_theme_count)

    system_instructions = f"""You are an electrical engineering expert performing thematic classification of YouTube comments.

THEME MANAGEMENT RULES:
1. You are given the current theme dictionary with existing themes (may be empty initially).
2. For each comment, determine which theme(s) apply.
3. REUSE an existing theme if you have at least {theme_reuse_threshold}% certainty it applies to the comment.
4. CREATE a new theme ONLY when no existing theme fits with {theme_reuse_threshold}%+ certainty.
5. When creating new themes:
   - Make them BROAD enough to be reused for similar future comments
   - Avoid overly specific themes that only apply to one comment
   - Use clear, descriptive names related to electrical engineering/construction
   - Include a brief description of what the theme covers
6. Current theme count: {current_theme_count}. Target maximum: {max_themes_target}. 
   Approximately {themes_remaining} new themes can be created before reaching the soft limit.
   STRONGLY prefer reusing existing themes when possible.

CLASSIFICATION RULES:
1. Each comment may map to one or more themes (existing or newly created).
2. For each comment, assign INTEGER percentages to themes that sum to EXACTLY 100.
3. Be consistent: similar comments should be classified to the same themes.

OUTPUT FORMAT (STRICT JSON):
{{
  "new_themes": [
    {{"name": "Theme Name Here", "description": "Brief description of what this theme covers"}}
  ],
  "classifications": [
    {{
      "video_id": "...",
      "comment_key": "...",
      "comment_index": <int>,
      "comment_id": "... or null",
      "themes": [
        {{"theme": "Existing or New Theme Name", "percentage": 60}},
        {{"theme": "Another Theme", "percentage": 40}}
      ]
    }}
  ]
}}

IMPORTANT:
- "new_themes" array contains ONLY themes you are creating (not existing ones)
- "new_themes" can be empty [] if all comments fit existing themes
- Theme names in "classifications" must match EXACTLY (existing theme names or new theme names)
- No extra keys. No explanations. Strict JSON only."""

    # Build theme list for the prompt
    if theme_payload.get("themes"):
        theme_list_str = json.dumps(theme_payload["themes"], ensure_ascii=False, indent=2)
    else:
        theme_list_str = "[] (No themes yet - you will create the initial themes)"

    return [
        {"role": "system", "content": system_instructions},
        {
            "role": "user",
            "content": f"CURRENT THEME DICTIONARY ({current_theme_count} themes):\n{theme_list_str}"
        },
        {
            "role": "user",
            "content": f"VideoID: {video_id}\nVideo Summary (for context):\n{summary}"
        },
        {
            "role": "user",
            # Use the configured threshold rather than a hard-coded 65
            "content": f"Comments to classify (use existing themes when ≥{theme_reuse_threshold}% confident, create new themes otherwise):\n" +
                       json.dumps(comments_payload, ensure_ascii=False, indent=2)
        }
    ]


# =========================
# OPENAI CALL (with retries, dynamic theme handling)
# =========================

def call_openai_for_batch(
    client: OpenAI,
    theme_payload: Dict[str, Any],
    video_id: Any,
    summary: str,
    comments_payload: List[Dict[str, Any]],
    theme_reuse_threshold: int,
    max_themes_target: int
) -> Dict[str, Any]:

    messages = build_messages_for_batch(
        theme_payload, video_id, summary, comments_payload,
        theme_reuse_threshold, max_themes_target
    )

    max_retries = 6
    backoff_base = 4

    for attempt in range(1, max_retries + 1):
        try:
            response = client.chat.completions.create(
                model=MODEL_NAME,
                messages=messages,
                response_format={"type": "json_object"}
            )
            break

        except (APIConnectionError, APITimeoutError, httpx.ReadTimeout) as e:
            print(f"[Attempt {attempt}/{max_retries}] Timeout/Connection Error: {e}")
            if attempt == max_retries:
                raise
            wait_time = backoff_base * attempt
            print(f"Retrying in {wait_time} seconds...\n")
            time.sleep(wait_time)

    content = response.choices[0].message.content

    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        raise RuntimeError(f"Model returned invalid JSON:\n{content}")

    if "classifications" not in data:
        raise RuntimeError(f"Model JSON missing 'classifications':\n{json.dumps(data, indent=2)}")

    # Ensure new_themes exists (may be empty)
    if "new_themes" not in data:
        data["new_themes"] = []

    return data


# =========================
# MAIN EXECUTION
# =========================

def main():
    load_dotenv()
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY missing in .env")

    # Pass the validated key explicitly instead of relying on the environment lookup
    client = OpenAI(api_key=api_key, timeout=60)

    transcripts_df = pd.read_csv(TRANSCRIPT_SUMMARY_CSV)
    comments_df = pd.read_csv(COMMENT_ANALYSIS_CSV)

    comments_df = comments_df.reset_index().rename(columns={"index": "CommentIndex"})

    # Load or create theme dictionary
    theme_dict = load_theme_dictionary(THEME_DICT_PATH)

    # Dynamic batch base size depending on how many themes we have
    theme_count = len(theme_dict.get("themes", []))
    if theme_count > 60:
        base_batch_size = 5
    elif theme_count > 40:
        base_batch_size = 10
    elif theme_count < 10:
        # Smaller batches at start to build up themes gradually
        base_batch_size = 10
    else:
        base_batch_size = BATCH_SIZE

    print(f"Starting with {theme_count} themes. Using base batch size = {base_batch_size}")
    print(f"Theme reuse threshold: {THEME_REUSE_THRESHOLD}%")
    print(f"Max themes target: {MAX_THEMES_TARGET}")

    processed_keys = load_checkpoint(CHECKPOINT_PATH)
    init_output_csv(THEMATIC_OUTPUT_CSV)

    video_summary_map = transcripts_df.set_index("VideoID")["Summary"].to_dict()
    grouped = comments_df.groupby("VideoID")

    try:
        for video_id, group in grouped:
            if video_id not in video_summary_map:
                continue

            # Truncate summary to reduce token usage
            raw_summary = str(video_summary_map[video_id])
            summary = raw_summary[:1000]

            group = group.sort_values("CommentIndex")

            rows_to_process = []
            for _, row in group.iterrows():
                ck = f"{video_id}::{int(row['CommentIndex'])}"
                if ck not in processed_keys:
                    rows_to_process.append(row)

            if not rows_to_process:
                continue

            print(f"\nStarting VideoID={video_id}, {len(rows_to_process)} comments to process")

            # Pointer-based loop so we can adapt batch size dynamically per batch
            idx = 0
            total = len(rows_to_process)

            while idx < total:
                # Recalculate batch size based on current theme count
                current_theme_count = len(theme_dict.get("themes", []))
                if current_theme_count > 60:
                    cur_base_batch = 5
                elif current_theme_count > 40:
                    cur_base_batch = 10
                elif current_theme_count < 10:
                    cur_base_batch = 10
                else:
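                    # Use the configured default, not the start-of-run value,
                    # so a run that began with <10 themes can grow to BATCH_SIZE
                    cur_base_batch = BATCH_SIZE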

                cur_batch_size = min(cur_base_batch, total - idx)

                while True:
                    batch_rows = rows_to_process[idx: idx + cur_batch_size]

                    # Build comments payload for this tentative batch
                    comments_payload = []
                    for row in batch_rows:
                        comment_index = int(row["CommentIndex"])
                        comment_id = None
                        if "CommentID" in row and not pd.isna(row["CommentID"]):
                            comment_id = str(row["CommentID"])

                        topics = str(row["Topic"]) if "Topic" in row else ""
                        comments_payload.append({
                            "comment_key": f"{video_id}::{comment_index}",
                            "comment_index": comment_index,
                            "comment_id": comment_id,
                            "topics": topics
                        })

                    # Build theme payload for API call
                    theme_payload = {
                        "themes": theme_dict.get("themes", []),
                        "count": len(theme_dict.get("themes", []))
                    }

                    # Approximate token usage for this prompt
                    prompt_body = (
                        json.dumps(theme_payload, separators=(',', ':')) +
                        summary +
                        json.dumps(comments_payload, separators=(',', ':'))
                    )
                    token_est = approx_tokens(prompt_body)

                    if token_est <= 6000 or cur_batch_size == 1:
                        # Accept this batch size
                        break

                    # Too big: shrink batch size and try again
                    cur_batch_size = max(1, cur_batch_size // 2)

                # === Mark all keys in this batch as processed BEFORE the API call ===
                batch_keys = [cp["comment_key"] for cp in comments_payload]
                processed_keys.update(batch_keys)
                save_checkpoint(CHECKPOINT_PATH, processed_keys)

                print(f"Processing batch: idx={idx}, size={len(comments_payload)}, themes={theme_payload['count']}, tokens≈{token_est}")

                result = call_openai_for_batch(
                    client=client,
                    theme_payload=theme_payload,
                    video_id=video_id,
                    summary=summary,
                    comments_payload=comments_payload,
                    theme_reuse_threshold=THEME_REUSE_THRESHOLD,
                    max_themes_target=MAX_THEMES_TARGET
                )

                # Handle new themes
                new_themes = result.get("new_themes", [])
                if new_themes:
                    print(f"  Model created {len(new_themes)} new theme(s):")
                    add_new_themes(theme_dict, new_themes, THEME_DICT_PATH)

                classifications = result["classifications"]

                # Normalize / sanity-check classifications
                formatted_rows = []
                for c in classifications:
                    comment_key = c.get("comment_key")
                    comment_index = c.get("comment_index")
                    comment_id = c.get("comment_id", None)
                    themes = c.get("themes", [])

                    total_pct = sum(t.get("percentage", 0) for t in themes)
                    if total_pct != 100:
                        print(f"  WARNING: Percentages for comment_key={comment_key} sum to {total_pct}, not 100.")

                    formatted_rows.append({
                        "video_id": c.get("video_id", video_id),
                        "comment_key": comment_key,
                        "comment_index": comment_index,
                        "comment_id": comment_id,
                        "themes": themes
                    })

                append_classifications_to_csv(THEMATIC_OUTPUT_CSV, formatted_rows)

                idx += cur_batch_size

            print(f"Finished VideoID={video_id}")

        print("\n=== Processing Complete ===")
        print(f"Final theme count: {len(theme_dict.get('themes', []))}")

    except KeyboardInterrupt:
        print("\nStopping safely…")
        save_checkpoint(CHECKPOINT_PATH, processed_keys)
        save_theme_dictionary(THEME_DICT_PATH, theme_dict)
    except Exception:
        # Persist progress before re-raising any unexpected error
        save_checkpoint(CHECKPOINT_PATH, processed_keys)
        save_theme_dictionary(THEME_DICT_PATH, theme_dict)
        raise


if __name__ == "__main__":
    main()

## 02_Theme Dictionary Results

In [None]:
# Some code displaying the theme dictionary contents and counts