## File Set-Up
- Sets up the Python path to include the repository root, ensuring imports from `src/` work correctly regardless of where the notebook is run
- Detects whether the notebook is in a `notebooks/` subdirectory or the repo root, then resolves the correct `repo_root`
- Imports shared path constants from `src.paths` (PROJECT_ROOT, RAW_DIR, PROCESSED_DIR, REFERENCE_DIR, FIGURES_DIR)
- Defines notebook-specific input and output file paths using the imported directory constants
- Prints the resolved project root and key file paths for verification

## Configuration / Setup
- **`nb_dir`**: Current working directory of the notebook
- **`repo_root`**: Automatically detected repository root (parent of `notebooks/` if applicable)
- **Input file**: `TRANSCRIPTS_FILE` = `data/raw/transcripts.json`
- **Output files**:
    - `TRANSCRIPT_SUMMARY_CSV` = `data/processed/GPT_5_Mini_Transcripts_Summary.csv`
    - `TRANSCRIPT_SUMMARY_V2` = `data/processed/GPT_5_Mini_Transcripts_SummaryV2.csv`
    - `CHECKPOINT_FILE` = `data/processed/gpt_5_mini_checkpoint.json`

## Inputs
- **`src/paths.py`**: Module defining `PROJECT_ROOT`, `RAW_DIR`, `PROCESSED_DIR`, `REFERENCE_DIR`, `FIGURES_DIR`
- **Expected directory structure**: A `notebooks/` folder (optional) and a `src/` module at the repository root
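
`src/paths.py` itself is not shown in this notebook. The sketch below is one plausible shape for it, assuming the `data/raw` and `data/processed` layout implied by the file paths listed above. The helper name `derive_dirs` and the locations chosen for `REFERENCE_DIR` and `FIGURES_DIR` are hypothetical.

```python
from pathlib import Path

def derive_dirs(project_root):
    """Map a repository root to the shared directories (layout is assumed)."""
    data_dir = project_root / "data"
    return {
        "PROJECT_ROOT": project_root,
        "RAW_DIR": data_dir / "raw",              # matches data/raw/transcripts.json
        "PROCESSED_DIR": data_dir / "processed",  # matches data/processed/*.csv
        "REFERENCE_DIR": data_dir / "reference",  # assumed location
        "FIGURES_DIR": project_root / "figures",  # assumed location
    }

# Inside a real src/paths.py the root could instead be derived from the
# module's own location, e.g. Path(__file__).resolve().parents[1].
dirs = derive_dirs(Path.cwd())
print(dirs["RAW_DIR"] / "transcripts.json")
```

Deriving every directory from a single root keeps the notebook portable: only the root detection depends on where the code is run.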

## Outputs
- Console output showing:
    - Resolved project root path
    - Transcripts input file path
    - Processed output directory path
- No files are written by this cell; it only configures paths for downstream cells

## Notes / Assumptions
- Assumes `src/paths.py` exists and exports the required path constants
- Assumes the repository has either a flat structure or a `notebooks/` subdirectory
- Subsequent cells (1, 3, 5) depend on the path variables defined here (`TRANSCRIPTS_FILE`, `TRANSCRIPT_SUMMARY_CSV`, `TRANSCRIPT_SUMMARY_V2`, `CHECKPOINT_FILE`, `PROCESSED_DIR`)
- This is a setup cell and must be run before any data processing cells

In [None]:
# === Setup: Ensure repo root is on sys.path for imports ===
import sys
from pathlib import Path

nb_dir = Path.cwd().resolve()
repo_root = nb_dir.parent if nb_dir.name == "notebooks" else nb_dir
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

# === Import shared paths ===
from src.paths import PROJECT_ROOT, RAW_DIR, PROCESSED_DIR, REFERENCE_DIR, FIGURES_DIR

# === Notebook-specific file paths ===
# Input files
TRANSCRIPTS_FILE = RAW_DIR / "transcripts.json"

# Output files
TRANSCRIPT_SUMMARY_CSV = PROCESSED_DIR / "GPT_5_Mini_Transcripts_Summary.csv"
TRANSCRIPT_SUMMARY_V2 = PROCESSED_DIR / "GPT_5_Mini_Transcripts_SummaryV2.csv"
CHECKPOINT_FILE = PROCESSED_DIR / "gpt_5_mini_checkpoint.json"

print(f"Project root: {PROJECT_ROOT}")
print(f"Transcripts input: {TRANSCRIPTS_FILE}")
print(f"Processed output dir: {PROCESSED_DIR}")

## 01_Summarize_Transcripts
- Loads transcripts from `transcripts.json` and a summary CSV: the V2 output file if it already exists (to resume), otherwise the base CSV
- Uses OpenAI's GPT-5-mini model to generate single-paragraph summaries for each transcript
- Skips videos that already have a summary; videos with no transcript or fewer than 30 words are marked `No summary` without calling the API
- Saves progress incrementally after each summary to resume on interruption
- Implements a graceful shutdown handler (CTRL+C) that saves all progress before exiting
- Adds random 2–6 second delays between API calls to avoid rate limiting

## Configuration / Setup
- **`CSV_FILE`**: Input CSV file = `data/processed/GPT_5_Mini_Transcripts_Summary.csv`
- **`OUTPUT_FILE`**: Output CSV file = `data/processed/GPT_5_Mini_Transcripts_SummaryV2.csv`
- **`MODEL_NAME`**: OpenAI model = `"gpt-5-mini"`
- **`OPENAI_API_KEY`**: Environment variable loaded from `.env` file (required)
- **Prompt template**: Enforces a specific summary format via system message

## Inputs
- **`data/raw/transcripts.json`**: JSON file containing video transcripts with `VideoID` and `Transcript` fields
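- **`data/processed/GPT_5_Mini_Transcripts_Summary.csv`**: Base CSV, read when the V2 output file does not exist yet (required on a first run; produced by `02_Transcript Problem/Solution Extraction + Thematic Analysis`)
- **`data/processed/GPT_5_Mini_Transcripts_SummaryV2.csv`** (optional): Earlier output of this cell; if present, processing resumes from it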
- **OpenAI API**: GPT-5-mini chat completions endpoint

## Outputs
- **`data/processed/GPT_5_Mini_Transcripts_SummaryV2.csv`**: Updated CSV with a `Summary` column containing GPT-generated summaries
- Progress is saved after each summary, enabling safe resume on failure or interruption
- Console output showing processing status for each video

## Notes / Assumptions
- Assumes `transcripts.json` contains a list of dictionaries with `VideoID` and `Transcript` keys
- Requires a `VideoID` column in the CSV; an empty `Summary` column is added if none exists
- Skips transcripts with fewer than 30 words or missing video IDs
- Depends on path variables (`TRANSCRIPTS_FILE`, `TRANSCRIPT_SUMMARY_CSV`, `TRANSCRIPT_SUMMARY_V2`) defined in cell 1
- Uses a signal handler to catch CTRL+C and save progress before exiting
- Summaries marked as "No summary" indicate insufficient or missing transcript data
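
The per-row decision described in these notes can be sketched as a small pure function. This is an illustration of the rules, not the cell's actual code: the names `has_summary` and `decide` are hypothetical, and a plain float check stands in for `pd.notna` so the block runs without pandas.

```python
import math

MIN_WORDS = 30  # minimum transcript length stated above

def has_summary(value):
    """True when a cell already holds a real summary (longer than 10 characters)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    return len(str(value).strip()) > 10

def decide(video_id, existing_summary, transcript_lookup):
    """Return 'skip', 'placeholder', or 'summarize' for one CSV row."""
    if has_summary(existing_summary):
        return "skip"
    text = transcript_lookup.get(video_id, "").strip()
    if len(text.split()) < MIN_WORDS:
        return "placeholder"  # the row would be marked "No summary"
    return "summarize"        # the row would be sent to the model

lookup = {"a1": "word " * 40, "b2": "too short"}
print(decide("a1", None, lookup))
print(decide("b2", None, lookup))
print(decide("a1", "This transcript explains grounding.", lookup))
```

Because `"No summary"` is exactly 10 characters long, it does not count as an existing summary under this rule, so placeholder rows are evaluated again on a rerun.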

In [None]:
import json
import pandas as pd
from openai import OpenAI
from dotenv import load_dotenv
import os
import time
import random
import signal
import sys

# === CONFIG (using portable paths from configuration cell) ===
CSV_FILE = TRANSCRIPT_SUMMARY_CSV
OUTPUT_FILE = TRANSCRIPT_SUMMARY_V2
MODEL_NAME = "gpt-5-mini"

# === LOAD ENVIRONMENT VARIABLES ===
load_dotenv()
API_KEY = os.getenv("OPENAI_API_KEY")

if not API_KEY:
    raise ValueError("Missing OPENAI_API_KEY in .env file")

# === INIT OPENAI CLIENT ===
client = OpenAI(api_key=API_KEY)

# === LOAD TRANSCRIPTS ===
with open(TRANSCRIPTS_FILE, "r", encoding="utf-8") as f:
    transcripts = json.load(f)

# === LOAD EXISTING OUTPUT OR BASE FILE ===
if OUTPUT_FILE.exists():
    print(f"Resuming from existing file: {OUTPUT_FILE}")
    df = pd.read_csv(OUTPUT_FILE)
else:
    df = pd.read_csv(CSV_FILE)

# Ensure there's a Summary column
if "Summary" not in df.columns:
    df["Summary"] = ""

# === CREATE LOOKUP FOR TRANSCRIPTS ===
transcript_lookup = {
    item["VideoID"]: item["Transcript"]
    for item in transcripts
    if "VideoID" in item and "Transcript" in item
}

# === SIGNAL HANDLER TO SAVE ON CTRL+C ===
def handle_interrupt(signal_received, frame):
    print("\nCTRL+C detected — saving progress...")
    df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8")
    print(f"Progress saved to: {OUTPUT_FILE}")
    sys.exit(0)

signal.signal(signal.SIGINT, handle_interrupt)

# === MAIN LOOP ===
for i, row in df.iterrows():
    video_id = row.get("VideoID")

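    # Skip if already summarized. The placeholder "No summary" is exactly
    # 10 characters, so it fails the > 10 check and is re-evaluated on every run.
    if pd.notna(row["Summary"]) and len(str(row["Summary"]).strip()) > 10:
        continue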

    if video_id not in transcript_lookup:
        print(f"Skipping {video_id} — transcript not found.")
        df.at[i, "Summary"] = "No summary"
        df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8")
        continue

    transcript_text = transcript_lookup[video_id].strip()

    # Skip empty or too short transcripts
    if len(transcript_text.split()) < 30:
        df.at[i, "Summary"] = "No summary"
        df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8")
        print(f"Skipping {video_id}: not enough information.")
        continue

    try:
        # Ask GPT-5-mini for a consistent compact paragraph summary
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a precise summarizer that outputs only a single short paragraph. "
                        "Follow this exact style:\n\n"
                        "This transcript explains [main topic], covering [key technical points]. "
                        "It discusses [relevant Electrical Engineering aspects] and explains [notable insight]. "
                        "Overall, it emphasizes [core conclusion or application]. "
                        "If there is not enough information, respond with exactly: No summary."
                    ),
                },
                {
                    "role": "user",
                    "content": f"Summarize the following transcript:\n\n{transcript_text}",
                },
            ]
        )

        summary = response.choices[0].message.content.strip()

        # Handle empty or malformed outputs
        if not summary or summary.lower() == "no summary":
            summary = "No summary"

        df.at[i, "Summary"] = summary
        print(f"Summarized {video_id}")

        # Save progress after each summary
        df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8")

        # Random delay (2–6 s)
        time.sleep(random.uniform(2, 6))

    except Exception as e:
        print(f"Error summarizing {video_id}: {e}")
        time.sleep(10)

# === FINAL SAVE ===
df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8")
print(f"\nAll done! Updated file saved to: {OUTPUT_FILE}")

## 02_Transcript Problem/Solution Extraction + Thematic Analysis
- Loads transcripts from `transcripts.json` and extracts problem-solution pairs along with multi-topic percentage classifications using GPT-5-mini
- Parses GPT output to identify general topics (e.g., Wiring, Safety, Tools) and subtopics with percentage relevance (e.g., Circuit Design 40%, NEC Compliance 60%)
- Normalizes topic and subtopic percentages separately so that each list sums to 100%, absorbing rounding error into the first entry
- Implements checkpoint-based resume capability to skip already-processed videos and recover from interruptions
- Saves results incrementally to CSV after each successful video analysis
- Retries each video up to 2 times (3 attempts in total) when the API call fails or the response contains no parsable topics
- Supports graceful shutdown on keyboard interrupt (CTRL+C), preserving all progress in the checkpoint file
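
The checkpoint-and-resume pattern listed above can be sketched in isolation. This is a minimal illustration with hypothetical names (`load_checkpoint`, `save_checkpoint`, `run`) and a stand-in `analyze` callback in place of the API call. It converts IDs to strings as a precaution, because JSON object keys are always strings once a checkpoint is reloaded.

```python
import json
import tempfile
from pathlib import Path

def load_checkpoint(path):
    """Return previously saved results, or an empty dict on a first run."""
    if path.exists():
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_checkpoint(path, data):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

def run(items, checkpoint_path, analyze):
    done = load_checkpoint(checkpoint_path)
    for item in items:
        vid = str(item["VideoID"])  # JSON keys come back as strings
        if vid in done:
            continue                # already processed in an earlier run
        done[vid] = analyze(item)
        save_checkpoint(checkpoint_path, done)  # persist after every item
    return done

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    calls = []

    def fake_analyze(item):
        calls.append(item["VideoID"])
        return {"Title": item["Title"]}

    items = [{"VideoID": "a1", "Title": "First"}, {"VideoID": "b2", "Title": "Second"}]
    run(items[:1], ckpt, fake_analyze)  # first run handles only "a1"
    run(items, ckpt, fake_analyze)      # second run skips "a1"
    print(calls)  # ['a1', 'b2']
```

Writing the checkpoint after every item trades some I/O for the guarantee that an interruption loses at most the item in flight.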

## Configuration / Setup
- **`API_KEY`**: OpenAI API key loaded from `.env` file (required)
- **`API_URL`**: OpenAI chat completions endpoint = `"https://api.openai.com/v1/chat/completions"`
- **`OUTPUT_DIR`**: Output directory = `PROCESSED_DIR` (from cell 1)
- **Model**: `"gpt-5-mini"`, hard-coded in the API request payload (this cell defines no `MODEL_NAME` variable)
- **`MAX_RETRIES`**: Maximum retry attempts per video = `2`
- **Topic taxonomy**: `GENERAL_TOPICS` (8 categories, including `Undefined`) and `SUBTOPIC_GROUPS` (7 topic families with 5 subtopics each, except `Safety`, which has 4)

## Inputs
- **`data/raw/transcripts.json`**: JSON file containing video transcripts with `VideoID`, `Title`, `URL`, and `Transcript` fields (via `TRANSCRIPTS_FILE` from cell 1)
- **`data/processed/gpt_5_mini_checkpoint.json`** (optional): Checkpoint file to resume from previously processed videos
- **OpenAI API**: GPT-5-mini chat completions endpoint for problem/solution extraction and topic classification

## Outputs
- **`data/processed/GPT_5_Mini_Transcripts_Summary.csv`**: CSV with columns `VideoID`, `Title`, `URL`, `Problem`, `Solution`, `Topics`, `Subtopics` (topics/subtopics formatted as "Label (X%)")
- **`data/processed/gpt_5_mini_checkpoint.json`**: JSON checkpoint file storing processed video metadata, enabling resume on interruption
- Console output showing processing status for each video ID
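
Since the `Topics` and `Subtopics` columns store comma-separated `Label (X%)` strings, later analysis has to split them back into pairs. The following is one possible way to do that, assuming labels never contain a `(N%)` fragment themselves; `split_cell` and `topic_totals` are hypothetical helper names.

```python
import re
from collections import defaultdict

# One "Label (NN%)" entry, terminated by a comma or the end of the cell
PAIR_RE = re.compile(r"\s*(.+?) \((\d+)%\)(?:,|$)")

def split_cell(cell):
    """Turn 'Wiring (40%), Safety (60%)' into [('Wiring', 40), ('Safety', 60)]."""
    return [(label, int(pct)) for label, pct in PAIR_RE.findall(cell)]

def topic_totals(cells):
    """Sum the percentage points assigned to each label across rows."""
    totals = defaultdict(int)
    for cell in cells:
        for label, pct in split_cell(cell):
            totals[label] += pct
    return dict(totals)

rows = ["Wiring (40%), Codes & Regulations (60%)", "Wiring (100%)"]
print(topic_totals(rows))  # {'Wiring': 140, 'Codes & Regulations': 60}
```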

## Notes / Assumptions
- Assumes `transcripts.json` contains a list of dictionaries with required keys: `VideoID`, `Title`, `URL`, `Transcript`
- Depends on path variables (`TRANSCRIPTS_FILE`, `PROCESSED_DIR`) defined in cell 1
- GPT output must follow a strict format with "Problem:", "Solution:", "Topics:", and "Subtopics:" sections
- Topics and subtopics are extracted via regex matching lines like `- Label (X%)`
- CSV is appended to incrementally; existing rows are preserved if the file already exists
- Checkpoint file prevents duplicate processing when the script is rerun
- The `normalize_percentages` function rescales whatever percentages were parsed so that each list sums to 100, even when GPT's own figures do not
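
A short worked example of the parsing and normalization described in these notes. It is a standalone sketch that reuses the same `- Label (X%)` pattern; `parse_pairs` and `normalize` are illustrative names and cover only the non-zero-total path.

```python
import re

LINE_RE = re.compile(r"-\s*(.+?)\s*\((\d+)%\)")

def parse_pairs(lines):
    """Collect (label, percent) tuples from lines shaped like '- Label (40%)'."""
    pairs = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            pairs.append((match.group(1), int(match.group(2))))
    return pairs

def normalize(pairs):
    """Rescale to a total of 100 and give the rounding remainder to the first entry."""
    total = sum(pct for _, pct in pairs)
    if total == 0:
        return list(pairs)  # zero totals need separate handling; not covered here
    scaled = [(label, round(pct * 100 / total)) for label, pct in pairs]
    diff = 100 - sum(pct for _, pct in scaled)
    first_label, first_pct = scaled[0]
    scaled[0] = (first_label, first_pct + diff)
    return scaled

raw = ["  - Wiring (40%)", "  - Codes & Regulations (40%)", "  - Safety (40%)"]
print(normalize(parse_pairs(raw)))
# [('Wiring', 34), ('Codes & Regulations', 33), ('Safety', 33)]
```

Here the model's figures sum to 120, so each share is rescaled to 33 and the missing point goes to the first entry.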

In [None]:
import requests
import os
import json
import csv
from dotenv import load_dotenv
import re

# === Load environment variables ===
load_dotenv()
API_KEY = os.getenv("OPENAI_API_KEY")
if not API_KEY:
    raise ValueError("Missing OPENAI_API_KEY in .env file")
API_URL = "https://api.openai.com/v1/chat/completions"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# === Output directory (using portable path from configuration cell) ===
OUTPUT_DIR = PROCESSED_DIR
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# === Topic Definitions ===
GENERAL_TOPICS = [
    "Wiring", "Safety", "Tools", "Planning",
    "Education", "Technology", "Codes & Regulations", "Undefined"
]

SUBTOPIC_GROUPS = {
    "Wiring": ["Residential Wiring", "Circuit Design", "Conduit Wiring", "Panel Wiring", "Load Balancing"],
    "Safety": ["Tool Safety", "Electrical shock/fire safety", "PPE Usage", "Hazard Prevention"],
    "Tools": ["Tool Durability", "Tool Usage", "Measurement Tools", "Cutting Tools", "Power Tools"],
    "Planning": ["Wiring Planning", "Circuit Layout", "Load Calculation", "Electrical Design", "Construction Planning"],
    "Education": ["Apprenticeship", "Training", "Instruction", "Leadership Development", "Student Projects"],
    "Technology": ["Modern Tools", "Automation", "Smart Systems", "Prefabrication", "Innovation"],
    "Codes & Regulations": ["NEC Compliance", "Local Codes", "Permitting", "Building Codes", "Inspection Requirements"]
}

MAX_RETRIES = 2


# === GPT-5 Request Function ===
def ask_openai(transcript):
    """Send transcript to GPT-5-mini and get multi-topic percentage classification"""
    prompt = f"""
    From the following transcript, extract:
    1. Problem: The main challenge or issue being addressed.
    2. Solution: A concise description of how the problem was solved.
    3. Topics: Assign one or more general topics (from {", ".join(GENERAL_TOPICS)}) 
       with estimated percentage relevance. 
       Example: 
         - Wiring (40%)
         - Codes & Regulations (60%)
    4. Subtopics: Assign subtopics under each selected topic with their respective percentages.
       Example:
         - Circuit Design (40%)
         - NEC Compliance (60%)

    Rules:
    - Percentages across topics and subtopics must total approximately 100%.
    - Include at least one topic and one subtopic.
    - Never use "Undefined" unless no suitable label applies.
    - Be consistent and concise.

    Respond **exactly** in this structure:

    Problem: <text>
    Solution: <text>
    Topics:
      - <Topic 1> (<percent>%)
      - <Topic 2> (<percent>%)
    Subtopics:
      - <Subtopic 1> (<percent>%)
      - <Subtopic 2> (<percent>%)

    Transcript:
    {transcript}
    """

    data = {
        "model": "gpt-5-mini",
        "messages": [
            {"role": "system", "content": "You are an assistant that classifies electrical engineering transcripts into multiple topics with percentage relevance."},
            {"role": "user", "content": prompt}
        ]
    }

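    # The timeout (120 s, an assumed value) keeps a stalled connection from
    # blocking the loop; network errors return None so the caller can retry.
    try:
        response = requests.post(API_URL, headers=headers, json=data, timeout=120)
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        return None

    if response.status_code == 200:
        res_json = response.json()
        return res_json["choices"][0]["message"]["content"]

    print(f"Error: {response.status_code} - {response.text}")
    return None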


# === Result Parsing ===
def parse_result(result):
    """Parse GPT output into structured components"""
    problem, solution = "Not Extracted", "Not Extracted"
    topics, subtopics = [], []

    lines = result.splitlines()
    section = None

    for raw_line in lines:
        # Strip first: the model may indent its reply like the prompt template
        line = raw_line.strip()
        lowered = line.lower()
        if lowered.startswith("problem"):
            problem = line.split(":", 1)[-1].strip()
        elif lowered.startswith("solution"):
            solution = line.split(":", 1)[-1].strip()
        elif lowered.startswith("topics"):
            section = "topics"
        elif lowered.startswith("subtopics"):
            section = "subtopics"
        elif line.startswith("-"):
            # Matches bullet lines such as "- Circuit Design (40%)"
            match = re.match(r"-\s*(.+?)\s*\((\d+)%\)", line)
            if match:
                label, percent = match.groups()
                percent = int(percent)
                if section == "topics":
                    topics.append((label, percent))
                elif section == "subtopics":
                    subtopics.append((label, percent))

    # Normalize percentages so they sum to 100
    topics = normalize_percentages(topics)
    subtopics = normalize_percentages(subtopics)

    return problem, solution, topics, subtopics


def normalize_percentages(pairs):
    """Ensure percentages sum to 100 and fix rounding errors"""
    if not pairs:
        return []

    total = sum(p[1] for p in pairs)
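    if total == 0:
        # Split 100 evenly; the first `remainder` entries get one extra point
        # so the shares still sum to exactly 100 (e.g. 34, 33, 33).
        share, remainder = divmod(100, len(pairs))
        return [(p[0], share + (1 if i < remainder else 0)) for i, p in enumerate(pairs)]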

    normalized = [(p[0], round(p[1] * 100 / total)) for p in pairs]
    diff = 100 - sum(p[1] for p in normalized)
    if diff != 0:
        first = list(normalized[0])
        first[1] += diff
        normalized[0] = tuple(first)
    return normalized


# === Main Transcript Processing ===
def process_transcripts(json_file, output_csv, checkpoint_file):
    with open(json_file, "r", encoding="utf-8") as f:
        transcripts = json.load(f)

    checkpoint_data = {}
    if checkpoint_file.exists():
        with open(checkpoint_file, "r", encoding="utf-8") as f:
            checkpoint_data = json.load(f)

    write_header = not output_csv.exists()
    with open(output_csv, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        if write_header:
            writer.writerow(["VideoID", "Title", "URL", "Problem", "Solution", "Topics", "Subtopics"])

        for item in transcripts:
            vid = item["VideoID"]
            if vid in checkpoint_data:
                print(f"Skipping Video {vid} (already processed).")
                continue

            print(f"Processing Video {vid}: {item['Title'][:50]}...")

            retries = 0
            while retries <= MAX_RETRIES:
                result = ask_openai(item["Transcript"])
                if not result:
                    retries += 1
                    continue

                problem, solution, topics, subtopics = parse_result(result)
                if not topics:
                    retries += 1
                    print(f"Retry {retries} for Video {vid} due to invalid output...")
                    continue

                # Convert to readable strings for CSV
                topics_str = ", ".join([f"{t} ({p}%)" for t, p in topics])
                subtopics_str = ", ".join([f"{t} ({p}%)" for t, p in subtopics])

                writer.writerow([vid, item["Title"], item["URL"], problem, solution, topics_str, subtopics_str])
                csvfile.flush()

                checkpoint_data[vid] = {
                    "Title": item["Title"],
                    "URL": item["URL"],
                    "Problem": problem,
                    "Solution": solution,
                    "Topics": topics,
                    "Subtopics": subtopics
                }
                with open(checkpoint_file, "w", encoding="utf-8") as f:
                    json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)

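                print(f"Saved Video {vid}")
                break
            else:
                # Runs only when the loop ended without a successful save
                print(f"Giving up on Video {vid} after {MAX_RETRIES + 1} attempts.")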


# === Run Script ===
if __name__ == "__main__":
    try:
        process_transcripts(
            json_file=TRANSCRIPTS_FILE,
            output_csv=OUTPUT_DIR / "GPT_5_Mini_Transcripts_Summary.csv",
            checkpoint_file=OUTPUT_DIR / "gpt_5_mini_checkpoint.json"
        )
    except KeyboardInterrupt:
        print("\nInterrupted by user. Progress saved successfully.")

## 03_Transcript Analysis Results

In [None]:
# code showing examples from the transcript analysis
# Some basic visualizations