## File Set-Up
- Adds the repository root to `sys.path` so that imports from `src/` work whether the notebook is run from the repo root or from `notebooks/`
- Detects whether the notebook is in a `notebooks/` subdirectory or the repo root, then resolves the correct `repo_root`
- Imports shared path constants from `src.paths` (PROJECT_ROOT, RAW_DIR, PROCESSED_DIR, REFERENCE_DIR, FIGURES_DIR)
- Defines notebook-specific input and output file paths using the imported directory constants
- Prints the resolved project root and key file paths for verification

## Configuration / Setup
- **`nb_dir`**: Current working directory of the notebook
- **`repo_root`**: Repository root, resolved as the parent of `nb_dir` when `nb_dir` is named `notebooks`, otherwise `nb_dir` itself
- **Input file**: `TRANSCRIPTS_FILE` = `data/raw/transcripts.json`
- **Output files**:
    - `TRANSCRIPT_SUMMARY_CSV` = `data/processed/Transcript_Summary.csv` (Step 1 output)
    - `TRANSCRIPT_SUMMARY_V2` = `data/processed/Transcript_SummaryV2.csv` (Step 2 output)
    - `CHECKPOINT_FILE` = `data/processed/transcript_summary_checkpoint.json`
    - `CHECKPOINT_FILE_V2` = `data/processed/transcript_analysis_checkpoint.json`

## Inputs
- **`src/paths.py`**: Module defining `PROJECT_ROOT`, `RAW_DIR`, `PROCESSED_DIR`, `REFERENCE_DIR`, `FIGURES_DIR`
- **Expected directory structure**: A `notebooks/` folder (optional) and a `src/` module at the repository root
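
The contents of `src/paths.py` are not shown in this notebook. As a rough illustration only, a module exporting these five constants might look like the sketch below. The `data/raw` and `data/processed` locations follow the paths listed in the configuration section; the `DATA_DIR` helper and the locations of `REFERENCE_DIR` and `FIGURES_DIR` are assumptions.

```python
# Hypothetical sketch of src/paths.py; the real module may differ.
from pathlib import Path

# Assumption: this file lives at <repo_root>/src/paths.py,
# so the repository root is two levels up from this file.
try:
    PROJECT_ROOT = Path(__file__).resolve().parents[1]
except (NameError, IndexError):
    # __file__ is undefined when the code is pasted into a notebook or REPL.
    PROJECT_ROOT = Path.cwd().resolve()

# data/raw and data/processed match the paths listed in this notebook.
DATA_DIR = PROJECT_ROOT / "data"        # hypothetical helper name
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"
REFERENCE_DIR = DATA_DIR / "reference"  # assumed location
FIGURES_DIR = PROJECT_ROOT / "figures"  # assumed location
```

Deriving every directory from one `PROJECT_ROOT` keeps the notebook's file paths independent of the working directory.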

## Outputs
- Console output showing:
    - Resolved project root path
    - Transcripts input file path
    - Step 1 and Step 2 output file paths
- No files are written by this cell; it only configures paths for downstream cells

## Notes / Assumptions
- Assumes `src/paths.py` exists and exports the required path constants
- Assumes the repository has either a flat structure or a `notebooks/` subdirectory
- Subsequent cells depend on the path variables defined here
- This is a setup cell and must be run before any data processing cells

In [None]:
# === Setup: Ensure repo root is on sys.path for imports ===
import sys
from pathlib import Path

nb_dir = Path.cwd().resolve()
repo_root = nb_dir.parent if nb_dir.name == "notebooks" else nb_dir
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

# === Import shared paths ===
from src.paths import PROJECT_ROOT, RAW_DIR, PROCESSED_DIR, REFERENCE_DIR, FIGURES_DIR

# === Notebook-specific file paths ===
# Input files
TRANSCRIPTS_FILE = RAW_DIR / "transcripts.json"

# Output files
TRANSCRIPT_SUMMARY_CSV = PROCESSED_DIR / "Transcript_Summary.csv"  # Step 1 output
TRANSCRIPT_SUMMARY_V2 = PROCESSED_DIR / "Transcript_SummaryV2.csv"  # Step 2 output
CHECKPOINT_FILE = PROCESSED_DIR / "transcript_summary_checkpoint.json"  # Step 1 checkpoint
CHECKPOINT_FILE_V2 = PROCESSED_DIR / "transcript_analysis_checkpoint.json"  # Step 2 checkpoint

print(f"Project root: {PROJECT_ROOT}")
print(f"Transcripts input: {TRANSCRIPTS_FILE}")
print(f"Step 1 output: {TRANSCRIPT_SUMMARY_CSV}")
print(f"Step 2 output: {TRANSCRIPT_SUMMARY_V2}")

## 01_Summarize_Transcripts
- Loads transcripts from `transcripts.json` and extracts metadata + generates summaries
- Uses OpenAI's GPT-5-mini model to generate single-paragraph summaries for each transcript
- Extracts VideoID, Title, URL, ViewCount, and LikeCount from the JSON data
- Skips videos already processed; transcripts with fewer than 30 words are recorded as "No summary" without calling the API
- Saves progress incrementally after each summary to resume on interruption
- Implements a graceful shutdown handler (CTRL+C) that saves all progress before exiting
- Adds random 2–6 second delays between API calls to avoid rate limiting

## Configuration / Setup
- **`OUTPUT_FILE`**: Output CSV file = `data/processed/Transcript_Summary.csv`
- **`MODEL_NAME`**: OpenAI model = `"gpt-5-mini"`
- **`OPENAI_API_KEY`**: Environment variable loaded from `.env` file (required)
- **Prompt template**: Enforces a specific summary format via system message

## Inputs
- **`data/raw/transcripts.json`**: JSON file containing video transcripts with `VideoID`, `Title`, `URL`, `ViewCount`, `LikeCount`, and `Transcript` fields
- **`data/processed/transcript_summary_checkpoint.json`** (optional): Checkpoint file to resume from
- **OpenAI API**: GPT-5-mini chat completions endpoint
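
Before spending API calls, it can help to confirm that each record has the expected fields. The sketch below is an optional pre-flight check, not part of the notebook; `REQUIRED_KEYS`, `check_record`, and the sample records are hypothetical names and made-up data used for illustration.

```python
# Optional pre-flight check (hypothetical helper, not in the notebook).
REQUIRED_KEYS = {"VideoID", "Title", "URL", "ViewCount", "LikeCount", "Transcript"}
MIN_WORDS = 30  # same word threshold the notebook applies


def check_record(item):
    """Return (missing_keys, word_count) for one transcript record."""
    missing = sorted(REQUIRED_KEYS - item.keys())
    text = item.get("Transcript") or ""  # tolerate a missing or null transcript
    return missing, len(text.split())


# Made-up example records, not real data
sample = [
    {"VideoID": "abc123", "Title": "Demo", "URL": "https://example.com/abc123",
     "ViewCount": 10, "LikeCount": 1, "Transcript": "word " * 40},
    {"VideoID": "def456", "Title": "Short clip", "Transcript": "too short"},
]

for item in sample:
    missing, n_words = check_record(item)
    status = "ok" if not missing and n_words >= MIN_WORDS else "needs attention"
    print(item.get("VideoID"), status, missing, n_words)
```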

## Outputs
- **`data/processed/Transcript_Summary.csv`**: CSV with columns `VideoID`, `Title`, `URL`, `ViewCount`, `LikeCount`, `Summary`
- **`data/processed/transcript_summary_checkpoint.json`**: Checkpoint file tracking processed videos
- Console output showing processing status for each video

## Notes / Assumptions
- Assumes `transcripts.json` contains a list of dictionaries with `VideoID`, `Title`, `URL`, `ViewCount`, `LikeCount`, and `Transcript` keys
- Records "No summary" for transcripts with fewer than 30 words and skips items with missing video IDs
- Depends on path variables defined in cell 1
- Uses a signal handler to catch CTRL+C and save progress before exiting
- Summaries marked as "No summary" indicate insufficient or missing transcript data
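
The checkpoint is rewritten after every video, so an interruption in the middle of a write could leave a truncated JSON file. One possible hardening, shown here as a sketch rather than what the notebook does, is to write to a temporary file and swap it into place with `os.replace`. The helper names `save_checkpoint_atomic` and `load_checkpoint` are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint_atomic(data, path):
    """Write JSON to a temp file in the same directory, then swap it into place."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        os.replace(tmp_name, path)  # atomic when both paths are on one filesystem
    except BaseException:
        # Remove the temp file if anything failed before the swap
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def load_checkpoint(path):
    """Return the saved dict, or an empty dict if no checkpoint exists yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Usage with a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    state = load_checkpoint(ckpt)  # empty on the first run
    state["abc123"] = {"Summary": "No summary"}
    save_checkpoint_atomic(state, ckpt)
    restored = load_checkpoint(ckpt)
```

With this pattern a reader of the checkpoint sees either the previous complete file or the new complete file, never a partial one.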

In [None]:
import json
import csv
from openai import OpenAI
from dotenv import load_dotenv
import os
import time
import random
import signal
import sys

# === CONFIG (using portable paths from configuration cell) ===
OUTPUT_FILE = TRANSCRIPT_SUMMARY_CSV
MODEL_NAME = "gpt-5-mini"

# === LOAD ENVIRONMENT VARIABLES ===
load_dotenv()
API_KEY = os.getenv("OPENAI_API_KEY")

if not API_KEY:
    raise ValueError("Missing OPENAI_API_KEY in .env file")

# === INIT OPENAI CLIENT ===
client = OpenAI(api_key=API_KEY)

# === LOAD TRANSCRIPTS ===
with open(TRANSCRIPTS_FILE, "r", encoding="utf-8") as f:
    transcripts = json.load(f)

# === LOAD CHECKPOINT ===
checkpoint_data = {}
if CHECKPOINT_FILE.exists():
    with open(CHECKPOINT_FILE, "r", encoding="utf-8") as f:
        checkpoint_data = json.load(f)
    print(f"Resuming from checkpoint with {len(checkpoint_data)} processed videos")

# === SIGNAL HANDLER TO SAVE ON CTRL+C ===
def handle_interrupt(signal_received, frame):
    print("\nCTRL+C detected — saving progress...")
    with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
        json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)
    print(f"Progress saved to: {CHECKPOINT_FILE}")
    sys.exit(0)

signal.signal(signal.SIGINT, handle_interrupt)

# === INITIALIZE OUTPUT CSV ===
write_header = not OUTPUT_FILE.exists()
csvfile = open(OUTPUT_FILE, "a", newline="", encoding="utf-8")
writer = csv.writer(csvfile)
if write_header:
    writer.writerow(["VideoID", "Title", "URL", "ViewCount", "LikeCount", "Summary"])

# === MAIN LOOP ===
try:
    for item in transcripts:
        video_id = item.get("VideoID")
        
        if not video_id:
            print("Skipping item — no VideoID found.")
            continue

        # Skip if already processed
        if video_id in checkpoint_data:
            continue

        title = item.get("Title", "")
        url = item.get("URL", "")
        view_count = item.get("ViewCount", 0)
        like_count = item.get("LikeCount", 0)
        # Treat a missing or null transcript as empty text
        transcript_text = (item.get("Transcript") or "").strip()

        # Skip empty or too short transcripts
        if len(transcript_text.split()) < 30:
            summary = "No summary"
            writer.writerow([video_id, title, url, view_count, like_count, summary])
            csvfile.flush()
            checkpoint_data[video_id] = {"Summary": summary}
            with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
                json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)
            print(f"Skipping {video_id} — not enough information.")
            continue

        try:
            # Ask GPT-5-mini for a consistent compact paragraph summary
            response = client.chat.completions.create(
                model=MODEL_NAME,
                messages=[
                    {
                        "role": "system",
                        "content": (
                            "You are a precise summarizer that outputs only a single short paragraph. "
                            "Follow this exact style:\n\n"
                            "This transcript explains [main topic], covering [key technical points]. "
                            "It discusses [relevant Electrical Engineering aspects] and explains [notable insight]. "
                            "Overall, it emphasizes [core conclusion or application]. "
                            "If there is not enough information, respond with exactly: No summary."
                        ),
                    },
                    {
                        "role": "user",
                        "content": f"Summarize the following transcript:\n\n{transcript_text}",
                    },
                ]
            )

            summary = response.choices[0].message.content.strip()

            # Handle empty or malformed outputs
            if not summary or summary.lower() == "no summary":
                summary = "No summary"

            # Write to CSV
            writer.writerow([video_id, title, url, view_count, like_count, summary])
            csvfile.flush()

            # Update checkpoint
            checkpoint_data[video_id] = {"Summary": summary}
            with open(CHECKPOINT_FILE, "w", encoding="utf-8") as f:
                json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)

            print(f"Summarized {video_id}")

            # Random delay (2–6 s)
            time.sleep(random.uniform(2, 6))

        except Exception as e:
            print(f"Error summarizing {video_id}: {e}")
            time.sleep(10)

finally:
    csvfile.close()

print(f"\nStep 1 complete! Output saved to: {OUTPUT_FILE}")

## 02_Problem/Solution and Topic Extraction
- Loads `Transcript_Summary.csv` from Step 1 and the original transcripts from `transcripts.json`
- Uses GPT-5-mini to extract Problem, Solution, Topics, and Subtopics from each full transcript
- Adds these four new columns to create `Transcript_SummaryV2.csv` (preserving all existing columns)
- Parses GPT output to identify topics and subtopics with percentage relevance
- Normalizes topic percentages and subtopic percentages separately so that each list sums to 100%
- Implements checkpoint-based resume capability to skip already-processed videos
- Saves results incrementally to avoid data loss on interruption
- Retries up to 2 times per video when the API call fails or no topics can be parsed from the output
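
The normalization step can be illustrated with a small standalone sketch. It rescales integer percentages proportionally and assigns any rounding remainder to the first entry, which is the approach the notebook's `normalize_percentages` takes for non-zero totals. The function name `normalize_to_100` and the equal-share handling of an all-zero list are illustrative choices, and the sample values are made up.

```python
def normalize_to_100(pairs):
    """Rescale (label, percent) pairs so the percentages sum to exactly 100."""
    if not pairs:
        return []
    total = sum(p for _, p in pairs)
    if total == 0:
        # No usable weights: fall back to equal shares
        scaled = [(label, round(100 / len(pairs))) for label, _ in pairs]
    else:
        scaled = [(label, round(p * 100 / total)) for label, p in pairs]
    # Push any rounding remainder onto the first entry
    diff = 100 - sum(p for _, p in scaled)
    first_label, first_pct = scaled[0]
    scaled[0] = (first_label, first_pct + diff)
    return scaled


# Three equal weights that sum to 90: each rounds to 33, and the first gets the extra 1
topics = normalize_to_100([("Wiring", 30), ("Safety", 30), ("Tools", 30)])
# -> [("Wiring", 34), ("Safety", 33), ("Tools", 33)]

# Weights that sum to 125 are scaled down proportionally
subtopics = normalize_to_100([("Circuit Design", 50), ("PPE Usage", 75)])
# -> [("Circuit Design", 40), ("PPE Usage", 60)]
```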

## Configuration / Setup
- **`INPUT_FILE`**: Input CSV = `data/processed/Transcript_Summary.csv` (from Step 1)
- **`OUTPUT_FILE`**: Output CSV = `data/processed/Transcript_SummaryV2.csv`
- **`MODEL_NAME`**: GPT-5-mini (specified in API request)
- **`MAX_RETRIES`**: Maximum retry attempts per video = `2`
- **Topic taxonomy**: `GENERAL_TOPICS` (8 categories) and `SUBTOPIC_GROUPS` (7 topic families with subtopics)

## Inputs
- **`data/processed/Transcript_Summary.csv`**: CSV with VideoID, Title, URL, ViewCount, LikeCount, Summary (from Step 1)
- **`data/raw/transcripts.json`**: JSON file containing full transcripts for Problem/Solution extraction
- **`data/processed/transcript_analysis_checkpoint.json`** (optional): Checkpoint file to resume from
- **OpenAI API**: GPT-5-mini chat completions endpoint

## Outputs
- **`data/processed/Transcript_SummaryV2.csv`**: CSV with all Step 1 columns plus `Problem`, `Solution`, `Topics`, `Subtopics`
- **`data/processed/transcript_analysis_checkpoint.json`**: Checkpoint file tracking processed videos
- Console output showing processing status for each video ID

## Notes / Assumptions
- Requires Step 1 (`Transcript_Summary.csv`) to be completed first
- Uses full transcript from JSON for extraction (not the summary)
- Depends on path variables defined in cell 1
- Topics and subtopics are formatted as "Label (X%)" strings in the CSV
- The `normalize_percentages` function ensures valid percentage distributions
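
Because topics and subtopics are stored as "Label (X%)" strings, downstream analysis usually needs to turn them back into pairs. The sketch below shows one way to do that; `ENTRY_RE` and `parse_label_percentages` are hypothetical names, and the pattern assumes labels contain no parentheses and that entries are comma-separated, as in the Step 2 output.

```python
import re

# Matches one "Label (X%)" entry followed by a comma or the end of the string.
# Assumption: labels never contain parentheses.
ENTRY_RE = re.compile(r"\s*(.+?)\s*\((\d+)%\)\s*(?:,|$)")


def parse_label_percentages(cell):
    """Turn 'Wiring (40%), Safety (60%)' into [('Wiring', 40), ('Safety', 60)]."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # empty strings and NaN cells both yield no entries
    return [(label, int(pct)) for label, pct in ENTRY_RE.findall(cell)]


pairs = parse_label_percentages("Wiring (40%), Codes & Regulations (60%)")
```

When applied to a DataFrame column, for example with `df["Topics"].apply(parse_label_percentages)`, rows whose cells were left empty come back as empty lists.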

In [None]:
import json
import pandas as pd
from openai import OpenAI
from dotenv import load_dotenv
import os
import re
import time
import random
import signal
import sys

# === CONFIG (using portable paths from configuration cell) ===
INPUT_FILE = TRANSCRIPT_SUMMARY_CSV  # Step 1 output
OUTPUT_FILE = TRANSCRIPT_SUMMARY_V2  # Step 2 output
MODEL_NAME = "gpt-5-mini"
MAX_RETRIES = 2

# === LOAD ENVIRONMENT VARIABLES ===
load_dotenv()
API_KEY = os.getenv("OPENAI_API_KEY")

if not API_KEY:
    raise ValueError("Missing OPENAI_API_KEY in .env file")

# === INIT OPENAI CLIENT ===
client = OpenAI(api_key=API_KEY)

# === Topic Definitions ===
GENERAL_TOPICS = [
    "Wiring", "Safety", "Tools", "Planning",
    "Education", "Technology", "Codes & Regulations", "Undefined"
]

SUBTOPIC_GROUPS = {
    "Wiring": ["Residential Wiring", "Circuit Design", "Conduit Wiring", "Panel Wiring", "Load Balancing"],
    "Safety": ["Tool Safety", "Electrical shock/fire safety", "PPE Usage", "Hazard Prevention"],
    "Tools": ["Tool Durability", "Tool Usage", "Measurement Tools", "Cutting Tools", "Power Tools"],
    "Planning": ["Wiring Planning", "Circuit Layout", "Load Calculation", "Electrical Design", "Construction Planning"],
    "Education": ["Apprenticeship", "Training", "Instruction", "Leadership Development", "Student Projects"],
    "Technology": ["Modern Tools", "Automation", "Smart Systems", "Prefabrication", "Innovation"],
    "Codes & Regulations": ["NEC Compliance", "Local Codes", "Permitting", "Building Codes", "Inspection Requirements"]
}


# === GPT Request Function ===
def extract_problem_solution_topics(transcript_text):
    """Send transcript to GPT-5-mini and get problem/solution + topic classification"""
    prompt = f"""
From the following transcript, extract:
1. Problem: The main challenge or issue being addressed.
2. Solution: A concise description of how the problem was solved.
3. Topics: Assign one or more general topics (from {", ".join(GENERAL_TOPICS)}) 
   with estimated percentage relevance. 
   Example: 
     - Wiring (40%)
     - Codes & Regulations (60%)
4. Subtopics: Assign subtopics under each selected topic with their respective percentages.
   Example:
     - Circuit Design (40%)
     - NEC Compliance (60%)

Rules:
- Percentages across topics and subtopics must total approximately 100%.
- Include at least one topic and one subtopic.
- Never use "Undefined" unless no suitable label applies.
- Be consistent and concise.

Respond **exactly** in this structure:

Problem: <text>
Solution: <text>
Topics:
  - <Topic 1> (<percent>%)
  - <Topic 2> (<percent>%)
Subtopics:
  - <Subtopic 1> (<percent>%)
  - <Subtopic 2> (<percent>%)

Transcript:
{transcript_text}
"""

    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": "You are an assistant that extracts problems, solutions, and classifies electrical engineering transcripts into multiple topics with percentage relevance."},
            {"role": "user", "content": prompt}
        ]
    )
    
    return response.choices[0].message.content


# === Result Parsing ===
def parse_result(result):
    """Parse GPT output into structured components"""
    problem, solution = "Not Extracted", "Not Extracted"
    topics, subtopics = [], []

    lines = result.splitlines()
    section = None

    for line in lines:
        if line.lower().startswith("problem"):
            problem = line.split(":", 1)[-1].strip()
        elif line.lower().startswith("solution"):
            solution = line.split(":", 1)[-1].strip()
        elif line.lower().startswith("topics"):
            section = "topics"
        elif line.lower().startswith("subtopics"):
            section = "subtopics"
        elif line.strip().startswith("-"):
            match = re.match(r"-\s*(.+?)\s*\((\d+)%\)", line.strip())
            if match:
                label, percent = match.groups()
                percent = int(percent)
                if section == "topics":
                    topics.append((label, percent))
                elif section == "subtopics":
                    subtopics.append((label, percent))

    # Normalize percentages so they sum to 100
    topics = normalize_percentages(topics)
    subtopics = normalize_percentages(subtopics)

    return problem, solution, topics, subtopics


def normalize_percentages(pairs):
    """Ensure percentages sum to 100 and fix rounding errors"""
    if not pairs:
        return []

    total = sum(p[1] for p in pairs)
    if total == 0:
        # No usable weights: fall back to equal shares
        normalized = [(p[0], round(100 / len(pairs))) for p in pairs]
    else:
        normalized = [(p[0], round(p[1] * 100 / total)) for p in pairs]

    # Assign any rounding remainder to the first entry so the total is exactly 100
    # (this also covers the equal-share case, e.g. 3 x 33 = 99)
    diff = 100 - sum(p[1] for p in normalized)
    if diff != 0:
        first = list(normalized[0])
        first[1] += diff
        normalized[0] = tuple(first)
    return normalized


# === LOAD TRANSCRIPTS JSON ===
with open(TRANSCRIPTS_FILE, "r", encoding="utf-8") as f:
    transcripts = json.load(f)

# Create lookup for transcripts
transcript_lookup = {
    item["VideoID"]: item["Transcript"]
    for item in transcripts
    if "VideoID" in item and "Transcript" in item
}

# === LOAD STEP 1 CSV ===
if not INPUT_FILE.exists():
    raise FileNotFoundError(f"Step 1 output not found: {INPUT_FILE}. Run Step 1 first.")

df = pd.read_csv(INPUT_FILE)
print(f"Loaded {len(df)} rows from Step 1 output")

# === LOAD CHECKPOINT ===
checkpoint_data = {}
if CHECKPOINT_FILE_V2.exists():
    with open(CHECKPOINT_FILE_V2, "r", encoding="utf-8") as f:
        checkpoint_data = json.load(f)
    print(f"Resuming from checkpoint with {len(checkpoint_data)} processed videos")

# === Add new columns if they don't exist ===
for col in ["Problem", "Solution", "Topics", "Subtopics"]:
    if col not in df.columns:
        df[col] = ""

# === SIGNAL HANDLER TO SAVE ON CTRL+C ===
def handle_interrupt(signal_received, frame):
    print("\nCTRL+C detected — saving progress...")
    df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8")
    with open(CHECKPOINT_FILE_V2, "w", encoding="utf-8") as f:
        json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)
    print(f"Progress saved to: {OUTPUT_FILE}")
    sys.exit(0)

signal.signal(signal.SIGINT, handle_interrupt)

# === MAIN LOOP ===
try:
    for i, row in df.iterrows():
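        video_id = row.get("VideoID")

        # Skip rows with a missing ID (pandas reads empty cells as NaN, which is truthy)
        if pd.isna(video_id) or not video_id:
            continue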

        # Skip if already processed
        if str(video_id) in checkpoint_data:
            # Restore from checkpoint
            df.at[i, "Problem"] = checkpoint_data[str(video_id)].get("Problem", "")
            df.at[i, "Solution"] = checkpoint_data[str(video_id)].get("Solution", "")
            df.at[i, "Topics"] = checkpoint_data[str(video_id)].get("Topics", "")
            df.at[i, "Subtopics"] = checkpoint_data[str(video_id)].get("Subtopics", "")
            continue

        # Get full transcript from JSON
        if video_id not in transcript_lookup:
            print(f"Skipping {video_id} — transcript not found in JSON.")
            df.at[i, "Problem"] = "Not Extracted"
            df.at[i, "Solution"] = "Not Extracted"
            df.at[i, "Topics"] = ""
            df.at[i, "Subtopics"] = ""
            continue

        # Treat a null transcript as empty text so it falls into the short-transcript branch
        transcript_text = (transcript_lookup[video_id] or "").strip()

        # Skip empty or too short transcripts
        if len(transcript_text.split()) < 30:
            df.at[i, "Problem"] = "Not Extracted"
            df.at[i, "Solution"] = "Not Extracted"
            df.at[i, "Topics"] = ""
            df.at[i, "Subtopics"] = ""
            checkpoint_data[str(video_id)] = {
                "Problem": "Not Extracted",
                "Solution": "Not Extracted",
                "Topics": "",
                "Subtopics": ""
            }
            print(f"Skipping {video_id} — transcript too short.")
            continue

        print(f"Processing {video_id}...")

        retries = 0
        success = False
        while retries <= MAX_RETRIES and not success:
            try:
                result = extract_problem_solution_topics(transcript_text)
                problem, solution, topics, subtopics = parse_result(result)

                if not topics:
                    retries += 1
                    print(f"Retry {retries} for {video_id} — no topics extracted.")
                    time.sleep(2)
                    continue

                # Convert to readable strings
                topics_str = ", ".join([f"{t} ({p}%)" for t, p in topics])
                subtopics_str = ", ".join([f"{t} ({p}%)" for t, p in subtopics])

                # Update dataframe
                df.at[i, "Problem"] = problem
                df.at[i, "Solution"] = solution
                df.at[i, "Topics"] = topics_str
                df.at[i, "Subtopics"] = subtopics_str

                # Update checkpoint
                checkpoint_data[str(video_id)] = {
                    "Problem": problem,
                    "Solution": solution,
                    "Topics": topics_str,
                    "Subtopics": subtopics_str
                }
                with open(CHECKPOINT_FILE_V2, "w", encoding="utf-8") as f:
                    json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)

                # Save CSV incrementally
                df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8")

                print(f"Extracted {video_id}")
                success = True

                # Random delay
                time.sleep(random.uniform(2, 5))

            except Exception as e:
                print(f"Error processing {video_id}: {e}")
                retries += 1
                time.sleep(5)

        if not success:
            df.at[i, "Problem"] = "Not Extracted"
            df.at[i, "Solution"] = "Not Extracted"
            df.at[i, "Topics"] = ""
            df.at[i, "Subtopics"] = ""

finally:
    # Final save
    df.to_csv(OUTPUT_FILE, index=False, encoding="utf-8")
    with open(CHECKPOINT_FILE_V2, "w", encoding="utf-8") as f:
        json.dump(checkpoint_data, f, indent=2, ensure_ascii=False)

print(f"\nStep 2 complete! Output saved to: {OUTPUT_FILE}")

## 03_Transcript Analysis Results

In [None]:
# code showing examples from the transcript analysis
# Some basic visualizations