## Knowledge Bottleneck Graph
- Loads and preprocesses Q&A data from `Final_Analysis.csv`
- Creates boolean columns (`IsQuestion`, `HasSolution`, `AnsweredQuestion`) to identify question presence and resolution status
- Filters to question rows with valid categories and aggregates metrics by `Category` (total questions, answered questions, resolution rate)
- Filters out categories with fewer than 5 questions (`MIN_QUESTIONS_THRESHOLD`)
- Calculates reference lines (the unweighted mean of per-category resolution rates and the median per-category question count) to identify knowledge bottlenecks
- Creates an interactive bubble chart where bubble size represents unanswered questions, positioned by question volume (x-axis) and resolution rate (y-axis)
- Applies manual label positioning overrides for "Power" and "Conductors" categories to prevent overlap
- Exports the visualization as HTML, PNG, and SVG formats with publication-ready dimensions
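
The aggregation and reference-line steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: each question row is reduced to a `(category, answered)` pair, the helper name `summarize` is hypothetical, and the sample rows are made up rather than taken from `Final_Analysis.csv`.

```python
from collections import defaultdict
from statistics import mean, median

MIN_QUESTIONS_THRESHOLD = 5  # same threshold value the notebook uses


def summarize(rows, min_questions=MIN_QUESTIONS_THRESHOLD):
    """Aggregate (category, answered) pairs into per-category metrics.

    Hypothetical helper that mirrors the groupby/agg step without pandas.
    """
    totals = defaultdict(int)
    answered = defaultdict(int)
    for category, is_answered in rows:
        totals[category] += 1
        answered[category] += int(is_answered)

    table = {}
    for category, total in totals.items():
        if total < min_questions:
            continue  # drop categories with too few questions
        table[category] = {
            "total_questions": total,
            "answered_questions": answered[category],
            "unanswered_questions": total - answered[category],
            "resolution_rate": answered[category] / total,
        }
    return table


# Made-up sample data, for illustration only
sample_rows = (
    [("A", True)] * 2 + [("A", False)] * 6    # 8 questions, 2 answered
    + [("B", True)] * 3 + [("B", False)] * 3  # 6 questions, 3 answered
    + [("C", False)] * 2                      # 2 questions, below threshold
)

table = summarize(sample_rows)

# Reference lines: unweighted mean of per-category rates (in percent)
# and the median of per-category question counts
mean_resolution_pct = mean(m["resolution_rate"] for m in table.values()) * 100
median_questions = median(m["total_questions"] for m in table.values())
```

Because the mean is taken over categories rather than over questions, a small category moves the horizontal reference line as much as a large one does.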

## Configuration / Setup
- `MIN_QUESTIONS_THRESHOLD = 5` — minimum questions required for a category to be included
- `FONT_SIZE_PX = 13` — label and axis font size in pixels
- `LABEL_OFFSET_BOTTOM = (0, -25)`, `LABEL_OFFSET_TOP = (0, 25)` — default alternating label positions
- `LABEL_MANUAL_OVERRIDES` — dictionary for manual label positioning: `{'Power': (0, 40), 'Conductors': (0, -35)}`
- `SVG_DPI = 300`, `SVG_WIDTH_INCHES = 6.5` — publication SVG export dimensions
- Sentinel strings: `NO_QUESTION_SENTINEL = "No Question Present"`, `NO_SOLUTION_SENTINEL = "No Solution Present"`
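
As a rough sketch of how these settings interact, the label offset for each bubble could be resolved as below. The constants are copied from the list above; the helper names `label_offset` and `pt_to_px` are hypothetical and do not appear in the notebook, which performs the same checks inline.

```python
LABEL_OFFSET_BOTTOM = (0, -25)
LABEL_OFFSET_TOP = (0, 25)
LABEL_MANUAL_OVERRIDES = {"Power": (0, 40), "Conductors": (0, -35)}


def label_offset(short_label, index):
    """Return the (xshift, yshift) pixel offset for one label.

    Hypothetical helper: a manual override wins; otherwise even row
    indices place the label below the bubble and odd indices above it.
    """
    if short_label in LABEL_MANUAL_OVERRIDES:
        return LABEL_MANUAL_OVERRIDES[short_label]
    return LABEL_OFFSET_BOTTOM if index % 2 == 0 else LABEL_OFFSET_TOP


def pt_to_px(points, dpi=96):
    """Convert typographic points to pixels (72 points per inch)."""
    return points * dpi / 72


# 10 pt at 96 DPI is about 13.33 px, which rounds to FONT_SIZE_PX = 13
font_size_px = round(pt_to_px(10))
```

Keying the overrides by short label keeps them stable when the row order changes, whereas the alternating default depends on each category's position after sorting.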

## Inputs
- `PROCESSED_DIR / "Final_Analysis.csv"` (resolved: `data/processed/Final_Analysis.csv`)
    - Expected columns: `QuestionSummary`, `QuestionExcerpt`, `AnswerSummary`, `AnswerExcerpt`, `Category`
    - `IsQuestion` and `HasSolution` are not read from the file; they are computed after loading
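
The computed flags can be illustrated on a single row represented as a plain dict. This is a sketch rather than the notebook's exact code: the helper names `has_content` and `classify_row` are hypothetical, and the sample row is made up.

```python
NO_QUESTION_SENTINEL = "No Question Present"
NO_SOLUTION_SENTINEL = "No Solution Present"


def has_content(value, sentinel):
    """True when a cell holds real text rather than a blank, NaN, or sentinel."""
    text = "" if value is None else str(value).strip()
    return bool(text) and text != "nan" and text != sentinel


def classify_row(row):
    """Derive the three boolean flags for one row (a plain dict in this sketch)."""
    is_question = (
        has_content(row.get("QuestionSummary"), NO_QUESTION_SENTINEL)
        or has_content(row.get("QuestionExcerpt"), NO_QUESTION_SENTINEL)
    )
    has_solution = (
        has_content(row.get("AnswerSummary"), NO_SOLUTION_SENTINEL)
        or has_content(row.get("AnswerExcerpt"), NO_SOLUTION_SENTINEL)
    )
    return {
        "IsQuestion": is_question,
        "HasSolution": has_solution,
        "AnsweredQuestion": is_question and has_solution,
    }


# Made-up row: a real question whose answer fields are a sentinel and a blank
example = classify_row({
    "QuestionSummary": "How should this breaker be sized?",
    "QuestionExcerpt": float("nan"),
    "AnswerSummary": "No Solution Present",
    "AnswerExcerpt": "   ",
})
```

Either field of a pair is enough to set its flag, so a row with only an excerpt still counts as a question.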

## Outputs
- `QA_FIGURES_DIR / "knowledge_bottleneck_bubble.html"` (resolved: `figures/QA/knowledge_bottleneck_bubble.html`) — interactive Plotly chart
- `QA_FIGURES_DIR / "knowledge_bottleneck_bubble.png"` (resolved: `figures/QA/knowledge_bottleneck_bubble.png`) — static PNG image (requires `kaleido`)
- `QA_FIGURES_DIR / "knowledge_bottleneck_bubble.svg"` (resolved: `figures/QA/knowledge_bottleneck_bubble.svg`) — publication-ready SVG (1950×1300 px at 300 DPI)
- Console output: diagnostic table showing aggregated metrics per category, mean resolution rate, and median question count
- In-memory DataFrames: `df` (full dataset), `df_q` (filtered questions), `agg` (category aggregates), `agg_sorted` (sorted by question count)
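
The SVG dimensions quoted above follow from simple arithmetic on the print width, the DPI, and the 1200×800 on-screen layout. A minimal sketch, assuming a hypothetical helper `export_size_px` that is not part of the notebook:

```python
SVG_DPI = 300
SVG_WIDTH_INCHES = 6.5
LAYOUT_WIDTH_PX, LAYOUT_HEIGHT_PX = 1200, 800  # on-screen figure size


def export_size_px(width_inches, dpi, aspect_w, aspect_h):
    """Pixel size for a given print width, preserving the layout aspect ratio."""
    width_px = int(width_inches * dpi)
    height_px = int(width_px * aspect_h / aspect_w)
    return width_px, height_px


svg_width_px, svg_height_px = export_size_px(
    SVG_WIDTH_INCHES, SVG_DPI, LAYOUT_WIDTH_PX, LAYOUT_HEIGHT_PX
)
# Printed height in inches at the same DPI
svg_height_inches = svg_height_px / SVG_DPI
```

With these inputs the export is 1950 px wide and 1300 px tall, which corresponds to roughly 6.5 by 4.33 inches at 300 DPI.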

## Notes / Assumptions
- Requires `pandas`, `plotly`, and optionally `kaleido` (for PNG/SVG export)
- Assumes `QuestionSummary`, `QuestionExcerpt`, `AnswerSummary`, `AnswerExcerpt`, and `Category` columns exist in input CSV
- Questions are identified by non-empty, non-sentinel values in Question fields; solutions by non-empty, non-sentinel Answer fields
- Categories with `total_questions >= median_question_count` and `resolution_rate <= overall_mean_resolution` are flagged as bottlenecks
- Short labels take the text before the first comma in the category name, or the first word when the name contains no comma
- Bubble size scales linearly with `unanswered_questions`: `size = unanswered / max_unanswered * 60 + 10`, so marker sizes range from 10 (no unanswered questions) to 70 (the category with the most)
- The chart includes reference lines marking mean resolution rate and median question count to define the bottleneck quadrant (high volume, low resolution)
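
The labeling, sizing, and bottleneck rules described above can be sketched as three small functions. The names `short_label`, `bubble_size`, and `is_bottleneck` are hypothetical, the zero-maximum guard is an addition of this sketch, and the threshold values in the example are made up for illustration.

```python
def short_label(category):
    """Text before the first comma, or the first word when there is no comma."""
    if "," in category:
        return category.split(",")[0].strip()
    words = category.split()
    return words[0] if words else category


def bubble_size(unanswered, max_unanswered, min_size=10, span=60):
    """Linear scaling: 0 unanswered maps to min_size, the maximum to min_size + span."""
    if max_unanswered <= 0:
        return min_size  # guard added in this sketch; not in the notebook
    return unanswered / max_unanswered * span + min_size


def is_bottleneck(total_questions, resolution_pct, median_questions, mean_resolution_pct):
    """High volume (at or above the median) and low resolution (at or below the mean)."""
    return total_questions >= median_questions and resolution_pct <= mean_resolution_pct


label = short_label("Grounding, Bonding, and Fault Management")
largest = bubble_size(2707, 2707)
smallest = bubble_size(0, 2707)

# Thresholds below are made-up values, not the notebook's computed ones
flagged = is_bottleneck(3665, 26.1, median_questions=1000, mean_resolution_pct=30.0)
not_flagged = is_bottleneck(660, 42.4, median_questions=1000, mean_resolution_pct=30.0)
```

Both comparisons are inclusive, so a category that sits exactly on either reference line still falls in the bottleneck quadrant.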

In [1]:
import sys
from pathlib import Path

# === Paths / Configuration ===
# Dynamically locate the repo root so this notebook works from any CWD
nb_dir = Path.cwd()
repo_root = nb_dir.parent if nb_dir.name == "notebooks" else nb_dir
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))

from src.paths import PROJECT_ROOT, DATA_DIR, PROCESSED_DIR, FIGURES_DIR

# === Imports ===
import json
import ast
import pandas as pd
from typing import Dict, Optional
import plotly.graph_objects as go
import plotly.express as px

# === Setup output directory ===
QA_FIGURES_DIR = FIGURES_DIR / "QA"
QA_FIGURES_DIR.mkdir(parents=True, exist_ok=True)

# === Reuse parsing logic ===
def parse_themes_cell(cell) -> Optional[Dict[str, float]]:
    """Parse the Themes cell robustly, returning dict or None."""
    if pd.isna(cell) or cell == "" or cell is None:
        return None
    cell_str = str(cell).strip()
    if not cell_str:
        return None
    try:
        fixed = cell_str.replace("'", '"')
        parsed = json.loads(fixed)
        if isinstance(parsed, dict):
            return {k: float(v) for k, v in parsed.items() if isinstance(v, (int, float))}
    except (json.JSONDecodeError, ValueError, TypeError):
        pass
    try:
        parsed = ast.literal_eval(cell_str)
        if isinstance(parsed, dict):
            return {k: float(v) for k, v in parsed.items() if isinstance(v, (int, float))}
    except (ValueError, SyntaxError, TypeError):
        pass
    return None

def get_top_theme(themes_dict: Optional[Dict[str, float]]) -> Optional[str]:
    """Get the theme with highest percentage."""
    if not themes_dict:
        return None
    sorted_themes = sorted(themes_dict.items(), key=lambda x: (-x[1], x[0]))
    if sorted_themes:
        return sorted_themes[0][0]
    return None

# === Load and preprocess data ===
FINAL_ANALYSIS_CSV = PROCESSED_DIR / "Final_Analysis.csv"
df = pd.read_csv(FINAL_ANALYSIS_CSV)
print(f"Loaded {len(df)} rows from {FINAL_ANALYSIS_CSV.name}\n")

# Define sentinel strings
NO_QUESTION_SENTINEL = "No Question Present"
NO_SOLUTION_SENTINEL = "No Solution Present"

# Create IsQuestion column
def is_question_present(row):
    q_summary = str(row.get("QuestionSummary", "")).strip()
    q_excerpt = str(row.get("QuestionExcerpt", "")).strip()
    # bool() keeps the column strictly boolean; a bare and/or chain can return ""
    return bool(
        (q_summary and q_summary != "nan" and q_summary != NO_QUESTION_SENTINEL) or
        (q_excerpt and q_excerpt != "nan" and q_excerpt != NO_QUESTION_SENTINEL)
    )

df["IsQuestion"] = df.apply(is_question_present, axis=1)

# Create HasSolution column
def has_solution_present(row):
    a_summary = str(row.get("AnswerSummary", "")).strip()
    a_excerpt = str(row.get("AnswerExcerpt", "")).strip()
    # bool() keeps the column strictly boolean; a bare and/or chain can return ""
    return bool(
        (a_summary and a_summary != "nan" and a_summary != NO_SOLUTION_SENTINEL) or
        (a_excerpt and a_excerpt != "nan" and a_excerpt != NO_SOLUTION_SENTINEL)
    )

df["HasSolution"] = df.apply(has_solution_present, axis=1)

# Create AnsweredQuestion column
df["AnsweredQuestion"] = df["IsQuestion"] & df["HasSolution"]

# Filter to question rows only with valid Category
df_q = df[(df["IsQuestion"] == True) & (df["Category"].notna())].copy()

# === Aggregate by Category ===
MIN_QUESTIONS_THRESHOLD = 5

agg = df_q.groupby("Category").agg(
    total_questions=("IsQuestion", "count"),
    answered_questions=("AnsweredQuestion", "sum")
).reset_index()

agg["unanswered_questions"] = agg["total_questions"] - agg["answered_questions"]
agg["resolution_rate"] = agg["answered_questions"] / agg["total_questions"]

# Filter out categories with too few questions
agg = agg[agg["total_questions"] >= MIN_QUESTIONS_THRESHOLD].copy()

# === Print Diagnostics Table ===
print("=== Knowledge Bottleneck Summary by Category ===\n")
print(agg.to_string(index=False))
print()

# === Calculate reference lines ===
overall_mean_resolution = agg["resolution_rate"].mean() * 100
median_question_count = agg["total_questions"].median()

print(f"Overall mean resolution rate: {overall_mean_resolution:.1f}%")
print(f"Median question count: {median_question_count:.0f}\n")

# === Sort by total_questions (left to right = low to high) ===
agg_sorted = agg.sort_values("total_questions").reset_index(drop=True)

# === Create short labels (first word/phrase before comma) ===
def get_short_label(category: str) -> str:
    """Extract first word or phrase (before comma) as short label."""
    if "," in category:
        return category.split(",")[0].strip()
    else:
        # Take first word if no comma
        return category.split()[0] if category.split() else category

agg_sorted["ShortLabel"] = agg_sorted["Category"].apply(get_short_label)

# === LABEL POSITIONS ===
# Default alternating positions (even index = bottom, odd index = top)
LABEL_OFFSET_BOTTOM = (0, -25)  # Below the bubble
LABEL_OFFSET_TOP = (0, 25)      # Above the bubble

# ============================================================
# MANUAL LABEL POSITION OVERRIDES
# ============================================================
# Dictionary mapping ShortLabel -> (xshift, yshift) in pixels
# Positive xshift = move right, negative = move left
# Positive yshift = move up, negative = move down
# 
# Add entries here to manually adjust specific label positions:
LABEL_MANUAL_OVERRIDES = {
    "Power": (0, 40),
    "Conductors": (0, -35),
}
# ============================================================

# Identify bottleneck quadrant points (for reference)
agg_sorted["IsBottleneck"] = (
    (agg_sorted["total_questions"] >= median_question_count) & 
    ((agg_sorted["resolution_rate"] * 100) <= overall_mean_resolution)
)

# === Font size: 10pt -> px (assuming 96 DPI: px = pt * 96/72) ===
FONT_SIZE_PX = 13  # 10pt ≈ 13.33px, rounded to 13

# === Create Bubble Chart ===
fig = go.Figure()

# Add bubbles with legend names (this creates the legend)
for idx, row in agg_sorted.iterrows():
    category = row["Category"]
    short_label = row["ShortLabel"]
    x_val = row["total_questions"]
    y_val = row["resolution_rate"] * 100
    bubble_size = row["unanswered_questions"]
    
    # Normalize size for consistent display
    size_normalized = bubble_size / max(agg_sorted["unanswered_questions"]) * 60 + 10
    
    fig.add_trace(go.Scatter(
        x=[x_val],
        y=[y_val],
        mode='markers',
        marker=dict(
            size=size_normalized,
            color=[idx],  # list form so the colorscale maps this single point's value
            colorscale='Viridis',
            cmin=0,
            cmax=len(agg_sorted) - 1,
            opacity=0.7,
            line=dict(width=1, color='black')
        ),
        name=f"{short_label}: {category}",  # Legend entry: "Short: Full Name"
        customdata=[category],
        hovertemplate=(
            f"<b>{category}</b><br>" +
            f"Total Questions: {x_val}<br>" +
            f"Resolution Rate: {y_val:.1f}%<br>" +
            f"Unanswered: {bubble_size}<extra></extra>"
        ),
        showlegend=True
    ))

# === Add short labels next to each point ===
for idx, row in agg_sorted.iterrows():
    short_label = row["ShortLabel"]
    x_val = row["total_questions"]
    y_val = row["resolution_rate"] * 100
    
    # Check for manual override first
    if short_label in LABEL_MANUAL_OVERRIDES:
        xshift, yshift = LABEL_MANUAL_OVERRIDES[short_label]
    else:
        # Alternate: even index = bottom, odd index = top
        if idx % 2 == 0:
            xshift, yshift = LABEL_OFFSET_BOTTOM
        else:
            xshift, yshift = LABEL_OFFSET_TOP
    
    fig.add_annotation(
        x=x_val,
        y=y_val,
        text=short_label,
        showarrow=False,
        xshift=xshift,
        yshift=yshift,
        font=dict(size=FONT_SIZE_PX, color="black"),
        bgcolor="rgba(211,211,211,0.8)",  # Light gray background
        borderpad=2
    )

# Add horizontal reference line (mean resolution rate)
fig.add_hline(
    y=overall_mean_resolution,
    line_dash="dash",
    line_color="gray",
    annotation_text=f"Mean Resolution: {overall_mean_resolution:.1f}%",
    annotation_position="top right",
    annotation_font=dict(size=FONT_SIZE_PX)
)

# Add vertical reference line (median question count)
fig.add_vline(
    x=median_question_count,
    line_dash="dash",
    line_color="gray",
    annotation_text=f"Median Questions: {median_question_count:.0f}",
    annotation_position="top right",
    annotation_font=dict(size=FONT_SIZE_PX)
)

# Add quadrant annotation (bottom-right = bottlenecks)
min_resolution = agg_sorted["resolution_rate"].min() * 100
bottleneck_y_position = min_resolution / 2 + 2.5  # Halfway down from min to 0

fig.add_annotation(
    x=agg_sorted["total_questions"].max() * 0.85,
    y=bottleneck_y_position,
    text="High Volume / Low Resolution<br>(Knowledge Bottlenecks)",
    showarrow=False,
    font=dict(size=FONT_SIZE_PX, color="darkred"),
    bgcolor="rgba(211,211,211,0.8)",
    bordercolor="darkred",
    borderwidth=1
)

fig.update_layout(
    # No title (removed for publication)
    font=dict(size=FONT_SIZE_PX),
    xaxis=dict(
        title=dict(text="Total Questions", font=dict(size=FONT_SIZE_PX)),
        tickfont=dict(size=FONT_SIZE_PX)
    ),
    yaxis=dict(
        title=dict(text="Resolution Rate (%)", font=dict(size=FONT_SIZE_PX)),
        tickfont=dict(size=FONT_SIZE_PX),
        range=[0, 105]
    ),
    width=1200,
    height=800,
    margin=dict(t=40, l=60, r=60, b=60),
    showlegend=True,
    legend=dict(
        title=dict(text="Categories", font=dict(size=FONT_SIZE_PX)),
        font=dict(size=FONT_SIZE_PX - 2),
        yanchor="top",
        y=0.99,
        xanchor="right",
        x=0.99,  # Inside the plot area (top-right corner)
        bgcolor="rgba(255,255,255,0.85)",
        bordercolor="gray",
        borderwidth=1
    ),
    hovermode='closest'
)

# === Save outputs using portable paths ===
BUBBLE_HTML = QA_FIGURES_DIR / "knowledge_bottleneck_bubble.html"
BUBBLE_PNG = QA_FIGURES_DIR / "knowledge_bottleneck_bubble.png"
BUBBLE_SVG = QA_FIGURES_DIR / "knowledge_bottleneck_bubble.svg"

fig.write_html(str(BUBBLE_HTML))
print(f"Interactive bubble chart saved to {BUBBLE_HTML.relative_to(PROJECT_ROOT)}")

# === SVG export: 6.5 inches at 300 DPI, preserve aspect ratio ===
SVG_DPI = 300
SVG_WIDTH_INCHES = 6.5
SVG_WIDTH_PX = int(SVG_WIDTH_INCHES * SVG_DPI)  # 1950 px
SVG_HEIGHT_PX = int(SVG_WIDTH_PX * (800 / 1200))  # Preserve aspect ratio -> 1300 px

# Save static images (requires kaleido)
try:
    fig.write_image(str(BUBBLE_PNG), scale=2)
    print(f"Static PNG saved to {BUBBLE_PNG.relative_to(PROJECT_ROOT)}")
except Exception as e:
    print(f"Note: Could not save PNG (requires kaleido): {e}")

try:
    fig.write_image(str(BUBBLE_SVG), format="svg", width=SVG_WIDTH_PX, height=SVG_HEIGHT_PX)
    print(f"Publication SVG saved to {BUBBLE_SVG.relative_to(PROJECT_ROOT)} ({SVG_WIDTH_INCHES}\" x {SVG_WIDTH_INCHES * (800/1200):.2f}\" at {SVG_DPI} DPI)")
except Exception as e:
    print(f"SVG export requires kaleido: {e}")

fig.show()

Loaded 16862 rows from Final_Analysis.csv

=== Knowledge Bottleneck Summary by Category ===

                                           Category  total_questions  answered_questions  unanswered_questions  resolution_rate
  Code Interpretation, Safety, and Field Operations             2689                 601                  2088         0.223503
          Conductors, Raceway, and Physical Routing             3665                 958                  2707         0.261392
       Devices, Lighting, and Utilization Equipment             1311                 336                   975         0.256293
           Grounding, Bonding, and Fault Management              660                 280                   380         0.424242
   Low-Voltage, Communications, and Control Systems              249                  45                   204         0.180723
          Motors, HVAC, and Specialized Power Loads              312                  76                   236         0.243590
           