BHI2025/prompt-evaluator-controller-external

🚀 AI Interaction Evaluation Framework

An AI-powered tool that uses Anthropic's Claude to analyze chat interactions. It evaluates AI interaction patterns from chat.json files across multiple dimensions and is designed for 1-hour app creation experiments.

✨ Features

  • Chat Interaction Analysis: Evaluates AI interaction patterns and best practices
  • Claude Sonnet 4 Integration: Uses Anthropic's Claude AI for intelligent analysis
  • Comprehensive Chat File Discovery: Finds chat.json files in any directory structure
  • Multi-Branch Evaluation: Analyzes student branches following dev-FirstName-LastName_bcgprod format
  • Message Aggregation: Combines messages from all chat files in each branch for complete analysis
  • Multiple Chat Formats: Supports standard JSON arrays, GitHub Copilot format, and other chat file structures
  • Excel Export: Multiple sheets with detailed breakdowns and justifications
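
To make the discovery and branch-naming features above concrete, here is a minimal sketch. The helper names and the regular expression are assumptions made for illustration; the tool's own implementation may differ.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Assumed pattern for branches named dev-FirstName-LastName_bcgprod.
BRANCH_RE = re.compile(r"^dev-(?P<first>[^-_]+)-(?P<last>[^_]+)_bcgprod$")


def find_chat_files(root: str) -> List[Path]:
    """Return every chat.json under root, at any depth, in sorted order."""
    return sorted(Path(root).rglob("chat.json"))


def parse_branch(branch: str) -> Optional[Tuple[str, str]]:
    """Return (first_name, last_name), or None if the branch does not match."""
    match = BRANCH_RE.match(branch)
    if match is None:
        return None
    return match.group("first"), match.group("last")
```

Branches that do not follow the naming format return None, so a caller can skip them instead of failing.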

🛠️ Installation

# Clone the repository
git clone <repository-url>
cd prompt-evaluator-controller-external

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set API key
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"

🚀 Usage

Chat Interaction Evaluation

python scripts/chat_interaction_evaluator.py --api-key "$ANTHROPIC_API_KEY"

Command Line Options

--api-key        # Anthropic API key (required)
--config         # Configuration file path (default: configs/repositories.yaml)
--skip-creators  # Creators to exclude from evaluation
--output         # Output Excel file path
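
For readers who want to see how these options fit together, the following is a hypothetical argparse definition that mirrors the list above. Accepting several names after --skip-creators and leaving --output without a default are assumptions; the actual script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Evaluate chat interactions")
    parser.add_argument("--api-key", required=True, help="Anthropic API key")
    parser.add_argument(
        "--config",
        default="configs/repositories.yaml",
        help="Configuration file path",
    )
    # Assumption: creators are passed as space-separated names.
    parser.add_argument(
        "--skip-creators", nargs="*", default=[], help="Creators to skip"
    )
    # Assumption: no default output path is defined here.
    parser.add_argument("--output", default=None, help="Output Excel file path")
    return parser
```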

📊 Analysis Categories

Chat Interaction Analysis (1-Hour App Creation Focus)

Prompt Categorization

  • Rapid Prototyping: Quick implementation requests for core features
  • Feature Iteration: Refining and improving existing features
  • Problem Solving: Debugging, error fixing, technical issues
  • Architecture Decisions: High-level design and structure choices
  • Time Management: Time-related questions and constraints
  • Clarification: Understanding requirements or seeking examples
  • Other: Meta-questions or off-topic discussions

Interaction Dynamics

  • Conversation Metrics: Total messages, user/assistant message counts, conversation ratios
  • Message Characteristics: Average message lengths, message density per minute
  • Technical Engagement: Count of technical questions and development-focused interactions
  • Response Patterns: Quick vs detailed responses, response variety ratios
  • Development Flow: Analysis of conversation progression and efficiency
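
The conversation metrics listed above can be computed directly from a list of role/content messages. This is a rough sketch: the dictionary keys and the 60-minute default (taken from the 1-hour experiment framing) are assumptions, not the tool's actual column names.

```python
def conversation_metrics(messages, duration_minutes=60.0):
    """Compute simple counts, ratios, and averages from chat messages."""
    user = [m for m in messages if m.get("role") == "user"]
    assistant = [m for m in messages if m.get("role") == "assistant"]

    def average_length(group):
        # Average character count; 0.0 for an empty group.
        if not group:
            return 0.0
        return sum(len(m.get("content", "")) for m in group) / len(group)

    return {
        "total_messages": len(messages),
        "user_messages": len(user),
        "assistant_messages": len(assistant),
        # Assistant responses per user message.
        "assistant_per_user": len(assistant) / len(user) if user else 0.0,
        "avg_user_length": average_length(user),
        "avg_assistant_length": average_length(assistant),
        # Message density per minute over the assumed session length.
        "messages_per_minute": (
            len(messages) / duration_minutes if duration_minutes > 0 else 0.0
        ),
    }
```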

Emotion & Tone Detection

  • Sentiment Indicators: Analysis of urgent, focused, stressed, confident, and frustrated language patterns
  • Conversation Characteristics: Question/exclamation counts, message length analysis
  • AI Response Patterns: Encouraging, helpful, and technical response analysis
  • Sentiment Diversity: Measurement of emotional range and engagement variety
  • Interaction Quality: Sentiment ratios and overall conversation tone assessment
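
As an illustration of keyword-based sentiment indicators, the sketch below counts matches per category in user messages. The keyword lists and the definition of sentiment diversity (share of categories with at least one match) are assumptions chosen for the example; the tool's own lists and formulas are not documented here.

```python
import re

# Hypothetical keyword lists; the real tool's vocabulary may differ.
SENTIMENT_KEYWORDS = {
    "urgent": {"urgent", "asap", "hurry", "quickly", "deadline"},
    "focused": {"focus", "priority", "core", "first"},
    "stressed": {"stressed", "overwhelmed", "panic", "worried"},
    "confident": {"confident", "sure", "definitely", "easy"},
    "frustrated": {"frustrated", "annoying", "broken", "stuck"},
}


def sentiment_indicators(messages):
    """Count keyword hits, questions, and exclamations in user messages."""
    counts = {label: 0 for label in SENTIMENT_KEYWORDS}
    questions = 0
    exclamations = 0
    for message in messages:
        if message.get("role") != "user":
            continue
        text = message.get("content", "")
        words = re.findall(r"[a-z']+", text.lower())
        for label, keywords in SENTIMENT_KEYWORDS.items():
            counts[label] += sum(1 for word in words if word in keywords)
        questions += text.count("?")
        exclamations += text.count("!")
    active = sum(1 for label in counts if counts[label] > 0)
    return {
        **counts,
        "question_count": questions,
        "exclamation_count": exclamations,
        # Assumed definition: fraction of categories seen at least once.
        "sentiment_diversity": active / len(SENTIMENT_KEYWORDS),
    }
```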

📁 Output Structure

Excel Report Sheets

  • Executive Summary: High-level overview with aggregated metrics
  • Prompt Categorization: Analysis of prompt types and patterns with counts and examples
  • Interaction Dynamics: Conversation flow and efficiency metrics with development phase analysis
  • Emotion & Tone Analysis: Sentiment tracking and tone assessment with time-pressure indicators
  • Chat Metrics: Conversation statistics and patterns across all chat files
  • All Branch Results: Complete breakdown of every branch analysis
  • Errors: Detailed error information and troubleshooting data

🔧 Recent Improvements

Message Aggregation Fix

  • Issue: Previously only analyzed the highest-scoring chat file per branch
  • Fix: Now aggregates all messages from all chat files in each branch
  • Result: Accurate total message counts and complete conversation analysis
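
The aggregation described above might look roughly like the following. This is a simplified sketch that handles only JSON arrays and objects with a "messages" key; the function name and the decision to skip unreadable files silently are assumptions.

```python
import json
from pathlib import Path


def aggregate_branch_messages(branch_dir):
    """Concatenate messages from every chat.json found under branch_dir."""
    messages = []
    for path in sorted(Path(branch_dir).rglob("chat.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Unreadable or malformed file: skip it rather than abort.
            continue
        if isinstance(data, dict):
            data = data.get("messages", [])
        if isinstance(data, list):
            messages.extend(m for m in data if isinstance(m, dict))
    return messages
```

Because every file contributes to one list, the total message count reflects the whole branch instead of a single chat file.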

GitHub Copilot Format Support

  • Issue: Could not parse GitHub Copilot chat.json format
  • Fix: Added _parse_github_copilot_format() method for proper extraction
  • Result: Supports both standard JSON arrays and GitHub Copilot request/response format

Analysis Structure Robustness

  • Issue: Analysis sheets showed 0 or N/A values due to structure mismatches
  • Fix: Added fallback logic for both nested and flat response structures
  • Result: Reliable data extraction regardless of AI response format
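
One way to implement a nested-or-flat fallback is a small lookup helper like the one below. The helper name and the section and key names in the usage are hypothetical and serve only to show the idea.

```python
def get_field(analysis, section, key, default="N/A"):
    """Read key from analysis[section] if present, else from the top level."""
    if not isinstance(analysis, dict):
        return default
    nested = analysis.get(section)
    if isinstance(nested, dict) and key in nested:
        return nested[key]
    return analysis.get(key, default)


# Both response shapes yield the same value.
nested_response = {"interaction_dynamics": {"development_flow": "steady"}}
flat_response = {"development_flow": "steady"}
nested_value = get_field(nested_response, "interaction_dynamics", "development_flow")
flat_value = get_field(flat_response, "interaction_dynamics", "development_flow")
```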

Enhanced Error Handling

  • Issue: Silent failures with no debugging information
  • Fix: Added comprehensive debug logging and error reporting
  • Result: Better troubleshooting and issue identification

Meaningful Column Redesign

  • Issue: Analysis sheets showed all zeros due to AI analysis failures
  • Fix: Replaced AI-dependent columns with data-driven metrics calculated from chat content
  • Result: Rich, meaningful data showing conversation patterns, technical engagement, and sentiment analysis

New Interaction Dynamics Metrics

  • Conversation Metrics: Total messages, user/assistant ratios, message density
  • Technical Analysis: Technical question counts, response patterns, development focus
  • Message Characteristics: Average message lengths, response variety, conversation flow

New Emotion & Tone Metrics

  • Sentiment Analysis: Keyword-based analysis of urgent, focused, stressed, confident, frustrated indicators
  • Conversation Quality: Question/exclamation counts, sentiment diversity, interaction patterns
  • AI Response Analysis: Encouraging, helpful, and technical response pattern detection

⚙️ Configuration

Configure repositories in configs/repositories.yaml:

repositories:
  streamlit:
    - name: "bhi-streamlit-a"
      url: "https://github.com/org/repo.git"
      language: "streamlit"
      framework: "streamlit"
  
  react_node:
    - name: "bhi-react-node-a"
      url: "https://github.com/org/repo.git"
      language: "javascript"
      framework: "react-node"
  
  react_fastapi:
    - name: "bhi-react-fastapi-a"
      url: "https://github.com/org/repo.git"
      language: "python"
      framework: "react-fastapi"
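
Once the YAML above is parsed into a dictionary (for example with PyYAML's yaml.safe_load), the grouped entries can be flattened into one list. The function name and the added "group" field are assumptions for illustration.

```python
def iter_repositories(config):
    """Yield each repository entry, tagged with its group name."""
    for group, entries in (config.get("repositories") or {}).items():
        for entry in entries or []:
            yield {"group": group, **entry}


# Dictionary mirroring part of the YAML structure shown above.
example_config = {
    "repositories": {
        "streamlit": [
            {
                "name": "bhi-streamlit-a",
                "url": "https://github.com/org/repo.git",
                "language": "streamlit",
                "framework": "streamlit",
            }
        ],
        "react_node": [
            {
                "name": "bhi-react-node-a",
                "url": "https://github.com/org/repo.git",
                "language": "javascript",
                "framework": "react-node",
            }
        ],
    }
}
repository_names = [repo["name"] for repo in iter_repositories(example_config)]
```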

🔍 Troubleshooting

Common Issues

Low Message Counts

  • Symptom: Branches showing only 2 total messages
  • Cause: Chat files in unsupported format or parsing errors
  • Solution: Check that chat files use a supported format (JSON array, GitHub Copilot, object with messages, or text)

Analysis Sheets Showing 0 or N/A

  • Symptom: Some columns in analysis sheets show 0 or N/A
  • Cause: AI analysis response structure mismatch (for AI-dependent columns only)
  • Solution: Most columns now use data-driven calculations; check debug output for remaining AI-dependent columns

API Connection Errors

  • Symptom: "Connection error" in analysis results
  • Cause: Invalid or missing Anthropic API key
  • Solution: Verify ANTHROPIC_API_KEY environment variable is set correctly
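
A quick pre-flight check can catch a missing key before any API call is made. The sketch below assumes a fallback from the --api-key value to the ANTHROPIC_API_KEY environment variable; whether the tool itself performs that fallback is not confirmed here.

```python
import os


def resolve_api_key(cli_value=None, env=os.environ):
    """Return the API key from the CLI value or the environment, or raise."""
    key = (cli_value or env.get("ANTHROPIC_API_KEY", "")).strip()
    if not key:
        raise ValueError(
            "No Anthropic API key: pass --api-key or set ANTHROPIC_API_KEY"
        )
    return key
```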

Debug Information

The tool now provides detailed debug output including:

  • Chat file discovery results
  • Message parsing statistics
  • Analysis response structure information
  • Error details for failed branches

New Column Types

Data-Driven Metrics (Reliable)

These columns are calculated directly from chat content and are always populated:

  • Message Counts: Total, user, and assistant message counts
  • Conversation Ratios: Assistant responses per user message
  • Message Characteristics: Average lengths, density, technical keywords
  • Sentiment Indicators: Keyword-based analysis of emotional language
  • Response Patterns: Quick vs detailed response analysis

AI-Dependent Metrics (May be N/A)

These columns depend on AI analysis and may show N/A if analysis fails:

  • Development Flow: High-level conversation flow assessment
  • Time Efficiency: Overall efficiency rating
  • Strategic Thinking: Strategic vs tactical thinking assessment
  • Overall Sentiment: AI-generated sentiment analysis
  • Stress Resilience: AI assessment of stress handling
  • Efficiency Mindset: AI assessment of efficiency focus

Supported Chat File Formats

  1. Standard JSON Array: [{"role": "user", "content": "..."}, ...]
  2. GitHub Copilot Format: {"requests": [{"message": {...}, "response": [...]}]}
  3. Object with Messages: {"messages": [...]}
  4. Text Format: Plain text with role indicators
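
The four formats can be normalized into one list of role/content messages. The sketch below is not the tool's _parse_github_copilot_format() method. It assumes the Copilot prompt lives in message["text"], that response parts carry a "value" string, and that text files mark roles with "User:" and "Assistant:" prefixes.

```python
import json

ROLE_PREFIXES = {"user:": "user", "assistant:": "assistant"}  # assumed markers


def _parse_copilot(requests):
    """Convert Copilot-style request/response pairs into messages."""
    messages = []
    for request in requests:
        if not isinstance(request, dict):
            continue
        message = request.get("message")
        prompt = message.get("text", "") if isinstance(message, dict) else ""
        if prompt:
            messages.append({"role": "user", "content": prompt})
        parts = [
            part["value"]
            for part in request.get("response") or []
            if isinstance(part, dict) and isinstance(part.get("value"), str)
        ]
        if parts:
            messages.append({"role": "assistant", "content": "".join(parts)})
    return messages


def _parse_text(text):
    """Split plain text into messages at lines starting with a role prefix."""
    messages = []
    for line in text.splitlines():
        stripped = line.strip()
        for prefix, role in ROLE_PREFIXES.items():
            if stripped.lower().startswith(prefix):
                content = stripped[len(prefix):].strip()
                messages.append({"role": role, "content": content})
                break
        else:
            # Continuation line: append to the previous message, if any.
            if messages and stripped:
                messages[-1]["content"] += "\n" + stripped
    return messages


def normalize_chat(raw_text):
    """Return a list of {"role", "content"} dicts for any supported format."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return _parse_text(raw_text)
    if isinstance(data, list):
        return [m for m in data if isinstance(m, dict) and "role" in m]
    if isinstance(data, dict) and "requests" in data:
        return _parse_copilot(data["requests"])
    if isinstance(data, dict) and "messages" in data:
        return [m for m in data["messages"] if isinstance(m, dict)]
    return []
```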

📈 Performance Notes

  • Message Aggregation: All chat files in a branch are now processed together
  • Memory Usage: Large conversations may require more memory
  • API Calls: Three analysis calls per branch (Prompt Categorization, Interaction Dynamics, Emotion & Tone)
  • Processing Time: Depends on conversation length and API response times
  • Data-Driven Analysis: Most metrics are calculated locally from chat content, reducing API dependency
  • Reliability: Data-driven columns provide consistent results regardless of AI analysis success
