An AI-powered evaluation tool that analyzes chat interactions using Claude AI. It evaluates AI interaction patterns from `chat.json` files across multiple dimensions and is designed specifically for 1-hour app creation experiments.
- Chat Interaction Analysis: Evaluates AI interaction patterns and best practices
- Claude Sonnet 4 Integration: Uses Anthropic's Claude AI for intelligent analysis
- Comprehensive Chat File Discovery: Finds chat.json files in any directory structure
- Multi-Branch Evaluation: Analyzes student branches following the `dev-FirstName-LastName_bcgprod` naming format
- Message Aggregation: Combines messages from all chat files in each branch for complete analysis
- Multiple Chat Formats: Supports standard JSON arrays, GitHub Copilot format, and other chat file structures
- Excel Export: Multiple sheets with detailed breakdowns and justifications
```bash
# Clone the repository
git clone <repository-url>
cd prompt-evaluator-controller

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set API key
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
```
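Since the key can come from either the CLI flag or the environment, a resolution helper is a common pattern. A minimal sketch; the actual precedence used by the script is an assumption:

```python
import argparse
import os

def resolve_api_key(argv=None):
    """Resolve the Anthropic API key: the --api-key flag wins, then the
    ANTHROPIC_API_KEY environment variable (this precedence is assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key", default=None)
    args, _ = parser.parse_known_args(argv)
    key = args.api_key or os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise SystemExit("No API key: pass --api-key or set ANTHROPIC_API_KEY")
    return key
```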
```bash
python scripts/chat_interaction_evaluator.py --api-key "your-key"
```

- `--api-key`: Anthropic API key (required)
- `--config`: Configuration file path (default: `configs/repositories.yaml`)
- `--skip-creators`: List of creators to skip during evaluation
- `--output`: Output Excel file path
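A fuller invocation combining these options might look like the following; the file paths are illustrative, and passing multiple names to `--skip-creators` as space-separated arguments is an assumption about the argument parser:

```bash
python scripts/chat_interaction_evaluator.py \
  --api-key "$ANTHROPIC_API_KEY" \
  --config configs/repositories.yaml \
  --skip-creators "Jane Doe" "John Smith" \
  --output results/chat_evaluation.xlsx
```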
- Rapid Prototyping: Quick implementation requests for core features
- Feature Iteration: Refining and improving existing features
- Problem Solving: Debugging, error fixing, technical issues
- Architecture Decisions: High-level design and structure choices
- Time Management: Time-related questions and constraints
- Clarification: Understanding requirements or seeking examples
- Other: Meta-questions or off-topic discussions
- Conversation Metrics: Total messages, user/assistant message counts, conversation ratios
- Message Characteristics: Average message lengths, message density per minute
- Technical Engagement: Count of technical questions and development-focused interactions
- Response Patterns: Quick vs detailed responses, response variety ratios
- Development Flow: Analysis of conversation progression and efficiency
- Sentiment Indicators: Analysis of urgent, focused, stressed, confident, and frustrated language patterns
- Conversation Characteristics: Question/exclamation counts, message length analysis
- AI Response Patterns: Encouraging, helpful, and technical response analysis
- Sentiment Diversity: Measurement of emotional range and engagement variety
- Interaction Quality: Sentiment ratios and overall conversation tone assessment
- Executive Summary: High-level overview with aggregated metrics
- Prompt Categorization: Analysis of prompt types and patterns with counts and examples
- Interaction Dynamics: Conversation flow and efficiency metrics with development phase analysis
- Emotion & Tone Analysis: Sentiment tracking and tone assessment with time-pressure indicators
- Chat Metrics: Conversation statistics and patterns across all chat files
- All Branch Results: Complete breakdown of every branch analysis
- Errors: Detailed error information and troubleshooting data
- Issue: Previously only analyzed the highest-scoring chat file per branch
- Fix: Now aggregates all messages from all chat files in each branch
- Result: Accurate total message counts and complete conversation analysis
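The aggregation fix above can be sketched with a recursive glob over each branch checkout. The helper name is illustrative, and the sketch assumes each file parses to a standard JSON array of messages (other formats would be normalized first):

```python
import json
from pathlib import Path

def aggregate_branch_messages(branch_dir):
    """Collect messages from every chat.json under a branch checkout.
    Assumes each file parses to a list of {"role", "content"} dicts."""
    messages = []
    for chat_file in sorted(Path(branch_dir).rglob("chat.json")):
        try:
            data = json.loads(chat_file.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue  # skip unreadable files rather than failing the branch
        if isinstance(data, list):
            messages.extend(data)
    return messages
```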
- Issue: Could not parse GitHub Copilot chat.json format
- Fix: Added `_parse_github_copilot_format()` method for proper extraction
- Result: Supports both standard JSON arrays and GitHub Copilot request/response format
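The idea behind `_parse_github_copilot_format()` can be sketched as flattening the request/response pairs into role/content messages. The inner field names used here (`"text"` on the request message, `"value"` on the response parts) are assumptions about the export layout, not confirmed by the source:

```python
def parse_github_copilot_messages(data):
    """Flatten a GitHub Copilot chat export ({"requests": [...]}) into a
    list of {"role", "content"} messages. Inner field names are assumed."""
    messages = []
    for request in data.get("requests", []):
        user_text = request.get("message", {}).get("text", "")
        if user_text:
            messages.append({"role": "user", "content": user_text})
        parts = request.get("response", [])
        reply = "".join(p.get("value", "") for p in parts if isinstance(p, dict))
        if reply:
            messages.append({"role": "assistant", "content": reply})
    return messages
```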
- Issue: Analysis sheets showed 0 or N/A values due to structure mismatches
- Fix: Added fallback logic for both nested and flat response structures
- Result: Reliable data extraction regardless of AI response format
- Issue: Silent failures with no debugging information
- Fix: Added comprehensive debug logging and error reporting
- Result: Better troubleshooting and issue identification
- Issue: Analysis sheets showed all zeros due to AI analysis failures
- Fix: Replaced AI-dependent columns with data-driven metrics calculated from chat content
- Result: Rich, meaningful data showing conversation patterns, technical engagement, and sentiment analysis
- Conversation Metrics: Total messages, user/assistant ratios, message density
- Technical Analysis: Technical question counts, response patterns, development focus
- Message Characteristics: Average message lengths, response variety, conversation flow
- Sentiment Analysis: Keyword-based analysis of urgent, focused, stressed, confident, frustrated indicators
- Conversation Quality: Question/exclamation counts, sentiment diversity, interaction patterns
- AI Response Analysis: Encouraging, helpful, and technical response pattern detection
Configure repositories in `configs/repositories.yaml`:

```yaml
repositories:
  streamlit:
    - name: "bhi-streamlit-a"
      url: "https://github.com/org/repo.git"
      language: "streamlit"
      framework: "streamlit"
  react_node:
    - name: "bhi-react-node-a"
      url: "https://github.com/org/repo.git"
      language: "javascript"
      framework: "react-node"
  react_fastapi:
    - name: "bhi-react-fastapi-a"
      url: "https://github.com/org/repo.git"
      language: "python"
      framework: "react-fastapi"
```
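Once the YAML is loaded (e.g. with `yaml.safe_load`), iterating every repository across the framework groups reduces to a nested loop. This helper is illustrative, not the tool's actual API:

```python
def iter_repositories(config):
    """Yield (group, name, url) for each repository entry in the parsed
    repositories.yaml structure shown above."""
    for group, repos in config.get("repositories", {}).items():
        for repo in repos:
            yield group, repo["name"], repo["url"]
```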
- Symptom: Branches showing only 2 total messages
- Cause: Chat files in unsupported format or parsing errors
- Solution: Check whether the chat files are in a supported format (JSON array, GitHub Copilot, or text)
- Symptom: Some columns in analysis sheets show empty values
- Cause: AI analysis response structure mismatch (for AI-dependent columns only)
- Solution: Most columns now use data-driven calculations; check debug output for remaining AI-dependent columns
- Symptom: "Connection error" in analysis results
- Cause: Invalid or missing Anthropic API key
- Solution: Verify the `ANTHROPIC_API_KEY` environment variable is set correctly
The tool now provides detailed debug output including:
- Chat file discovery results
- Message parsing statistics
- Analysis response structure information
- Error details for failed branches
These columns are calculated directly from chat content and are always populated:
- Message Counts: Total, user, and assistant message counts
- Conversation Ratios: Assistant responses per user message
- Message Characteristics: Average lengths, density, technical keywords
- Sentiment Indicators: Keyword-based analysis of emotional language
- Response Patterns: Quick vs detailed response analysis
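The always-populated columns above can be computed with plain counting over the aggregated messages. A minimal sketch; the sentiment keyword list is illustrative, not the tool's actual list:

```python
def compute_basic_metrics(messages):
    """Compute data-driven metrics from a list of {"role", "content"}
    messages. The urgent-keyword list is illustrative only."""
    user = [m for m in messages if m.get("role") == "user"]
    assistant = [m for m in messages if m.get("role") == "assistant"]
    urgent_words = ("quick", "asap", "hurry", "fast", "urgent")
    all_text = " ".join(m.get("content", "").lower() for m in messages)
    return {
        "total_messages": len(messages),
        "user_messages": len(user),
        "assistant_messages": len(assistant),
        # conversation ratio: assistant responses per user message
        "assistant_per_user": len(assistant) / len(user) if user else 0.0,
        "avg_user_length": sum(len(m.get("content", "")) for m in user) / len(user) if user else 0.0,
        # keyword-based sentiment indicator (counts occurrences, not semantics)
        "urgent_keyword_hits": sum(all_text.count(w) for w in urgent_words),
    }
```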
These columns depend on AI analysis and may show N/A if analysis fails:
- Development Flow: High-level conversation flow assessment
- Time Efficiency: Overall efficiency rating
- Strategic Thinking: Strategic vs tactical thinking assessment
- Overall Sentiment: AI-generated sentiment analysis
- Stress Resilience: AI assessment of stress handling
- Efficiency Mindset: AI assessment of efficiency focus
- Standard JSON Array: `[{"role": "user", "content": "..."}, ...]`
- GitHub Copilot Format: `{"requests": [{"message": {...}, "response": [...]}]}`
- Object with Messages: `{"messages": [...]}`
- Text Format: Plain text with role indicators
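Dispatching on these formats can be done with simple shape checks on the parsed payload. A hedged sketch; the tool's internal labels may differ:

```python
def detect_chat_format(data):
    """Classify a parsed chat payload into one of the supported shapes.
    Label names are illustrative."""
    if isinstance(data, list):
        return "json_array"
    if isinstance(data, dict):
        if "requests" in data:
            return "github_copilot"
        if "messages" in data:
            return "object_with_messages"
    if isinstance(data, str):
        return "text"
    return "unknown"
```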
- Message Aggregation: All chat files in a branch are now processed together
- Memory Usage: Large conversations may require more memory
- API Calls: Three analysis calls per branch (Prompt Categorization, Interaction Dynamics, Emotion & Tone)
- Processing Time: Depends on conversation length and API response times
- Data-Driven Analysis: Most metrics are calculated locally from chat content, reducing API dependency
- Reliability: Data-driven columns provide consistent results regardless of AI analysis success