An AI-powered evaluation tool that analyzes chat interactions using Claude AI. It evaluates AI interaction patterns from `chat.json` files across multiple dimensions and is designed specifically for 1-hour app creation experiments.
- Chat Interaction Analysis: Evaluates AI interaction patterns and best practices
- Claude Sonnet 4 Integration: Uses Anthropic's Claude AI for intelligent analysis
- Comprehensive Chat File Discovery: Finds chat.json files in any directory structure
- Multi-Branch Evaluation: Analyzes student branches following the `dev-FirstName-LastName_bcgprod` naming format
- Message Aggregation: Combines messages from all chat files in each branch for complete analysis
- Multiple Chat Formats: Supports standard JSON arrays, GitHub Copilot format, and other chat file structures
- Excel Export: Multiple sheets with detailed breakdowns and justifications
```bash
# Clone the repository
git clone <repository-url>
cd prompt-evaluator-controller

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set API key
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
```
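Since the key can come from either the CLI flag or the environment, a resolution helper is a common pattern. A minimal sketch; the actual precedence used by the script is an assumption:

```python
import argparse
import os

def resolve_api_key(argv=None):
    """Resolve the Anthropic API key: the --api-key flag wins, then the
    ANTHROPIC_API_KEY environment variable (this precedence is assumed)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--api-key", default=None)
    args, _ = parser.parse_known_args(argv)
    key = args.api_key or os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise SystemExit("No API key: pass --api-key or set ANTHROPIC_API_KEY")
    return key
```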
```bash
python scripts/chat_interaction_evaluator.py --api-key "your-key"
```

- `--api-key`: Anthropic API key (required)
- `--config`: Configuration file path (default: `configs/repositories.yaml`)
- `--skip-creators`: List of creators to skip during evaluation
- `--output`: Output Excel file path
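A fuller invocation combining these options might look like the following; the file paths are illustrative, and passing multiple names to `--skip-creators` as space-separated arguments is an assumption about the argument parser:

```bash
python scripts/chat_interaction_evaluator.py \
  --api-key "$ANTHROPIC_API_KEY" \
  --config configs/repositories.yaml \
  --skip-creators "Jane Doe" "John Smith" \
  --output results/chat_evaluation.xlsx
```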
- Rapid Prototyping: Quick implementation requests for core features
- Feature Iteration: Refining and improving existing features
- Problem Solving: Debugging, error fixing, technical issues
- Architecture Decisions: High-level design and structure choices
- Time Management: Time-related questions and constraints
- Clarification: Understanding requirements or seeking examples
- Other: Meta-questions or off-topic discussions
- Conversation Metrics: Total messages, user/assistant message counts, conversation ratios
- Message Characteristics: Average message lengths, message density per minute
- Technical Engagement: Count of technical questions and development-focused interactions
- Response Patterns: Quick vs detailed responses, response variety ratios
- Development Flow: Analysis of conversation progression and efficiency
- Sentiment Indicators: Analysis of urgent, focused, stressed, confident, and frustrated language patterns
- Conversation Characteristics: Question/exclamation counts, message length analysis
- AI Response Patterns: Encouraging, helpful, and technical response analysis
- Sentiment Diversity: Measurement of emotional range and engagement variety
- Interaction Quality: Sentiment ratios and overall conversation tone assessment
- Executive Summary: High-level overview with aggregated metrics
- Prompt Categorization: Analysis of prompt types and patterns with counts and examples
- Interaction Dynamics: Conversation flow and efficiency metrics with development phase analysis
- Emotion & Tone Analysis: Sentiment tracking and tone assessment with time-pressure indicators
- Chat Metrics: Conversation statistics and patterns across all chat files
- All Branch Results: Complete breakdown of every branch analysis
- Errors: Detailed error information and troubleshooting data
- Issue: Previously only analyzed the highest-scoring chat file per branch
- Fix: Now aggregates all messages from all chat files in each branch
- Result: Accurate total message counts and complete conversation analysis
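The aggregation fix above can be sketched with a recursive glob over each branch checkout. The helper name is illustrative, and the sketch assumes each file parses to a standard JSON array of messages (other formats would be normalized first):

```python
import json
from pathlib import Path

def aggregate_branch_messages(branch_dir):
    """Collect messages from every chat.json under a branch checkout.
    Assumes each file parses to a list of {"role", "content"} dicts."""
    messages = []
    for chat_file in sorted(Path(branch_dir).rglob("chat.json")):
        try:
            data = json.loads(chat_file.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, OSError):
            continue  # skip unreadable files rather than failing the branch
        if isinstance(data, list):
            messages.extend(data)
    return messages
```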
- Issue: Could not parse GitHub Copilot chat.json format
- Fix: Added `_parse_github_copilot_format()` method for proper extraction
- Result: Supports both standard JSON arrays and GitHub Copilot request/response format
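The idea behind `_parse_github_copilot_format()` can be sketched as flattening the request/response pairs into role/content messages. The inner field names used here (`"text"` on the request message, `"value"` on the response parts) are assumptions about the export layout, not confirmed by the source:

```python
def parse_github_copilot_messages(data):
    """Flatten a GitHub Copilot chat export ({"requests": [...]}) into a
    list of {"role", "content"} messages. Inner field names are assumed."""
    messages = []
    for request in data.get("requests", []):
        user_text = request.get("message", {}).get("text", "")
        if user_text:
            messages.append({"role": "user", "content": user_text})
        parts = request.get("response", [])
        reply = "".join(p.get("value", "") for p in parts if isinstance(p, dict))
        if reply:
            messages.append({"role": "assistant", "content": reply})
    return messages
```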
- Issue: Analysis sheets showed 0 or N/A values due to structure mismatches
- Fix: Added fallback logic for both nested and flat response structures
- Result: Reliable data extraction regardless of AI response format
- Issue: Silent failures with no debugging information
- Fix: Added comprehensive debug logging and error reporting
- Result: Better troubleshooting and issue identification
- Issue: Analysis sheets showed all zeros due to AI analysis failures
- Fix: Replaced AI-dependent columns with data-driven metrics calculated from chat content
- Result: Rich, meaningful data showing conversation patterns, technical engagement, and sentiment analysis
- Conversation Metrics: Total messages, user/assistant ratios, message density
- Technical Analysis: Technical question counts, response patterns, development focus
- Message Characteristics: Average message lengths, response variety, conversation flow
- Sentiment Analysis: Keyword-based analysis of urgent, focused, stressed, confident, frustrated indicators
- Conversation Quality: Question/exclamation counts, sentiment diversity, interaction patterns
- AI Response Analysis: Encouraging, helpful, and technical response pattern detection
Configure repositories in `configs/repositories.yaml`:

```yaml
repositories:
  streamlit:
    - name: "bhi-streamlit-a"
      url: "https://github.com/org/repo.git"
      language: "streamlit"
      framework: "streamlit"
  react_node:
    - name: "bhi-react-node-a"
      url: "https://github.com/org/repo.git"
      language: "javascript"
      framework: "react-node"
  react_fastapi:
    - name: "bhi-react-fastapi-a"
      url: "https://github.com/org/repo.git"
      language: "python"
      framework: "react-fastapi"
```
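Once the YAML is loaded (e.g. with `yaml.safe_load`), iterating every repository across the framework groups reduces to a nested loop. This helper is illustrative, not the tool's actual API:

```python
def iter_repositories(config):
    """Yield (group, name, url) for each repository entry in the parsed
    repositories.yaml structure shown above."""
    for group, repos in config.get("repositories", {}).items():
        for repo in repos:
            yield group, repo["name"], repo["url"]
```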
- Symptom: Branches showing only 2 total messages
- Cause: Chat files in unsupported format or parsing errors
- Solution: Check whether the chat files are in a supported format (JSON array, GitHub Copilot, or text)
- Symptom: Some columns in analysis sheets show empty values
- Cause: AI analysis response structure mismatch (for AI-dependent columns only)
- Solution: Most columns now use data-driven calculations; check debug output for remaining AI-dependent columns
- Symptom: "Connection error" in analysis results
- Cause: Invalid or missing Anthropic API key
- Solution: Verify the `ANTHROPIC_API_KEY` environment variable is set correctly
The tool now provides detailed debug output including:
- Chat file discovery results
- Message parsing statistics
- Analysis response structure information
- Error details for failed branches
These columns are calculated directly from chat content and are always populated:
- Message Counts: Total, user, and assistant message counts
- Conversation Ratios: Assistant responses per user message
- Message Characteristics: Average lengths, density, technical keywords
- Sentiment Indicators: Keyword-based analysis of emotional language
- Response Patterns: Quick vs detailed response analysis
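The always-populated columns above can be computed with plain counting over the aggregated messages. A minimal sketch; the sentiment keyword list is illustrative, not the tool's actual list:

```python
def compute_basic_metrics(messages):
    """Compute data-driven metrics from a list of {"role", "content"}
    messages. The urgent-keyword list is illustrative only."""
    user = [m for m in messages if m.get("role") == "user"]
    assistant = [m for m in messages if m.get("role") == "assistant"]
    urgent_words = ("quick", "asap", "hurry", "fast", "urgent")
    all_text = " ".join(m.get("content", "").lower() for m in messages)
    return {
        "total_messages": len(messages),
        "user_messages": len(user),
        "assistant_messages": len(assistant),
        # conversation ratio: assistant responses per user message
        "assistant_per_user": len(assistant) / len(user) if user else 0.0,
        "avg_user_length": sum(len(m.get("content", "")) for m in user) / len(user) if user else 0.0,
        # keyword-based sentiment indicator (counts occurrences, not semantics)
        "urgent_keyword_hits": sum(all_text.count(w) for w in urgent_words),
    }
```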
These columns depend on AI analysis and may show N/A if analysis fails:
- Development Flow: High-level conversation flow assessment
- Time Efficiency: Overall efficiency rating
- Strategic Thinking: Strategic vs tactical thinking assessment
- Overall Sentiment: AI-generated sentiment analysis
- Stress Resilience: AI assessment of stress handling
- Efficiency Mindset: AI assessment of efficiency focus
- Standard JSON Array: `[{"role": "user", "content": "..."}, ...]`
- GitHub Copilot Format: `{"requests": [{"message": {...}, "response": [...]}]}`
- Object with Messages: `{"messages": [...]}`
- Text Format: Plain text with role indicators
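Dispatching on these formats can be done with simple shape checks on the parsed payload. A hedged sketch; the tool's internal labels may differ:

```python
def detect_chat_format(data):
    """Classify a parsed chat payload into one of the supported shapes.
    Label names are illustrative."""
    if isinstance(data, list):
        return "json_array"
    if isinstance(data, dict):
        if "requests" in data:
            return "github_copilot"
        if "messages" in data:
            return "object_with_messages"
    if isinstance(data, str):
        return "text"
    return "unknown"
```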
- Message Aggregation: All chat files in a branch are now processed together
- Memory Usage: Large conversations may require more memory
- API Calls: Three analysis calls per branch (Prompt Categorization, Interaction Dynamics, Emotion & Tone)
- Processing Time: Depends on conversation length and API response times
- Data-Driven Analysis: Most metrics are calculated locally from chat content, reducing API dependency
- Reliability: Data-driven columns provide consistent results regardless of AI analysis success