[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-username/llm-finetuning-lora/blob/main/notebooks/3_evaluate_model.ipynb)

# 📊 Evaluating Fine-tuned CodeLlama on HumanEval

This notebook benchmarks your fine-tuned **CodeLlama-7b-Instruct** model on **HumanEval**, a widely used benchmark for evaluating code generation.

## 🎯 What is HumanEval?

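HumanEval is a benchmark of 164 handwritten programming problems that measures the functional correctness of generated code: a solution counts as correct only if it passes the problem's unit tests. Each problem gives a function signature and docstring to complete, so the benchmark exercises code completion and basic algorithm implementation. Results are reported as pass@k:

- **Pass@1**: Percentage of problems solved by a single generated sample
- **Pass@5**: Percentage of problems where at least one of 5 generated samples is correct

This notebook reports Pass@1.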
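
The pass@k metrics above are commonly computed with the unbiased estimator introduced alongside HumanEval: generate `n` samples per problem, count the `c` that pass, and estimate `1 - C(n-c, k) / C(n, k)`. The sketch below is a minimal illustration; the function name `pass_at_k` and the sample counts are made up for the example and are not taken from this repository's evaluator.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples generated, c: samples that passed the tests, k: attempt budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical pass counts for four problems, 10 samples each
passes = [10, 5, 0, 2]
pass_at_1 = sum(pass_at_k(10, c, 1) for c in passes) / len(passes)
pass_at_5 = sum(pass_at_k(10, c, 5) for c in passes) / len(passes)
print(f"pass@1 = {pass_at_1:.3f}, pass@5 = {pass_at_5:.3f}")
```

With `n = 1` and `k = 1` the estimate is simply 1 for a passing sample and 0 otherwise, so averaging over problems gives the fraction solved.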

## 🔧 Setup

First, let's install the required packages for evaluation.


In [None]:
%pip install -q torch transformers peft bitsandbytes accelerate datasets evaluate


### 📦 Import Dependencies

Let's import our evaluation module and other required libraries.


In [None]:
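import sys
# Put the repo root first so the local `evaluate/` package takes precedence
# over the pip-installed Hugging Face `evaluate` library of the same name
sys.path.insert(0, '.')

from evaluate.evaluate_model import ModelEvaluator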
import torch
import json
from datetime import datetime

# Check GPU availability
print(f"🚀 CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")


## 🤖 Initialize Model Evaluator

Now let's initialize our model evaluator with the fine-tuned CodeLlama model. This will load both the base model and your trained LoRA adapter.


In [None]:
# 🚀 Initialize the model evaluator
print("🤖 Initializing model evaluator...")

evaluator = ModelEvaluator(
    base_model_name="codellama/CodeLlama-7b-Instruct-hf",
    adapter_path="outputs/checkpoints",  # Path to your fine-tuned LoRA adapter
    device="auto",
    load_8bit=True,  # Use 8-bit for evaluation efficiency
    output_dir="outputs/evaluation"
)

print("✅ Model evaluator initialized successfully!")
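
For orientation, loading a base model in 8-bit and attaching a LoRA adapter is typically done with `transformers` and `peft` roughly as sketched below. This is an assumption about what a class like `ModelEvaluator` does internally, not its actual implementation; the helper name `load_model_with_adapter` is hypothetical.

```python
def load_model_with_adapter(base_model_name: str, adapter_path: str, load_8bit: bool = True):
    """Hypothetical helper: load a base causal LM and attach a LoRA adapter."""
    # Imported lazily so that defining the helper does not require the heavy libraries
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import PeftModel

    # 8-bit weights via bitsandbytes; None keeps the model's default precision
    quant_config = BitsAndBytesConfig(load_in_8bit=True) if load_8bit else None

    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place layers on the available devices
    )

    # Wrap the frozen base model with the trained LoRA weights
    model = PeftModel.from_pretrained(base_model, adapter_path)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer
```

Calling it with the same values used above (`"codellama/CodeLlama-7b-Instruct-hf"` and `"outputs/checkpoints"`) would download the base weights, so it is left uncalled here.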


## 🧪 Quick Test Generation

Before running the full evaluation, let's test the model on a simple HumanEval-style problem to ensure everything is working correctly.


In [None]:
# 🧪 Test with a sample HumanEval-style problem
test_prompt = """from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    \"\"\"
"""

print("🧪 Testing with sample problem:")
print("=" * 50)
print(test_prompt)
print("=" * 50)

# Generate solution
test_solution = evaluator.generate_code(
    prompt=test_prompt,
    max_new_tokens=200,
    temperature=0.1
)

print("\n🎯 Generated Solution:")
print("=" * 50)
print(test_solution)
print("=" * 50)
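
Reading the output is a useful first check, but you can also run it against the two docstring examples. The sketch below assumes the generated text is only the function body (a continuation of the prompt); if `generate_code` returns the full function, pass an empty prompt instead. The `completion` string is a hand-written stand-in for model output, and `passes_examples` is a hypothetical helper. Note that `exec` runs arbitrary code, so only do this in a disposable environment such as Colab.

```python
# Prompt in HumanEval style, including the import the signature relies on
prompt = '''from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    """
'''

# Hand-written stand-in for a model completion (function body only)
completion = '''    ordered = sorted(numbers)
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))
'''


def passes_examples(prompt: str, completion: str) -> bool:
    """Execute prompt + completion and check the two docstring examples."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # WARNING: runs untrusted code
        fn = namespace["has_close_elements"]
        return (
            fn([1.0, 2.0, 3.0], 0.5) is False
            and fn([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
        )
    except Exception:
        # Syntax errors, missing names, and runtime errors all count as failure
        return False


print("Passes docstring examples:", passes_examples(prompt, completion))
```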


## 📊 Run Full HumanEval Evaluation

Now let's run the complete HumanEval evaluation. This will test your model on all 164 problems and calculate the Pass@1 score for each temperature setting.

⚠️ **Note**: This will take 15-30 minutes depending on your GPU. The evaluation processes the problems one at a time.
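
To make "functional correctness" concrete: each HumanEval record carries a `prompt`, a `test` string that defines `check(candidate)`, and an `entry_point` naming the function under test. A completion passes if the assembled program runs without raising. The sketch below shows one way to perform that check in a subprocess with a timeout; it is an illustration, not necessarily what `evaluate_humaneval` does, and `run_humaneval_check` and the toy problem are hypothetical. A subprocess is not a sandbox, so treat generated code as untrusted.

```python
import subprocess
import sys


def run_humaneval_check(prompt: str, completion: str, test: str,
                        entry_point: str, timeout: float = 10.0) -> bool:
    """Return True if prompt + completion passes the problem's check() function."""
    # Assemble: function under test, then the test harness, then the call
    program = f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Infinite loops and very slow solutions count as failures
        return False
    # A failed assert or any other exception gives a non-zero exit code
    return proc.returncode == 0


# Toy problem shaped like a HumanEval record (not from the real dataset)
toy = {
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n',
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}

good = run_humaneval_check(toy["prompt"], "    return a + b\n", toy["test"], toy["entry_point"])
bad = run_humaneval_check(toy["prompt"], "    return a - b\n", toy["test"], toy["entry_point"])
print(f"correct completion passes: {good}, wrong completion passes: {bad}")
```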


In [None]:
# 📊 Run the full HumanEval evaluation
print("📊 Starting HumanEval evaluation...")
print("⏳ This will take 15-30 minutes. Please be patient!")

start_time = datetime.now()

# Run evaluation with different temperature settings
evaluation_configs = [
    {"temperature": 0.1, "name": "Conservative (T=0.1)"},
    {"temperature": 0.2, "name": "Balanced (T=0.2)"},
]

all_results = []

for config in evaluation_configs:
    print(f"\n🔥 Running evaluation with {config['name']}")
    
    results = evaluator.evaluate_humaneval(
        temperature=config["temperature"],
        max_new_tokens=256
    )
    
    results["config_name"] = config["name"] 
    all_results.append(results)
    
    # Print intermediate results
    evaluator.print_evaluation_summary(results)

end_time = datetime.now()
print(f"\n⏰ Total evaluation time: {end_time - start_time}")
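
The two configurations above differ only in sampling temperature. As a rough illustration of what that parameter does, the sketch below divides a set of made-up logits by the temperature before applying softmax: lower values concentrate probability on the top token, which makes generation closer to greedy decoding. The logits and the function name are assumptions for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate next tokens
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.2, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```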


## 📈 Results Analysis

Let's analyze the results and compare different configuration settings.


In [None]:
# 📈 Compare results across different configurations
print("📈 COMPARATIVE RESULTS ANALYSIS")
print("=" * 70)

for i, results in enumerate(all_results):
    print(f"\n{i+1}. {results['config_name']}")
    print(f"   Pass@1 Score: {results['pass_at_1']:.3f} ({results['pass_at_1']*100:.1f}%)")
    print(f"   Solved: {results['correct_solutions']}/{results['total_problems']}")

# Find best configuration
best_result = max(all_results, key=lambda x: x['pass_at_1'])
print(f"\n🏆 Best Configuration: {best_result['config_name']}")
print(f"🎯 Best Pass@1 Score: {best_result['pass_at_1']:.3f}")

# Show some example successful solutions
print(f"\n📝 Example Successful Solutions:")
print("=" * 50)

successful_examples = [
    example for example in best_result['detailed_results'] 
    if example['is_correct']
][:3]  # Show first 3 successful solutions

for i, example in enumerate(successful_examples, 1):
    code = example['generated_code']
    # Truncate long solutions to keep the output readable
    preview = code[:200] + "..." if len(code) > 200 else code
    print(f"\n✅ Example {i}: {example['task_id']}")
    print("Generated Code:")
    print(preview)
    print("-" * 30)
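
Beyond the headline score, it can help to see which problems one configuration solves and the other does not. The sketch below relies only on the `detailed_results`, `task_id`, and `is_correct` keys used above; the `compare_runs` helper and the toy data are hypothetical.

```python
def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Split task IDs by which of two evaluation runs solved them."""
    solved_a = {r["task_id"] for r in run_a["detailed_results"] if r["is_correct"]}
    solved_b = {r["task_id"] for r in run_b["detailed_results"] if r["is_correct"]}
    return {
        "both": sorted(solved_a & solved_b),
        "only_a": sorted(solved_a - solved_b),
        "only_b": sorted(solved_b - solved_a),
    }


# Toy results standing in for two entries of `all_results`
run_low = {"detailed_results": [
    {"task_id": "HumanEval/0", "is_correct": True},
    {"task_id": "HumanEval/1", "is_correct": False},
    {"task_id": "HumanEval/2", "is_correct": True},
]}
run_high = {"detailed_results": [
    {"task_id": "HumanEval/0", "is_correct": True},
    {"task_id": "HumanEval/1", "is_correct": True},
    {"task_id": "HumanEval/2", "is_correct": False},
]}

diff = compare_runs(run_low, run_high)
for bucket, task_ids in diff.items():
    print(f"{bucket}: {task_ids}")
```

In the notebook you would call `compare_runs(all_results[0], all_results[1])` once the evaluation has finished.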


## 💾 Save Results

The evaluation results are automatically saved, but let's also create a summary report.


In [None]:
# 💾 Create and save summary report
summary_report = {
    "evaluation_date": datetime.now().isoformat(),
    "model_info": {
        "base_model": "codellama/CodeLlama-7b-Instruct-hf",
        "adapter_path": "outputs/checkpoints",
        "fine_tuning_dataset": "CodeAlpaca-20k"
    },
    "evaluation_summary": {
        "total_problems": best_result['total_problems'],
        "configurations_tested": len(all_results),
        "best_pass_at_1": best_result['pass_at_1'],
        "best_configuration": best_result['config_name']
    },
    "detailed_results": all_results
}

# Save summary report
summary_file = f"outputs/evaluation/summary_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
with open(summary_file, 'w') as f:
    json.dump(summary_report, f, indent=2)

print(f"📋 Summary report saved to: {summary_file}")

# Print final summary
print("\n🎯 FINAL EVALUATION SUMMARY")
print("=" * 50)
print(f"📅 Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("🤖 Model: CodeLlama-7b-Instruct (Fine-tuned)")
print("📚 Fine-tuning dataset: CodeAlpaca-20k")
print(f"🏆 Best Pass@1: {best_result['pass_at_1']:.3f} ({best_result['pass_at_1']*100:.1f}%)")
print(f"✅ Problems Solved: {best_result['correct_solutions']}/{best_result['total_problems']}")
print("=" * 50)

print(f"\n💡 Results files saved in: outputs/evaluation/")
print(f"📋 Summary report: {summary_file}")
print(f"📊 Detailed results: {best_result.get('results_file', 'humaneval_results_*.json')}")
