# ArGen GRPO Fine-Tuning Evaluation

This notebook evaluates a GRPO fine-tuned model against its base model. Both are scored with reward functions for three Dharmic ethical principles: Ahimsa (non-harm), Satya (truthfulness), and Dharma (duty).

## Setup

First, let's install the required packages and import the necessary modules.

In [None]:
# Install required packages
!pip install -e ..
!pip install predibase matplotlib pandas

In [None]:
# Import required modules
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
import predibase as pb
from src.reward_functions import ahimsa_reward, satya_reward, dharma_reward
from src.data_utils.dataset import load_jsonl_dataset

## Authentication

Authenticate with Predibase using your API key.

In [None]:
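# Read the Predibase API key from the environment, e.g. after running
#   export PREDIBASE_API_KEY=your_api_key
# setdefault keeps a key that is already exported and only falls back to the
# placeholder, which you must replace before running. Avoid committing a real key.
os.environ.setdefault("PREDIBASE_API_KEY", "your_api_key_here")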

# Initialize the Predibase client
pb.login()
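
Running the cell with the placeholder key still in place would only fail later, at the first API call. A minimal sketch of an early check follows; `require_api_key` is a hypothetical helper, not part of Predibase or this project.

```python
import os


def require_api_key(env_var: str = "PREDIBASE_API_KEY") -> str:
    """Return the key stored in env_var, or raise if it is missing or a placeholder.

    Hypothetical helper for illustration; not a Predibase function.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError(f"Set {env_var} to a real API key before running this notebook.")
    return key
```

Calling `require_api_key()` before the login step turns a forgotten placeholder into an immediate, readable error.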

## Load Test Cases

Load the test cases for evaluation.

In [None]:
# Load the test cases
test_cases_path = "../data/healthcare_examples.jsonl"
test_cases = load_jsonl_dataset(test_cases_path)

# Display a sample
test_cases[0]
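
The implementation of `load_jsonl_dataset` is not shown in this notebook. As a rough sketch of what such a loader typically does, the hypothetical `read_jsonl` below parses one JSON object per line. It checks only for a `prompt` field, because that is the only key the evaluation loop in this notebook reads directly. Any other fields are passed through untouched.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse one JSON object per non-empty line (hypothetical stand-in loader)."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if "prompt" not in record:
                raise ValueError(f"{path}:{line_number} has no 'prompt' field")
            records.append(record)
    return records
```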

## Set Up Models

Set up the base model and the fine-tuned model for evaluation.

In [None]:
# Set the model IDs
base_model_id = "microsoft/phi-3-mini-4k-instruct"  # Base model
ft_model_id = "your_fine_tuned_model_id_here"  # Fine-tuned model (from previous notebook)

# Create deployments
base_deployment = pb.deployments.create(
    name="phi-3-mini-base-eval",
    model=base_model_id,
    description="Base Phi-3 Mini model for evaluation"
)

ft_deployment = pb.deployments.create(
    name="argen-healthcare-eval",
    adapter=ft_model_id,
    description="ArGen healthcare model for evaluation"
)

## Define Evaluation Function

Define a function to evaluate the models using the Dharmic reward functions.

In [None]:
def evaluate_model(deployment, test_cases, num_samples=10):
    """
    Evaluate a model using the Dharmic reward functions.
    
    Args:
        deployment: Predibase deployment
        test_cases: List of test cases
        num_samples: Number of test cases to evaluate
        
    Returns:
        DataFrame with evaluation results
    """
    results = []
    
    # Limit the number of test cases if needed
    if num_samples < len(test_cases):
        test_cases = test_cases[:num_samples]
    
    for i, test_case in enumerate(test_cases):
        prompt = test_case["prompt"]
        
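        # Generate a response. temperature=0.7 samples stochastically, so scores
        # vary between runs; lower it for a more repeatable comparison.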
        response = deployment.generate(
            prompt=prompt,
            max_tokens=500,
            temperature=0.7
        )
        
        # Calculate rewards
        ahimsa_score = ahimsa_reward(prompt, response.text, test_case)
        satya_score = satya_reward(prompt, response.text, test_case)
        dharma_score = dharma_reward(prompt, response.text, test_case)
        total_score = (ahimsa_score + satya_score + dharma_score) / 3
        
        # Store results
        results.append({
            "prompt": prompt,
            "response": response.text,
            "ahimsa": ahimsa_score,
            "satya": satya_score,
            "dharma": dharma_score,
            "total": total_score
        })
        
        # Print progress
        print(f"Evaluated {i+1}/{len(test_cases)} test cases")
    
    return pd.DataFrame(results)
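
Before spending API calls, it can help to dry-run the scoring loop offline. The sketch below is self-contained and makes several assumptions: `FakeDeployment`, `stub_reward`, and `dry_run` are hypothetical names, the fake mimics only the `generate(...)` call shape and `.text` attribute used in the function above, and the stub reward always returns 0.5.

```python
from types import SimpleNamespace


class FakeDeployment:
    """Hypothetical offline stand-in: mimics only the generate(...) call used above."""

    def generate(self, prompt, max_tokens=500, temperature=0.7):
        # Return an object exposing .text, like the response object used above.
        return SimpleNamespace(text=f"[canned reply to: {prompt}]")


def stub_reward(prompt, response_text, test_case):
    """Placeholder with the same three-argument call shape; always returns 0.5."""
    return 0.5


def dry_run(deployment, test_cases, reward_fns):
    """Score each test case with every reward function and average the scores."""
    rows = []
    for test_case in test_cases:
        prompt = test_case["prompt"]
        text = deployment.generate(prompt=prompt, max_tokens=500, temperature=0.7).text
        scores = {name: fn(prompt, text, test_case) for name, fn in reward_fns.items()}
        # Unweighted mean over the individual rewards, as in evaluate_model.
        scores["total"] = sum(scores.values()) / len(reward_fns)
        rows.append({"prompt": prompt, "response": text, **scores})
    return rows


rows = dry_run(
    FakeDeployment(),
    [{"prompt": "What should I do about a mild headache?"}],
    {"ahimsa": stub_reward, "satya": stub_reward, "dharma": stub_reward},
)
print(rows[0]["total"])  # 0.5
```

The same `FakeDeployment` object could be passed to `evaluate_model` itself to check the plumbing with the real reward functions, provided they need nothing beyond the prompt, the response text, and the test case.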

## Evaluate Models

Evaluate both the base model and the fine-tuned model.

In [None]:
# Evaluate the base model
print("Evaluating base model...")
base_results = evaluate_model(base_deployment, test_cases, num_samples=5)

# Evaluate the fine-tuned model
print("\nEvaluating fine-tuned model...")
ft_results = evaluate_model(ft_deployment, test_cases, num_samples=5)

## Compare Results

Compare the results of the base model and the fine-tuned model.

In [None]:
# Calculate average scores
base_avg = base_results[["ahimsa", "satya", "dharma", "total"]].mean()
ft_avg = ft_results[["ahimsa", "satya", "dharma", "total"]].mean()

# Display average scores
print("Base Model Average Scores:")
for key, value in base_avg.items():
    print(f"  {key}: {value:.4f}")

print("\nFine-tuned Model Average Scores:")
for key, value in ft_avg.items():
    print(f"  {key}: {value:.4f}")

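# Relative improvement over the base model, in percent.
# A base average of 0 yields inf or NaN, and a negative base average flips the sign.
improvement = (ft_avg - base_avg) / base_avg * 100

print("\nImprovement (%):")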
for key, value in improvement.items():
    print(f"  {key}: {value:.2f}%")
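
Averages alone hide how consistent the change is. Because both models answer the same prompts in the same order, the scores can also be compared pair by pair. The sketch below assumes that row `i` in both tables refers to the same test case; `paired_summary` is a hypothetical helper and the numbers are made up for illustration.

```python
import pandas as pd


def paired_summary(base_df, ft_df, metrics=("ahimsa", "satya", "dharma", "total")):
    """Per-metric mean difference and win rate, pairing rows by position."""
    if len(base_df) != len(ft_df):
        raise ValueError("Both result tables must cover the same test cases")
    cols = list(metrics)
    diff = ft_df[cols].reset_index(drop=True) - base_df[cols].reset_index(drop=True)
    return pd.DataFrame({
        "mean_diff": diff.mean(),       # average fine-tuned minus base score
        "win_rate": (diff > 0).mean(),  # share of prompts where fine-tuned scored higher
    })


# Illustrative numbers only, not results from any real run.
demo_base = pd.DataFrame({"ahimsa": [0.5, 0.5], "satya": [0.25, 0.5]})
demo_ft = pd.DataFrame({"ahimsa": [0.75, 0.25], "satya": [0.5, 0.75]})
print(paired_summary(demo_base, demo_ft, metrics=["ahimsa", "satya"]))
```

In the made-up data, Ahimsa has a mean difference of 0 with a 50% win rate, while Satya improves on every prompt. The averages alone would not show that distinction. With the notebook's own tables the call would be `paired_summary(base_results, ft_results)`.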

## Visualize Results

Visualize the comparison between the base model and the fine-tuned model.

In [None]:
# Set up the figure
fig, ax = plt.subplots(figsize=(10, 6))

# Set up the data
categories = ["Ahimsa", "Satya", "Dharma", "Total"]
base_scores = base_avg.values
ft_scores = ft_avg.values

# Set up the bar positions
x = range(len(categories))
width = 0.35

# Create the bars
ax.bar([i - width/2 for i in x], base_scores, width, label="Base Model")
ax.bar([i + width/2 for i in x], ft_scores, width, label="Fine-tuned Model")

# Add labels and title
ax.set_ylabel("Score")
ax.set_title("Comparison of Base Model and Fine-tuned Model")
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.legend()

# Set y-axis limits (assumes scores lie in [0, 1]; the extra headroom keeps
# value labels visible for bars near 1)
ax.set_ylim(0, 1.1)

# Add value labels on top of bars
for i, v in enumerate(base_scores):
    ax.text(i - width/2, v + 0.02, f"{v:.2f}", ha="center")
    
for i, v in enumerate(ft_scores):
    ax.text(i + width/2, v + 0.02, f"{v:.2f}", ha="center")

plt.tight_layout()
plt.show()

## Detailed Comparison

Compare the responses of the base model and the fine-tuned model for a specific test case.

In [None]:
# Select a test case
test_case_idx = 0  # Change this to compare different test cases

# Get the prompt and responses
prompt = base_results.iloc[test_case_idx]["prompt"]
base_response = base_results.iloc[test_case_idx]["response"]
ft_response = ft_results.iloc[test_case_idx]["response"]

# Get the scores
base_scores = base_results.iloc[test_case_idx][["ahimsa", "satya", "dharma", "total"]]
ft_scores = ft_results.iloc[test_case_idx][["ahimsa", "satya", "dharma", "total"]]

# Display the comparison
print(f"Prompt: {prompt}\n")

print("Base Model Response:")
print(f"{base_response}\n")
print("Base Model Scores:")
for key, value in base_scores.items():
    print(f"  {key}: {value:.4f}")

print("\nFine-tuned Model Response:")
print(f"{ft_response}\n")
print("Fine-tuned Model Scores:")
for key, value in ft_scores.items():
    print(f"  {key}: {value:.4f}")
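
Rather than picking `test_case_idx` by hand, one option is to inspect the test case where the two models disagree most. This is a sketch under the assumption that both result tables list the same test cases in the same order; `largest_gap_index` is a hypothetical helper.

```python
import pandas as pd


def largest_gap_index(base_df, ft_df, metric="total"):
    """Return the row position where the two models' scores differ most."""
    base = base_df[metric].reset_index(drop=True)
    ft = ft_df[metric].reset_index(drop=True)
    gap = (ft - base).abs()  # absolute gap, so regressions are found too
    return int(gap.idxmax())


# Illustrative numbers only.
demo_base = pd.DataFrame({"total": [0.5, 0.25, 0.75]})
demo_ft = pd.DataFrame({"total": [0.5, 0.75, 0.5]})
print(largest_gap_index(demo_base, demo_ft))  # 1
```

With the notebook's tables, `test_case_idx = largest_gap_index(base_results, ft_results)` would select the most divergent case for the printout above.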

## Save Results

Save the evaluation results for future reference.

In [None]:
# Save the results
base_results.to_csv("../data/base_model_evaluation.csv", index=False)
ft_results.to_csv("../data/fine_tuned_model_evaluation.csv", index=False)

# Save the average scores
avg_scores = pd.DataFrame({
    "base_model": base_avg,
    "fine_tuned_model": ft_avg,
    "improvement_percent": improvement
})

avg_scores.to_csv("../data/average_scores.csv")

print("Results saved to data directory.")
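
Note that the per-example tables are written with `index=False`, while `average_scores.csv` keeps its index because the index holds the metric names. A small round-trip sketch of reading such a file back follows. It writes to a temporary directory rather than `../data`, and the values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative averages indexed by metric name, mirroring the shape of avg_scores.
avg = pd.DataFrame(
    {"base_model": [0.5, 0.25], "fine_tuned_model": [0.75, 0.5]},
    index=["ahimsa", "satya"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "average_scores.csv"
    avg.to_csv(path)  # the index (metric names) becomes the first CSV column
    reloaded = pd.read_csv(path, index_col=0)  # restore metric names as the index

print(reloaded.loc["satya", "fine_tuned_model"])
```

Without `index_col=0`, the metric names would come back as an ordinary unnamed column instead of the index.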

## Conclusion

In this notebook, we evaluated the fine-tuned model against the base model using the Dharmic ethical principles (Ahimsa, Satya, Dharma). We showed how to:

1. Set up the base model and the fine-tuned model for evaluation
2. Define an evaluation function using the Dharmic reward functions
3. Evaluate both models on a set of test cases
4. Compare and visualize the results
5. Perform a detailed comparison of specific test cases
6. Save the evaluation results for future reference

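If the fine-tuned model scores higher than the base model on these rewards, that suggests GRPO fine-tuning improved alignment with the Dharmic principles in this healthcare setting. With only five test cases per model and sampling at temperature 0.7, treat the comparison as indicative, and re-run it on the full test set before drawing conclusions.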