# 🧪 Prompting Sandbox
Compare how different LLMs (Gemini, GPT-4, Hugging Face Transformers) respond to the same prompt.

- Designed for **in-session demos**
- Fully modular and editable
- Logs outputs for structured evaluation

**Models included**:
- OpenAI (gpt-4, gpt-3.5-turbo)
- Gemini Pro (via Google AI Studio)
- Hugging Face Transformers (e.g., Mistral, Zephyr, LLaMA, etc.)

🔒 *This notebook does not store or send keys. Use environment variables or a secret manager.*


In [None]:
# 🔧 Setup Cell — Load APIs and secrets
import os

# Example: Use environment variables (recommended)
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")  # placeholder
HF_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")

# You can also use a .env file via dotenv
# from dotenv import load_dotenv
# load_dotenv()
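
Before running the model cells, it can help to confirm that the expected keys are actually set. The sketch below is one way to do that, assuming the same three environment variable names used in the setup cell; `REQUIRED_KEYS` and `missing_keys` are hypothetical names introduced here for illustration.

```python
import os

# Hypothetical list: the environment variable names read in the setup cell.
REQUIRED_KEYS = ["OPENAI_API_KEY", "GEMINI_API_KEY", "HUGGINGFACEHUB_API_TOKEN"]


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    # Only the names are printed, never the key values themselves.
    print("Missing keys (the matching model cells will fail):", ", ".join(missing))
else:
    print("All expected keys are set.")
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.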


In [None]:
# 📝 Define Prompt
task_description = """Summarize the main health risks from the policy text below in 3 bullet points.
Use plain language suitable for a community health workshop."""

context = """The national report on rural health identifies increased risk of untreated diabetes, mental health isolation, and reduced access to vaccination."""

full_prompt = f"{task_description}\n\n{context}"
print(full_prompt)


In [None]:
# 🔁 GPT-4 / GPT-3.5 Completion
# Uses the client interface from openai>=1.0; `openai.ChatCompletion` was removed in that release.
import openai

client = openai.OpenAI(api_key=OPENAI_API_KEY)

response = client.chat.completions.create(
    model="gpt-4",  # swap in "gpt-3.5-turbo" to compare
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": full_prompt}
    ],
    temperature=0.7
)

print("GPT-4 Output:")
print(response.choices[0].message.content)


In [None]:
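# 🤗 Hugging Face Model via Transformers
from transformers import pipeline

# The task is "text-generation", so the pipeline is named accordingly.
generator = pipeline("text-generation", model="tiiuae/falcon-7b-instruct", token=HF_API_TOKEN)

# return_full_text=False drops the echoed prompt so only the completion is printed.
output = generator(full_prompt, max_new_tokens=200, return_full_text=False)[0]['generated_text']
print("Hugging Face Output:")
print(output)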


In [None]:
# 🔮 Gemini Pro via Google AI Studio — Placeholder
# This will depend on your preferred client interface (e.g., langchain or direct API)
# Example:
# from langchain_google_genai import ChatGoogleGenerativeAI
# model = ChatGoogleGenerativeAI(model="gemini-pro", google_api_key=GEMINI_API_KEY)

print("Gemini API integration requires manual setup.")


## 📊 Compare Outputs
Copy-paste responses here or use your own structured evaluation rubric.

Use this to compare:
- Clarity
- Relevance
- Format compliance
- Reasoning quality
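
Instead of copy-pasting, responses can be collected into a list of records and written out for later comparison. The following is a minimal sketch under stated assumptions: `log_output`, `save_jsonl`, the record fields, and the file name are all hypothetical choices, not part of the notebook.

```python
import json
from datetime import datetime, timezone


def log_output(records, model, prompt, output, **settings):
    """Append one model response to `records` and return the new entry."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "output": output,
        "settings": settings,  # e.g. temperature, max_tokens
    }
    records.append(entry)
    return entry


def save_jsonl(records, path):
    """Write one JSON object per line so the log is easy to append to and diff."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in records:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


records = []
# Placeholder strings; in the notebook these would be full_prompt and a model response.
log_output(records, "gpt-4", "example prompt", "example output", temperature=0.7)
# save_jsonl(records, "sandbox_outputs.jsonl")  # hypothetical file name
```

Keeping the settings next to each output makes it possible to score responses later without rerunning the models.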


## 📏 Evaluation Rubric

Use this rubric to manually assess each model's output on a 1–5 scale.

| Criterion         | Description                                 | Scale |
|------------------|---------------------------------------------|-------|
| Relevance        | Does the output address the task directly?  | 1–5   |
| Clarity          | Is the response easy to understand?         | 1–5   |
| Format Accuracy  | Does the output match the required format?  | 1–5   |
| Reasoning Logic  | Are examples or structure logical/coherent? | 1–5   |


In [None]:
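# 🔍 Basic Manual Scoring Framework
def ask_score(label):
    """Prompt until the user enters a whole number from 1 to 5."""
    while True:
        try:
            value = int(input(f"{label} (1-5): "))
        except ValueError:
            value = 0  # non-numeric input falls through to the retry message
        if 1 <= value <= 5:
            return value
        print("Please enter a whole number from 1 to 5.")

def rubric_score(output, reference=None):
    """Show the output (and optional reference), then collect rubric scores."""
    print("Output to rate:\n", output)
    if reference is not None:
        print("Reference:\n", reference)
    return {
        "relevance": ask_score("Relevance"),
        "clarity": ask_score("Clarity"),
        "format": ask_score("Format Accuracy"),
        "reasoning": ask_score("Reasoning Logic")
    }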
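
Once each model has a score dictionary from `rubric_score`, the results can be averaged for a quick side-by-side view. This is a rough sketch: `summarize_scores` is a hypothetical helper, the unweighted mean is an assumption about how criteria should be combined, and the numbers are illustrative only, not real evaluation results.

```python
from statistics import mean


def summarize_scores(scores_by_model):
    """Map each model name to the unweighted mean of its rubric scores."""
    return {model: mean(scores.values()) for model, scores in scores_by_model.items()}


# Illustrative scores only; replace with dictionaries returned by rubric_score.
example = {
    "model_a": {"relevance": 4, "clarity": 5, "format": 3, "reasoning": 4},
    "model_b": {"relevance": 3, "clarity": 3, "format": 5, "reasoning": 2},
}

summary = summarize_scores(example)
# Print models from highest to lowest average.
for model, avg in sorted(summary.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {avg:.2f}")
```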


## 🎛 Temperature & Token Exploration

Test the same prompt under different settings to explore behavior.

**Suggested Ranges:**
- `temperature`: 0.3 (precise), 0.7 (balanced), 1.0 (creative)
- `max_tokens`: 100, 250, 400
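
The suggested ranges can be combined into a grid so every pairing is tested once. A small sketch, assuming the values listed above; `temperature_options`, `max_tokens_options`, and `settings_grid` are hypothetical names.

```python
from itertools import product

temperature_options = [0.3, 0.7, 1.0]
max_tokens_options = [100, 250, 400]

# One dictionary per (temperature, max_tokens) pairing: 3 x 3 = 9 runs.
settings_grid = [
    {"temperature": t, "max_tokens": m}
    for t, m in product(temperature_options, max_tokens_options)
]

for settings in settings_grid:
    print(settings)
```

Each dictionary can be unpacked into a chat completion call as keyword arguments, for example `**settings`. Nine runs per model adds up quickly on paid APIs, so trimming the grid for live demos may be sensible.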


In [None]:
# 🔁 Run GPT-4 with different temperature values
# Uses the openai>=1.0 client interface; `openai.ChatCompletion` was removed in that release.
client = openai.OpenAI(api_key=OPENAI_API_KEY)

temperatures = [0.3, 0.7, 1.0]
for temp in temperatures:
    print(f"\n--- Temperature: {temp} ---")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": full_prompt}
        ],
        temperature=temp,
        max_tokens=300
    )
    print(response.choices[0].message.content)


## 🧭 Additional Features (not implemented here)

- Add retry decorators using `tenacity` for production-grade API stability
- Fallback models if the main model fails
- `ipywidgets` interface for dropdown selection of prompts or model versions
- Visual charts with matplotlib/plotly for output length and scoring trends
- Cost estimator comparing GPT-4 vs GPT-3.5 vs local models
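
To give a feel for the retry and fallback items, here is a standard-library sketch. It does not use `tenacity`; it is a plain exponential-backoff loop standing in for the same idea, and `call_with_retry` and `first_success` are hypothetical helpers, not part of any library.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait with exponential backoff and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay


def first_success(callables, **retry_kwargs):
    """Try each callable in order (with retries) and return the first result."""
    last_error = None
    for fn in callables:
        try:
            return call_with_retry(fn, **retry_kwargs)
        except Exception as error:
            last_error = error  # remember the failure and move to the next model
    raise RuntimeError("All models failed") from last_error
```

In the notebook, each callable would wrap one model call, listed in order of preference. Catching every `Exception` is deliberately broad for a demo; a production version would likely retry only on rate-limit or network errors.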
