
# 🤖 Chatbot Experiment & Evaluation Notebook

This notebook evaluates chatbot performance on three criteria:
- Step-by-step decision-making (intent → tool → response)
- Policy matching accuracy
- Semantic similarity between expected and generated responses
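
To make the third criterion concrete, here is a minimal sketch of a cosine-similarity check against the 0.55 threshold drawn in the plots in this notebook. It assumes the expected and generated responses have already been turned into embedding vectors. The embedding model, the helper names `cosine_similarity` and `response_ok`, and the use of `>=` rather than `>` at the threshold are all assumptions for illustration, not details taken from the evaluation pipeline.

```python
import math

SEMANTIC_THRESHOLD = 0.55  # cutoff shown in this notebook's plots


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def response_ok(expected_vec, generated_vec, threshold=SEMANTIC_THRESHOLD):
    """Hypothetical pass/fail rule: similarity at or above the threshold."""
    return cosine_similarity(expected_vec, generated_vec) >= threshold
```

Identical directions score 1.0 and orthogonal vectors score 0.0, so a threshold of 0.55 sits a little above the midpoint of that range.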


In [None]:

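import json
import pandas as pd
import matplotlib.pyplot as plt

# End-to-end results: one record per test case (intent, tool, response, confidence flags).
with open("evaluation/semantic_eval_results.json", encoding="utf-8") as f:
    sem = pd.DataFrame(json.load(f))

# Policy-matching results: semantic_score and response_ok per test case.
with open("evaluation/policy_eval_semantic.json", encoding="utf-8") as f:
    pol = pd.DataFrame(json.load(f))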


In [None]:

plt.figure(figsize=(8, 4))
plt.hist(sem["semantic_score"], bins=10, color="lightblue", edgecolor="black")
plt.axvline(0.55, color="red", linestyle="--", label="Threshold = 0.55")
plt.title("LLM Semantic Score Distribution")
plt.xlabel("Cosine Similarity")
plt.ylabel("Test Count")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:

# Each *_ok column holds one pass/fail flag per test, so its mean is the pass rate.
metrics = {
    "Intent Accuracy": sem["intent_ok"].mean(),
    "Tool Accuracy": sem["tool_ok"].mean(),
    "Response Accuracy": sem["response_ok"].mean(),
    "Confidence ≥ 0.75": sem["confidence_ok"].mean(),
    "End-to-End Accuracy": sem["passed"].mean(),
}

plt.figure(figsize=(8, 4))
# list(...) hands matplotlib a plain sequence of category labels.
plt.bar(list(metrics.keys()), [v * 100 for v in metrics.values()], color="seagreen")
plt.ylabel("Accuracy (%)")
plt.title("End-to-End Evaluation Metrics")
plt.xticks(rotation=45)
plt.ylim(0, 110)
plt.grid(axis="y")
plt.tight_layout()
plt.show()

metrics


In [None]:

labels = [f"Test {i+1}" for i in range(len(pol))]
colors = ['green' if r else 'red' for r in pol["response_ok"]]

plt.figure(figsize=(10, 4))
plt.bar(labels, pol["semantic_score"], color=colors)
plt.axhline(0.55, linestyle='--', color='gray', label="Semantic Threshold (0.55)")
plt.title("Policy Response Semantic Confidence")
plt.xlabel("Test Case")
plt.ylabel("Cosine Similarity")
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


In [None]:

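# Cast to bool so `~` is a logical NOT even if "passed" loaded with a non-boolean dtype.
failed_cases = sem[~sem["passed"].astype(bool)]
failed_cases[["message", "predicted_intent", "predicted_tool", "semantic_score", "matched_sentence"]]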


In [None]:

# Number tests by position, not by index label, so the count runs 1..N for any index.
for n, (_, row) in enumerate(sem.iterrows(), start=1):
    print(f"--- Test {n} ---")
    print(f"Message: {row['message']}")
    print(f"Predicted Intent → {row['predicted_intent']}")
    print(f"Tool Used        → {row['predicted_tool']}")
    print(f"Matched Sentence → {row['matched_sentence']}")
    print(f"Semantic Score   → {row['semantic_score']:.2f}")
    print(f"Passed           → {'✅' if row['passed'] else '❌'}\n")



## 🔍 Summary & Insights

**Performance:**
- ✅ Intent recognition accuracy is high
- ⚠️ Tool routing failed in 1 of the 6 test cases
- ✅ Most LLM replies are semantically close to the expected responses
- ✅ 5 out of 6 tests passed end-to-end
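
Claims like these can be traced back to the per-test flags. The sketch below counts which stage fails first in each test. It works on plain records (dictionaries) and uses the flag names from the metrics cell. The intent → tool → response order follows the introduction, while checking `confidence_ok` last and treating a missing flag as a failure are assumptions made for this example. The helper names are hypothetical.

```python
# Stage flags in the order the pipeline is assumed to evaluate them.
STAGES = ("intent_ok", "tool_ok", "response_ok", "confidence_ok")


def first_failing_stage(record):
    """Return the name of the first failed stage, or None if every stage passed."""
    for stage in STAGES:
        if not record.get(stage, False):  # assumption: a missing flag counts as a failure
            return stage
    return None


def failure_breakdown(records):
    """Count tests by the first stage at which they failed."""
    counts = {}
    for record in records:
        stage = first_failing_stage(record)
        if stage is not None:
            counts[stage] = counts.get(stage, 0) + 1
    return counts
```

With the DataFrame loaded at the top of this notebook, `failure_breakdown(sem.to_dict("records"))` would give the breakdown for the actual results.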

**Suggestions:**
- Improve tool routing by giving the routing logic more context
- Tune or summarize LLM responses so they stay closer to the reference wording, which should raise the semantic score
- Use multiple reference phrases per test case to improve test coverage
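
The last suggestion could look roughly like the sketch below: score a generated reply against several reference phrases and keep the best match. The `similarity` argument is a placeholder for whatever similarity function the evaluation uses. `token_overlap` is only a toy stand-in (Jaccard overlap of lowercase tokens) so the example runs without an embedding model. Both helper names are hypothetical.

```python
def best_reference_match(generated, references, similarity):
    """Score `generated` against every reference; return (best_score, best_reference)."""
    if not references:
        raise ValueError("at least one reference phrase is required")
    scored = [(similarity(generated, ref), ref) for ref in references]
    return max(scored, key=lambda pair: pair[0])


def token_overlap(a, b):
    """Toy similarity for illustration only: Jaccard overlap of lowercase tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

Taking the maximum means a reply only needs to be close to one acceptable phrasing, so valid paraphrases are less likely to fall under the threshold.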
