In [None]:
%pip install -U fabric-data-agent-sdk

# ✅ TASK OVERVIEW

**🎯 Goal**  
Configure your **Data Agent Artifact** using AI instructions, data source setup, and few-shot examples, aiming for **100% correctness** on every question for the dataset of your choice.

## 📝 Prerequisite
**Complete Step 1 of the "Lab 3 Curate Your Data Agent" document before you continue.**  
This notebook uses the Data Agent created in that step.

---

## 🧪 STEP 1: Use the Evaluation Notebook

1. **Update the `YOUR_DATA_AGENT_ARTIFACT_NAME` variable**  
   - Replace the placeholder with the name of the Data Agent Artifact you created in Step 1 of the "Lab 3 Curate Your Data Agent" document.
2. **Update the `YOUR_ACCOUNT_NUMBER` variable**
   - Replace the placeholder with your account number. It is used to name your evaluation results table.

In [None]:
from fabric.dataagent.client import FabricDataAgentManagement
from fabric.dataagent.evaluation import evaluate_data_agent, get_evaluation_summary, get_evaluation_details, get_evaluation_summary_per_question
#from fabric.dataagent.evaluation.evaluator_api import add_ground_truth_batch
import openai
import os
import re

YOUR_DATA_AGENT_ARTIFACT_NAME = "EvalAgent-AGNT###"

YOUR_ACCOUNT_NUMBER = "###"

os.environ["OPENAI_API_VERSION"] = "2023-05-15"

data_agent = FabricDataAgentManagement(YOUR_DATA_AGENT_ARTIFACT_NAME)

In [None]:
SELECTABLE_ELEMENT_TYPES = [
    "semantic_model.table",
    "lakehouse_tables.table",
    "warehouse_tables.table",
    "kusto.table",
]

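tables_list = []    # display names of every selected table
selected_dbs = []   # display names of top-level elements with at least one selected table

# Configuration of the first data source attached to the agent
res = data_agent.get_datasources()[0].get_configuration()

def is_child_selected(elements) -> bool:
    """Return True if any table in `elements`, or nested below them, is selected."""
    if not elements:
        return False

    found = False
    for elem in elements:
        if elem.get('type') in SELECTABLE_ELEMENT_TYPES and elem.get("is_selected"):
            tables_list.append(elem.get("display_name"))
            found = True
        # Recurse so that tables nested more than one level down are also found
        if is_child_selected(elem.get('children')):
            found = True
    return found

for element in res.get('elements'):
    if is_child_selected(element.get('children')):
        selected_dbs.append(element.get('display_name'))

# Build a SQL-style list such as ("db1", "db2") for the ground-truth filter in Step 2
selectedListStr = '(' + ', '.join(f'"{item}"' for item in selected_dbs) + ')'
print(selectedListStr)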


## 🔄 STEP 2: Understanding the Ground Truth

This step loads the **ground truth** for the data sources selected in your Data Agent and displays the **expected answers** and their associated **queries**.

💡 This is the reference we will use to evaluate your Data Agent's performance.
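
A ground-truth file of this kind can be pictured with a small stand-alone sketch. The sample rows and the column names `question`, `expected_answer`, and `query` are assumptions for illustration; only the `benchmark` column appears in this notebook. The sketch uses Python's standard `csv` module rather than Spark, and shows a query that spans two lines inside one quoted field, which is the situation a multi-line CSV option is meant to handle.

```python
import csv
import io

# Hypothetical ground-truth rows. Only the `benchmark` column name is taken
# from this notebook; the other column names and all values are made up.
SAMPLE_CSV = '''question,expected_answer,query,benchmark
"How many orders were placed?","42","SELECT COUNT(*)
FROM orders","sales_db"
"How many customers exist?","7","SELECT COUNT(*)
FROM customers","crm_db"
'''

def load_ground_truth(csv_text, benchmarks):
    """Parse CSV text and keep rows whose `benchmark` is in `benchmarks`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["benchmark"] in benchmarks]

rows = load_ground_truth(SAMPLE_CSV, {"sales_db"})
for row in rows:
    print(row["question"], "->", row["expected_answer"])
```

Filtering on `benchmark` keeps only the questions that belong to the data sources selected in the agent, so the agent is not scored on data it cannot reach.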

In [None]:
csv_path = "Files/ground_truth.csv" 

groundtruth_df = (spark.read
    .option("header", "true")
    .option("multiLine", "true")
    .csv(csv_path)
    .filter(f"benchmark IN {selectedListStr}")
    .limit(1000)
)

# Convert to a pandas DataFrame, which is what is passed to evaluate_data_agent in Step 3
df_pandas_groundtruth = groundtruth_df.toPandas()

display(df_pandas_groundtruth)

## 📊 STEP 3: Evaluation

This is the **actual evaluation step**, where a new **conversation thread** is started with the Data Agent for **each question**.

- We use an **LLM-critic approach** to assess the agent’s answer against the expected answer.  
  > 🧠 *LLM-critic*: A language model is used to **judge** whether the generated response matches the expected one — even when the phrasing differs.
  
- ✅ **Note**: We **do not compare** the SQL queries. Only the **final answer text** is evaluated.
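
The LLM-critic idea can be sketched in a few lines. This is a rough illustration only: the prompt wording, the `build_critic_prompt` and `parse_verdict` helpers, and the yes/no/unclear reply format are all hypothetical, and the prompt the SDK actually sends is not shown in this notebook. No model is called here; the sketch only shows how a question, an expected answer, and an actual answer could be packed into a grading prompt, and how a reply could be mapped to a verdict.

```python
def build_critic_prompt(question, expected_answer, actual_answer):
    """Assemble a hypothetical grading prompt for a judge model."""
    return (
        "You are grading a data agent's answer.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected_answer}\n"
        f"Actual answer: {actual_answer}\n"
        "Reply with exactly one word: yes, no, or unclear."
    )

def parse_verdict(reply):
    """Map the judge's reply to a verdict: T (match), F (mismatch), ? (undecided)."""
    word = reply.strip().lower().rstrip(".")
    if word == "yes":
        return "T"
    if word == "no":
        return "F"
    return "?"

prompt = build_critic_prompt(
    "How many orders were placed?", "42", "There were 42 orders in total."
)
print(prompt)
print(parse_verdict("Yes."))
```

Because only the answer text goes into the prompt, two different SQL queries that produce the same answer would be judged the same way.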

---

### 📄 Evaluation Output Columns

| Column           | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| **T**            | ✅ Number of correct answers (matches expected output)                      |
| **F**            | ❌ Number of incorrect answers                                               |
| **?**            | 🤔 Unclear — the LLM critic couldn’t decide                                 |
| **%**            | ✔️ Percentage of T out of total (T + F + ?)                                 |
| **failed thread**| 🔗 Links to failed threads (for manual inspection)                          |
| **question**     | 📝 The question from the ground truth                                       |
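
As a worked example of the **%** column, the sketch below applies the formula from the table, T divided by (T + F + ?). The function name `accuracy_percent` is hypothetical, and how the SDK rounds or formats the value is not stated here, so the result is left unrounded.

```python
def accuracy_percent(t, f, unclear):
    """Percentage of correct answers out of all evaluated answers (T + F + ?)."""
    total = t + f + unclear
    if total == 0:
        return 0.0  # nothing evaluated yet; avoid dividing by zero
    return 100.0 * t / total

# 8 correct, 1 incorrect, 1 unclear out of 10 answers
print(accuracy_percent(8, 1, 1))  # 80.0
```

Unclear verdicts count against the percentage, so a run reaches 100% only when every answer is judged correct.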

> 🔁 You can use `num_query_repeat` to ask the same question multiple times to assess variability in responses.

---

### ✅ Next Steps

- Publish your Data Agent (the evaluation runs against the `production` stage)
- Inspect the threads that **have mismatches**
- Update:
  - AI instructions
  - Data source instructions
  - Few-shot examples  
- **Re-run the evaluation**


In [None]:

# The table used to store evaluation results
evaluation_table_name = f'evaluation_output_{YOUR_ACCOUNT_NUMBER}'

# run evaluation

eval_id = evaluate_data_agent(
    df = df_pandas_groundtruth,
    data_agent_name = YOUR_DATA_AGENT_ARTIFACT_NAME,
    data_agent_stage = "production",
    table_name = evaluation_table_name)

df = get_evaluation_summary_per_question(eval_id, verbose=True, table_name = evaluation_table_name)