## Evaluation

### Intrinsic Evaluation

We measure the discrimination abilities of LLMs with four intrinsic metrics:

1. **Discrimination Accuracy (Acc)**: Given pairs consisting of one correct and one wrong program, we calculate the percentage of pairs in which the correct program obtains a higher discrimination score than the wrong one.

2. **Classification Macro F1 (F1)**: We treat "correct" and "wrong" as two classes and compute the macro average of F1 scores on these two labels.

3. **Hit@1 (H@1)**: Given batches of candidate programs, we calculate the percentage of batches in which the highest-scoring candidate is correct.

4. **Mean Reciprocal Rank (MRR)**: We compute the standard MRR score using the rank of the highest-ranking correct program in each batch.
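
The four metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind `eval_intrinsic`: the function names, the 0.5 classification threshold, and the choice to give a batch with no correct candidate a reciprocal rank of 0 are all assumptions made for the example.

```python
from statistics import mean

def discrimination_accuracy(pairs):
    """pairs: list of (score_of_correct, score_of_wrong) tuples."""
    return mean(1.0 if c > w else 0.0 for c, w in pairs)

def f1(labels, preds, positive):
    """F1 for one class; defined as 0.0 when the class is never predicted or present."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(labels, preds):
    """Unweighted mean of the F1 scores of the "correct" (1) and "wrong" (0) classes."""
    return (f1(labels, preds, 1) + f1(labels, preds, 0)) / 2

def hit_at_1(batches):
    """batches: list of batches, each a list of (score, label) tuples."""
    return mean(float(max(b, key=lambda x: x[0])[1] == 1) for b in batches)

def mrr(batches):
    """Mean reciprocal rank of the highest-ranking correct candidate per batch."""
    total = 0.0
    for b in batches:
        ranked = sorted(b, key=lambda x: x[0], reverse=True)
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                total += 1.0 / rank
                break  # assumption: a batch with no correct candidate adds 0
    return total / len(batches)

# Toy data (hypothetical scores and labels, for illustration only).
batches = [
    [(0.9, 1), (0.4, 0), (0.2, 0)],
    [(0.7, 0), (0.6, 1), (0.1, 0)],
]
pairs = [(0.9, 0.4), (0.9, 0.2), (0.6, 0.7), (0.6, 0.1)]
labels = [label for b in batches for _, label in b]
preds = [int(score >= 0.5) for b in batches for score, _ in b]  # assumed threshold

acc = discrimination_accuracy(pairs)
macro = macro_f1(labels, preds)
h1 = hit_at_1(batches)
mean_rr = mrr(batches)
print(acc, round(macro, 4), h1, mean_rr)
```

In the toy data, the second batch ranks a wrong candidate first, so it lowers Hit@1 and contributes a reciprocal rank of 1/2 to MRR.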


Legacy script: `scripts\intrin_eval\intrin_eval_text2sql_ft.sh`

### Datasets
1. [Spider](https://yale-lily.github.io/spider)

In [1]:
import os
from evaluators.llm_evaluator import LLMEvaluator, LLMLoraEvaluator
from utils.functions import eval_intrinsic, set_seed_all


In [2]:
evaluator_names = [
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "stabilityai/stable-code-3b",
    "deepseek-ai/deepseek-coder-1.3b-base"
]


# Derive the LoRA checkpoint name for each model, e.g. "stable-code-3b_spider".
lora_model_names = [m.split("/")[1] + "_spider" for m in evaluator_names]

In [3]:
model_indx = 2  # index into evaluator_names of the model to evaluate

evaluator_name = evaluator_names[model_indx]
model_savename = lora_model_names[model_indx]
print(f"evaluator_name: {evaluator_name}")
print(f"model_savename: {model_savename}")

# Paths
current_directory = os.getcwd()
model_savedatapath = os.path.join(current_directory, "checkpts", model_savename, "model")
evaluator_peft_dir = model_savedatapath  # LoRA adapter directory, used by LLMLoraEvaluator

seed = 42
test_fname = "data/spider_intrin_eval.json"
log_name = f"{model_savename}_pro.json"
dataset_name = "spider"
db_path = "spider/database"
evaluation_config = "evaluation_configs/pro.json"

"""
yes_token_indx:
    Vocabulary index of the token that corresponds to the text "Yes".
    CodeLlama-Instruct: "No" 1939, "Yes" 3869
    TinyLlama: "Yes" 3869
"""
yes_token_indx = None  # e.g. 3869 for TinyLlama

evaluator_name: stabilityai/stable-code-3b
model_savename: stable-code-3b_spider


In [4]:
# set seed
set_seed_all(seed)

In [5]:
evaluator = LLMEvaluator(evaluator_name, db_path, device="cuda", yes_token_indx=yes_token_indx)
# To evaluate the LoRA fine-tuned model instead, use:
# evaluator = LLMLoraEvaluator(evaluator_name, evaluator_peft_dir, db_path, device="cuda", yes_token_indx=yes_token_indx)

yindx = evaluator.get_yes_token()
print(f"Yes token index: {yindx}")

Yes token index: 4374
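
The notebook does not show how the evaluator uses the "Yes" token index. One common way to turn it into a discrimination score is to take the softmax probability the model assigns to that token at the answer position. The sketch below illustrates that idea under this assumption. `yes_probability` and the toy logits are hypothetical and are not part of `evaluators.llm_evaluator`.

```python
import math

def yes_probability(logits, yes_token_indx):
    """Softmax probability of the token at yes_token_indx (hypothetical helper)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[yes_token_indx] / sum(exps)

# Toy next-token logits over a three-token vocabulary (illustrative values only).
toy_logits = [0.0, 2.0, 1.0]
score = yes_probability(toy_logits, yes_token_indx=1)
print(f"{score:.4f}")
```

Under this assumption, a candidate program whose prompt leads the model to put more probability mass on "Yes" receives a higher score.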


In [7]:
eval_intrinsic(evaluator, test_fname, evaluation_config, log_name)

100%|██████████| 400/400 [01:15<00:00,  5.32it/s]


Pair Count: 409
PWS Acc: 0.8313              
SQL Count: 1221
Pos F1: 0.0000              
Neg F1: 0.7419              
Macro F1: 0.3709              
Hit @ 1: 0.6350              
MRR: 0.6727              
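
As a quick consistency check on the numbers above, Macro F1 is the unweighted mean of the two per-class F1 scores. The snippet below only redoes that arithmetic with the reported (rounded) values.

```python
pos_f1 = 0.0000  # reported F1 for the "correct" class
neg_f1 = 0.7419  # reported F1 for the "wrong" class

macro_f1 = (pos_f1 + neg_f1) / 2
print(f"{macro_f1:.5f}")  # agrees with the reported 0.3709 up to rounding
```

A Pos F1 of 0 means there were no true positives for the "correct" class, which suggests that in this run the base model did not classify any correct program as correct.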



### Spider Dataset Keys Explanation

Each example in the Spider evaluation data contains the following keys, which are used to evaluate text-to-SQL models:

- **`db_id`**: The unique identifier of the database for the given query. This indicates which database schema the question belongs to.

- **`schema`**: The schema of the database, which includes information about tables, columns, and their relationships. This helps models understand the database structure.

- **`question`**: The natural language question asked by the user.  
  *Example:*  
  *"What is the name of the youngest employee?"*

- **`sql`**: The ground truth SQL query corresponding to the question.  
  *Example:*  
  ```sql
  SELECT name FROM employees ORDER BY age ASC LIMIT 1;
  ```

- **`exec_res`**: The execution result of the ground truth SQL query. This contains the actual output of running the query on the database.

- **`top_n`**: A list of the **top-N SQL completions** (candidate queries) generated by the model. These are ranked based on the model’s confidence scores.

- **`top_n_exec_res`**: The execution results of the **top-N SQL completions**. These contain the actual database outputs of the model’s predicted queries.

- **`top_n_label`**: A list of binary labels (`0` or `1`) for each SQL candidate in `top_n`.  
  - `1` → The query is **correct** (produces the expected output).  
  - `0` → The query is **incorrect** (does not produce the expected output).
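
To make the keys above concrete, here is a sketch of what a single record might look like. All values are hypothetical: the database name, the schema string format, and the shape of the execution results are assumptions, and the real file may encode them differently.

```python
# A hypothetical record using the keys described above (values are illustrative).
record = {
    "db_id": "company_1",
    "schema": "employees(id, name, age)",
    "question": "What is the name of the youngest employee?",
    "sql": "SELECT name FROM employees ORDER BY age ASC LIMIT 1;",
    "exec_res": [["Alice"]],
    "top_n": [
        "SELECT name FROM employees ORDER BY age DESC LIMIT 1;",
        "SELECT name FROM employees ORDER BY age ASC LIMIT 1;",
    ],
    "top_n_exec_res": [[["Bob"]], [["Alice"]]],
    "top_n_label": [0, 1],
}

# Assumption: a candidate is labeled 1 when its execution result equals the
# execution result of the ground truth query.
derived_labels = [
    int(res == record["exec_res"]) for res in record["top_n_exec_res"]
]

# Collect the candidates marked correct.
correct_candidates = [
    sql for sql, label in zip(record["top_n"], record["top_n_label"]) if label == 1
]
print(derived_labels, correct_candidates)
```

The three `top_n*` lists are parallel: position `i` in each refers to the same candidate query.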