### Zero-Shot Notebook (Benchmarking Pipeline):

The following notebook is used to answer Research Question #1 (RQ1) in the SLM Threat Query Analysis Project.

### Step #1: Import Packages:

Import all the packages needed to run the notebook. To install them, run the following command in any notebook cell (drop the leading ```!``` if you run it in your Terminal instead):

```!pip install -r ../requirements.txt```

In [None]:
# Run this cell:
import pandas as pd
from kql_benchmarking_pipeline import KQLBenchmarkPipeline

import re
import time
import os
import tiktoken
import json
import yaml

from ZeroShot import *

with open("../config.yaml", "r") as f:
    token_config = yaml.safe_load(f)

### Step #2: Import Evaluation Dataset:

Read the Evaluation Dataset into a DataFrame, and create a new DataFrame to store the results.

In [None]:
# Run this cell:

eval_df = pd.read_json(path_or_buf='../NL2KQL_Remakes/data/evaluation/Defender_Evaluation.jsonl', lines=True)
df = pd.DataFrame(columns = ['NLQ', 'KQL'])

### Step #3: Specify the Model:

Next, specify which model to use. Several options are available; some examples are listed below:

- HuggingFace-Based Models: 
    - "google/gemma-3-1b-it" (Google's Gemma-3-1B-IT model)
    - "google/gemma-3-4b-it" (Google's Gemma-3-4B-IT model)
    - "microsoft/phi-4" (Microsoft's Phi-4 model)
    - "microsoft/phi-4-mini-instruct" (Microsoft's Phi-4-Mini-Instruct model)

- GenAI-Based Models:
    - "gemini-2.0-flash" (Google's Gemini 2.0 Flash model)

- OpenAI-Based Models:
    - "gpt-4o" (OpenAI's GPT-4o model)
    - "gpt-5" (OpenAI's GPT-5 model)

You can use models that are not in these lists, but you must set the ```mode``` variable to match how the model is loaded: ```huggingface``` for models loaded from HuggingFace, ```openai``` for OpenAI models, and ```genai-client``` for models accessed through the Google GenAI client. Then set the ```model_name``` variable below to the model you want to use.

If you plan to use a different model, **YOU MAY NEED TO ADD FILTERING LOGIC TO THE ```ZeroShot.py``` FILE**, which is also present in the ```Benchmarking_Pipeline``` folder. Although most of the code generalizes across models, edge cases do exist:

In [None]:
model_name = "google/gemma-3-1b-it"

# Three possible options for mode: huggingface, openai (only for GPT-models), genai-client (only for Gemini-based models):
mode = "huggingface"

# Specify the folder in which to store the .yaml files (must be filled in; os.makedirs raises an error on an empty string):
results_folder = ""
os.makedirs(results_folder, exist_ok = True)
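Since ```mode``` accepts only the three values described above, a mistyped value can be caught before any model is loaded. The following is a minimal sketch and not part of the original notebook; ```VALID_MODES``` and ```validate_mode``` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper; the three mode strings are the ones listed in this step.
VALID_MODES = ("huggingface", "openai", "genai-client")


def validate_mode(mode: str) -> str:
    """Return mode unchanged if it is supported, otherwise raise ValueError."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported mode {mode!r}; expected one of {VALID_MODES}")
    return mode


print(validate_mode("huggingface"))
```

Calling ```validate_mode(mode)``` right after setting ```mode``` would fail fast instead of partway through a run.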

### Step #4: Run the Pipeline:

The following cell runs the entire pipeline and collects the results.

In [None]:
latencies = []
input_tokens = []
output_tokens = []

# If it is a GPT Model, only run the results ONCE:
if mode == "openai":

    for i in range(0, 1):
        client = ZeroShot(model_name, mode, eval_df)
        pipeline = KQLBenchmarkPipeline(client)
        pipeline.run()
        pipeline.save_results(f"{results_folder}/{model_name}-zero-shot-{i}.yaml")
        latencies.append((sum(client.times))/(len(client.times)))

        revised_name = model_name

# All other models - run five times:
else:
    for i in range(0, 5):
        client = ZeroShot(model_name, mode, eval_df)
        pipeline = KQLBenchmarkPipeline(client)
        pipeline.run()
        
        if mode == 'huggingface':
            revised_name = re.search(r'\/(.*)', model_name, flags=re.DOTALL).group(1)
        else:
            revised_name = model_name
            
        pipeline.save_results(f"{results_folder}/{revised_name}-zero-shot-{i}.yaml")
        latencies.append((sum(client.times))/(len(client.times)))

        # Take the input/output tokens, and move the model to cpu:
        if mode == 'huggingface':
            input_tokens.append(client.input_tokens)
            output_tokens_local = 0
            for idx, row in client.df.iterrows():
                output_tokens_local = output_tokens_local + len(client.tokenizer(row['Full Response'][0]['generated_text'])["input_ids"])
            
            output_tokens.append(output_tokens_local)
            client.model = client.model.to("cpu")

        if mode == 'genai-client':
            genai_client = genai.Client(api_key = token_config['genai']['token'], http_options=HttpOptions(timeout=3*60*1000))
            
            input_tokens_local = 0
            output_tokens_local = 0
            count = 0
            
            for idx, row in client.df.iterrows():
                # Count tokens with the model that was actually benchmarked:
                input_tokens_local += genai_client.models.count_tokens(model=model_name, contents = f"You are a programmer using the Kusto Query Language with Microsoft Defender. Generate a KQL query that answers the following request. Return only the KQL code without any explanation. {eval_df.loc[count]['context']}").total_tokens
                output_tokens_local += genai_client.models.count_tokens(model=model_name, contents = str(row['Full Response'])).total_tokens
                count = count + 1

            input_tokens.append(input_tokens_local)
            output_tokens.append(output_tokens_local)
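The naming of the per-run result files in the cell above can be sketched as a small standalone function. This is an illustrative sketch only; ```result_path``` is a hypothetical helper, and it uses ```str.split``` in place of the regex, which gives the same result for identifiers of the form ```org/model```.

```python
import os


def result_path(results_folder: str, model_name: str, mode: str, run_idx: int) -> str:
    """Build the per-run YAML path, e.g. <folder>/<name>-zero-shot-<i>.yaml."""
    # HuggingFace ids look like "org/model"; keep only the part after the first "/".
    if mode == "huggingface" and "/" in model_name:
        short_name = model_name.split("/", 1)[1]
    else:
        short_name = model_name
    return os.path.join(results_folder, f"{short_name}-zero-shot-{run_idx}.yaml")


print(os.path.basename(result_path("results", "google/gemma-3-1b-it", "huggingface", 0)))
```

Stripping the organization prefix keeps the ```/``` in a HuggingFace identifier from being interpreted as a subdirectory when the file is written.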

In [None]:
num_queries = len(client.df)           # or client.df.shape[0]
print("Processed queries:", num_queries)

In [None]:
print(len(latencies))

### Step #5: Merge Results Together:

The following cell merges the per-run results into a single YAML file. Please note that although the code performs extensive regex cleaning, manual review is still required to check for any extraneous characters that might affect the metric scores.

In [None]:
import glob, yaml

# Merge any shards you have (adjust the glob if needed)
paths = sorted(glob.glob(f"{results_folder}/{revised_name}-zero-shot-*.yaml"))

merged = {"queries": []}
for p in paths:
    with open(p, "r") as f:
        y = yaml.safe_load(f) or {}
        merged["queries"].extend(y.get("queries", []))

with open(f"{results_folder}/{revised_name}-zero-shot.yaml", "w") as f:
    yaml.dump(merged, f, sort_keys=False, default_style='|', allow_unicode=True, width=1000)
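To narrow down the manual review mentioned above, candidate entries can be flagged automatically first. The sketch below is an assumption-laden illustration, not the notebook's cleaning logic: it assumes each generated query is available as a plain string, and ```flag_suspicious``` is a hypothetical helper. It only marks entries for a human to inspect.

```python
import re

# A Markdown code fence (three backticks), built programmatically.
FENCE = "`" * 3
# Anything outside tab, newline, carriage return, and printable ASCII.
NON_PRINTABLE = re.compile(r"[^\x09\x0A\x0D\x20-\x7E]")


def flag_suspicious(queries):
    """Return indices of queries containing code fences or unusual characters."""
    flagged = []
    for i, query in enumerate(queries):
        if FENCE in query or NON_PRINTABLE.search(query):
            flagged.append(i)
    return flagged


sample = ["DeviceEvents | take 10", FENCE + "kql\nDeviceEvents\n" + FENCE]
print(flag_suspicious(sample))
```

A query that legitimately contains non-ASCII text (for example inside a string literal) would also be flagged, so the output is a review list rather than a list of errors.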

### Step #6: Latency Output:

Prints the average latency across runs, computed as the mean of each run's average per-query latency.

In [None]:
# Run this cell:
print(f"Avg. Latency: {round((sum(latencies))/(len(latencies)), 3)}")
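Because ```latencies``` holds one average per run, the printed value is a mean of run means. This equals the mean over all individual queries only when every run recorded the same number of timings. The sketch below illustrates the difference with made-up numbers; the function names and the sample data are hypothetical.

```python
from statistics import mean

# Hypothetical per-query latencies for two runs; the second run has fewer entries.
runs = [[1.0, 3.0], [5.0]]


def mean_of_run_means(runs):
    """Average each run first, then average the run averages (as the notebook does)."""
    return mean(mean(run) for run in runs)


def pooled_mean(runs):
    """Average over every individual query timing, regardless of run."""
    return mean(value for run in runs for value in run)


print(mean_of_run_means(runs), pooled_mean(runs))
```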

### Step #7: Cost Analysis:

Prints the average cost per run of processing all queries. We obtain the token counts differently for each mode:
1. ##### HuggingFace Models:
   - We count the input and output tokens for each run and store the per-run totals in list variables, from which the total cost is calculated.
  
2. ##### OpenAI Models:
   - We use the ```tiktoken``` package to count the OpenAI model's input and output tokens.
  
3. ##### GenAI-Client Models:
   - Due to the potential risk of crashing, we count tokens AFTER all results have been obtained.
  
In ALL cases, you must update the input token costs and output token costs (per million tokens).
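The cost formula used in the cell below can be written as a small function. This is a sketch for illustration; ```token_cost``` is a hypothetical name, and the prices in the example are placeholders rather than real pricing for any model.

```python
def token_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Cost of one run: tokens times the per-million-token price, for input and output."""
    input_cost = (input_price_per_million * input_tokens) / 1_000_000
    output_cost = (output_price_per_million * output_tokens) / 1_000_000
    return input_cost + output_cost


# Placeholder prices: 2.0 per million input tokens, 8.0 per million output tokens.
print(token_cost(1_000_000, 500_000, 2.0, 8.0))
```

With both prices left at 0, as in the default cell, the reported cost is always 0, which is why the values must be updated for each model.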

In [None]:
# Note: YOU MUST CHANGE THESE VALUES BASED ON MODEL PRICING:
llm_input_cost_per_million = 0
llm_output_cost_per_million = 0
cost = 0

if mode == 'openai':

    input_tokens = 0
    output_tokens = 0
    count = 0
    
    encoding = tiktoken.encoding_for_model(revised_name)

    # Count the input tokens for one prompt per processed query:
    for i in range(len(client.df)):
        input_tokens = input_tokens + len(encoding.encode(f"You are a programmer using the Kusto Query Language with Microsoft Defender. Generate a KQL query that answers the following request.  Return only the KQL code without any explanation. {eval_df.loc[count]['context']}"))
        count = count + 1

    for entry in client.df['Full Response']:
        output_tokens = output_tokens + len(encoding.encode(entry))
    
    cost = ((llm_input_cost_per_million * input_tokens)/1000000) + ((llm_output_cost_per_million * output_tokens)/1000000)
    avg_cost = cost/1
    
    print(f"Average Total Cost: ${round(avg_cost, 3)}")

elif mode == 'huggingface' or mode == 'genai-client':

    for entry in list(zip(input_tokens, output_tokens)):
        cost += ((llm_input_cost_per_million * entry[0])/1000000) + ((llm_output_cost_per_million * entry[1])/1000000)

    avg_cost = cost/len(input_tokens)  # one entry per completed run
    print(f"Average Total Cost: ${round(avg_cost, 3)}")    

### Step #8: Get Metrics:

The following cells compute all metrics for the results you have just generated. Please note that the full path **must** be specified for both ```file_of_interest``` and ```folder```. If you need help finding the full path, use the ```!pwd``` command in Jupyter Notebook.

In [None]:
# DO NOT CHANGE THE FOLLOWING VARIABLE:
runner = "../offline_metrics_pipeline/offline-metrics-pipeline/offline_metrics_runner.py"

# CHANGE THE FOLLOWING VARIABLES:

# This should point to where your .yaml is currently stored (entire path must be specified):
file_of_interest = ""

# These should point to where you would like to store your results (entire path must be specified):
folder = ""
results_file = "testing.csv"

In [None]:
!python {runner} {file_of_interest} {folder} {results_file}
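The ```!python``` line passes the variables through the shell, so a path containing spaces would be split into several arguments. One way around this is to pass the arguments as a list to ```subprocess.run```. The sketch below is an alternative to the cell above, not a replacement prescribed by the project; ```build_metrics_command``` is a hypothetical helper, and the argument order simply mirrors the command above.

```python
import subprocess
import sys


def build_metrics_command(runner, file_of_interest, folder, results_file):
    """Return the argument list for the metrics runner, one list item per argument."""
    # sys.executable is the Python interpreter running this notebook.
    return [sys.executable, runner, file_of_interest, folder, results_file]


cmd = build_metrics_command("runner.py", "/tmp/my results/run.yaml", "/tmp/out", "testing.csv")
print(cmd[1:])

# To actually launch the runner, uncomment the next line:
# subprocess.run(cmd, check=True)
```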