### NL2KQL (Generalized)

This notebook is a replication of [NL2KQL](https://arxiv.org/pdf/2404.02933). The pipeline has three main components:

- Schema Refiner
- Few-Shot Selector
- Prompt Builder

### Step #1: Import Necessary Packages

Import all the packages that are needed to run the notebook. To install them, run the following command in any notebook cell (or drop the leading `!` and run it in your terminal):

```!pip install -r ../requirements.txt```

In [None]:
# Import Necessary Packages:

import os
import yaml
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

# Machine Learning-based packages:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
from huggingface_hub import login
from helpers import *

import pathlib
import textwrap
import time
import re  # needed for the regex extraction steps below (pandas is already imported as pd)
import json
import random

import warnings
warnings.filterwarnings('ignore')

from google import genai as genai_client
from openai import OpenAI
import tiktoken

with open("../config.yaml", "r") as f:
    token_config = yaml.safe_load(f)

Below you will need to specify the mode and model that you wish to test. You may choose from any of the following modes:

```mode```:
- genai-client
- openai

Please note that the model name must be one that the selected API recognizes (for example, ```gemini-2.0-flash``` for ```genai-client```).

In [None]:
# Put the API and model that you wish to test:
mode = "genai-client"
model_name = "gemini-2.0-flash"

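# The Gemini client is always created; the pipeline below uses it for embeddings and for generation in genai-client mode:
client = genai_client.Client(api_key=token_config['genai']['token'])

# Second client handle for the selected mode; fail early on an unknown mode:
if mode == "genai-client":
    client_two = genai_client.Client(api_key=token_config['genai']['token'])
elif mode == "openai":
    client_two = OpenAI(api_key = token_config['openai']['token'])
else:
    raise ValueError(f"Unsupported mode: {mode!r}. Choose 'genai-client' or 'openai'.")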

# Path to prompt template:
prompt_template_path = "prompt_template.txt"

# Should be a boolean (True/False):
value_placeholder = True

### Step #2: Creating the Few-Shot Embedding Store Database

The first cell below creates the Few-Shot Embedding Store Database (FSDB). The FSDB holds a variety of KQL examples that can be provided to the LLM to improve the accuracy of the generated KQL. If an FSDB has already been created, you do not need to run the first cell (running it when an FSDB exists prints the message "FSDB already exists for Defender"). Instead, run the second cell, which reads in the Defender FSDB. The first cell is **commented out**; if you need to make changes to the FSDB, uncomment it and run it.

If you need to make any changes to how you generate the FSDB, see the ```helpers.py``` file.

NOTE: For the purposes of this project, we focus exclusively on Microsoft Defender for the time being.

In [None]:
# themes = [
#     "Explore: Look for signs or hints of a security attack",
#     "Expansion: Searches for additional contextual understanding",
#     "Detect: Look for events related to a security attack",
#     "Remediate: Identify events for a given entity or asset",
#     "Report: Provide summary statistics for reporting"
# ]

# schema_file = "defender_fsdb_new.json"
# fsdb = generate_fsdb(themes, schema_file)

In [None]:
with open('data/fsdb/defender_fsdb.json', 'r') as f:
    fsdb = json.load(f)
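
Each entry of `fsdb` is read later in the pipeline through the keys `nlq`, `kql`, and `tables`. As a minimal sketch of that shape and of the table-based filtering applied to it: the two entries below are made-up illustrations (not records from the Defender FSDB), and `filter_by_tables` is a hypothetical helper name.

```python
# Hypothetical FSDB entries; only the keys (nlq, kql, tables) mirror the pipeline.
example_fsdb = [
    {
        "nlq": "List sign-in events for a given account in the last day",
        "kql": "IdentityLogonEvents | where Timestamp > ago(1d) | where AccountName == 'alice'",
        "tables": ["IdentityLogonEvents"],
    },
    {
        "nlq": "Count emails received per sender domain",
        "kql": "EmailEvents | summarize count() by SenderFromDomain",
        "tables": ["EmailEvents"],
    },
]


def filter_by_tables(entries, selected_tables):
    """Keep entries that reference at least one of the selected tables."""
    selected = set(selected_tables)
    return [entry for entry in entries if selected.intersection(entry["tables"])]


matches = filter_by_tables(example_fsdb, ["EmailEvents", "DeviceProcessEvents"])
print([match["nlq"] for match in matches])
```

Filtering by table first keeps the later similarity search restricted to examples that can actually use the selected schema.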

### Step #3: Generating the Table Embedding Store (Semantic Data Catalog)

The Semantic Data Catalog contains two types of embeddings: Table Embeddings and Value Embeddings. In the next few steps, we first generate a table embedding dictionary and later build a value embedding dictionary to simulate the table embedding and value embedding stores discussed in NL2KQL:

Note: If a Table Embeddings file has already been created, run the third cell below instead to load the Table Embeddings directly from a JSON file:

In [None]:
with open('data/miscellaneous/defender.yml', 'r') as file:
    defender_information = yaml.safe_load(file)
    
defender_embeddings = generate_table_embeddings(defender_information)

In [None]:
with open('table_embeddings.json', 'w') as f:
    json.dump(defender_embeddings, f)

If you have already run the two cells above, run the following cell instead:

In [None]:
# RUN THIS CELL:

with open('data/embeddings/defender_table_embeddings.json', 'r') as f:
    table_embeddings = json.load(f)
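
With the table embeddings loaded, the table-selection step used later in the pipeline amounts to ranking tables by cosine similarity to the query embedding and keeping the top few. The sketch below illustrates the idea in plain Python with toy three-dimensional vectors; the notebook itself uses scikit-learn's `cosine_similarity`, and the names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, embeddings, k=9):
    """Return the k (name, similarity) pairs most similar to query_vec, best first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy vectors standing in for real table embeddings.
toy_embeddings = {
    "TableA": [1.0, 0.0, 0.0],
    "TableB": [0.0, 1.0, 0.0],
    "TableC": [0.7, 0.7, 0.0],
}
ranked = top_k([1.0, 0.1, 0.0], toy_embeddings, k=2)
print(ranked)
```

The pipeline keeps the top 9 tables; `k=2` here only keeps the toy example readable.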

### Step #4: Generating the Value Embedding Stores (Semantic Data Catalog)

For efficiency, we compute the embeddings of the columns only for the filtered tables. Normally it would be more efficient to compute and store all of them at once, but for this replication we generate the embeddings only once we have the desired tables. The reason is a request quota (1,500 requests per day) when querying Google Gemini 2.0, which is why we find the tables first and then their respective embeddings.

Because generating the embeddings takes some time, the code used to generate them is not included here (it is in a separate notebook). Instead, load the value embeddings with the cell below:

In [None]:
# RUN THIS CELL:

with open('data/embeddings/defender_value_embeddings.json', 'r') as f:
    value_embeddings = json.load(f)
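
Because the embedding API is rate limited, the pipeline below waits and retries once when an embedding call fails. A bounded retry helper is one way to express that pattern; the sketch below is an illustration only, with `call_with_retry` and the stand-in `flaky` function being hypothetical names rather than part of `helpers.py`.

```python
import time


def call_with_retry(fn, retries=3, wait_seconds=60.0, sleep=time.sleep):
    """Call fn(); on failure wait and retry, re-raising after the last attempt."""
    for attempt in range(1, retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait_seconds}s")
            sleep(wait_seconds)


# Stand-in for an API call: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


result = call_with_retry(flaky, retries=3, wait_seconds=0.0)
print(result)
```

In the notebook, `fn` could wrap a single embedding request, with `wait_seconds` chosen to respect the quota.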

### Step #5: Creating the Entire Pipeline

Now that we have the necessary components (Table Embeddings, Value Embeddings, Few-Shot Embeddings), we can build out the entire pipeline:

In [None]:
# Load in the Evaluation Dataset:
eval_df = pd.read_json(path_or_buf='data/evaluation/Defender_Evaluation.jsonl', lines=True)

# If you only plan on testing a subset of queries, then alter these accordingly:
queries = list(eval_df['context'])
baselines = list(eval_df['baseline'])
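
The cell above expects a JSON Lines file with one object per line carrying a `context` field (the natural language query) and a `baseline` field (the reference KQL). A minimal sketch of that layout, using the standard library and two made-up records rather than the real evaluation data:

```python
import io
import json

# Two hypothetical records in JSON Lines form (one JSON object per line).
sample_jsonl = io.StringIO(
    '{"context": "Count emails per sender domain", '
    '"baseline": "EmailEvents | summarize count() by SenderFromDomain"}\n'
    '{"context": "Show five device events", "baseline": "DeviceEvents | take 5"}\n'
)

records = [json.loads(line) for line in sample_jsonl if line.strip()]
sample_queries = [record["context"] for record in records]
sample_baselines = [record["baseline"] for record in records]
print(len(sample_queries), len(sample_baselines))
```

To test only a subset, slice both lists with the same indices so that queries and baselines stay aligned.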

In [None]:
# Results get stored in here:
df = pd.DataFrame()
query_count = 0
failed_queries = []

with open('data/DataCatalogs/Defender_DataCatalog.yml', 'r') as file:
    defender_catalog = yaml.safe_load(file)

elapsed_time = []
inputs = []

iterations = 0

if mode == 'genai-client':
    iterations = 5
else:
    iterations = 1

In [None]:
for i in range(0,iterations):

    for query_prompt in queries:
        
        llm_results_gemini = []
        fail = False
        
        # Generating the embedding for the query:
        query_response = get_query_embedding(query_prompt)
        
        # Get the relevant tables to the query:
        defender_embedding_vals = list(table_embeddings.values())
        cosine_similarities = [cosine_similarity(np.array(query_response).reshape(1,-1), np.array(entry).reshape(1,-1)) for entry in defender_embedding_vals]
        
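        # Each similarity is a 1x1 array; .item() extracts the scalar without relying on deprecated array-to-float conversion:
        cosine_similarities_vals = [entry.item() for entry in cosine_similarities]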
        top_9_idx = np.argsort(cosine_similarities_vals)[-9:]
        table_lst = list(table_embeddings.keys())
    
        filtered_tables = []
        for idx in top_9_idx:
            filtered_tables.append(table_lst[idx])
        
        # Get the relevant columns to the query:
        # with open('../../../../tmp/defender_value_embeddings_new_two.json', 'r') as f:
        #     value_embeddings = json.load(f)
        
        relevant_columns = dict()
    
        for k in filtered_tables:
            filter_lst = [entry for entry in list(value_embeddings.keys()) if k in entry]
            sub_dict = {key: value_embeddings[key] for key in filter_lst}
            
            cosine_similarities = []
            for key in sub_dict:
                cosine_similarity_val = cosine_similarity(np.array(value_embeddings[key]).reshape(1,-1), 
                                                          np.array(query_response).reshape(1,-1))
                cosine_similarities.append(cosine_similarity_val[0].item())
    
            top_5_cols_idx = np.argsort(cosine_similarities)[-5:]
    
            col_lst = list(sub_dict.keys())
            relevant_cols = []
            for idx in top_5_cols_idx:
                relevant_columns[col_lst[idx]] = cosine_similarities[idx]
        
        # Top 5 Values:
        final_vals = list(dict(sorted(relevant_columns.items(), key=lambda item: item[1], reverse=True)).keys())[0:5]
        final_vals_revised = []
        for entry in final_vals:
            # Keep only the value name; skip entries that do not carry one:
            match = re.search(r'Value Name:(.*)', entry)
            if match:
                final_vals_revised.append(match.group(1))
            
        # Filter Few-Shot by Top t tables:
        filtered_fsdb = [entry for entry in fsdb if len(set(entry['tables']).intersection(set(filtered_tables))) > 0]
        
        # Semantic Similarity Matching:
        # nlq, f = 2
    
        fsdb_embeddings = []
        fsdb_count = 0
    
        for entry in filtered_fsdb:
            try:
                fsdb_response = client.models.embed_content(model = 'text-embedding-004',
                                                            contents = f"{entry['nlq']}")
            except:
                print('Error found - retrying again')
                time.sleep(60)
                fsdb_response = client.models.embed_content(model = 'text-embedding-004',
                                                            contents = f"{entry['nlq']}")
                
            fsdb_embeddings.append({'NLQ': entry['nlq'], 'Embedding': fsdb_response.embeddings[0].values, 'KQL': entry['kql']})
            fsdb_count = fsdb_count + 1
            # print(f"FSDB Embeddings Processed: {fsdb_count}")
            time.sleep(2)
            
        # Similarity between the query and each few-shot NLQ (.item() extracts the scalar from the 1x1 result):
        cosine_similarities_fsdb = [
            {'NLQ': entry['NLQ'],
             'KQL': entry['KQL'],
             'Similarity': cosine_similarity(np.array(query_response).reshape(1,-1), np.array(entry['Embedding']).reshape(1,-1)).item()}
            for entry in fsdb_embeddings
        ]
    
        # Sort and find the Top 2 NLQ Entries:
        cosine_similarities_fsdb_sorted = sorted(cosine_similarities_fsdb, key=lambda x: x['Similarity'], reverse=True)[0:2]
    
        # Filter KQL Queries:
        cosine_similarities_fsdb_sorted = [{'NLQ': entry['NLQ'], 'KQL': entry['KQL'], 'Similarity': entry['Similarity']} for entry in cosine_similarities_fsdb_sorted]
        
        SCHEMA_PLACEHOLDER = ""
        EXAMPLES_PLACEHOLDER = ""
        USER_PLACEHOLDER = f"NLQ: {query_prompt} \n + KQL:"
        
        with open(prompt_template_path, 'r') as f:
            txt = f.read()
    
        # Add Table and Schema Information:
        for k in filtered_tables:
            temp_col_lst = []
            table = k
            
            for entry in defender_catalog:
                if entry['Name'] == table:
                    temp_col_lst = [subentry['Name'] for subentry in entry['Columns']]
                    
            #cols = relevant_columns[key]
            col_combined = ", ".join(temp_col_lst)
            SCHEMA_PLACEHOLDER += f"Table: {k}, Columns: {col_combined}"
            SCHEMA_PLACEHOLDER += "\n"
        
        txt = txt.replace('{{SCHEMA_PLACEHOLDER}}', SCHEMA_PLACEHOLDER)
        
        # Add Value Information (only if value_placeholder is set to True):
        if value_placeholder:
            txt = txt.replace('{{VALUES_PLACEHOLDER}}', str(final_vals_revised))
        
        # Add Examples:
        for entry in cosine_similarities_fsdb_sorted:
            EXAMPLES_PLACEHOLDER += f"NLQ: {entry['NLQ']} \n + KQL: {entry['KQL']}"
            EXAMPLES_PLACEHOLDER += "\n"
    
        txt = txt.replace('{{EXAMPLES_PLACEHOLDER}}', EXAMPLES_PLACEHOLDER)
        txt = txt.replace('{{USER_REQUEST_PLACEHOLDER}}', USER_PLACEHOLDER)
        
        if mode == 'genai-client':
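            start = time.time()
            try:
                response = client.models.generate_content(model = model_name, contents = txt)
                end = time.time()
            except Exception as exc:
                # Record an end time on failure too; otherwise `end` is undefined or stale when the latency is stored below:
                end = time.time()
                print(f"Error with {query_count}: {exc}")
                failed_queries.append(query_count)
                fail = True
                time.sleep(60)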
        
        elif mode == 'openai':
            start = time.time()
            response = client_two.chat.completions.create(
                model = model_name,
                messages = [
                    {"role": "user", "content": txt}
                ],
            ) 
            end = time.time()
            
        # Add the time:
        elapsed_time.append((end-start))
        
        # Add the tokens:
        inputs.append(txt)

        if mode == 'genai-client':
            if fail == False:
                llm_kql_query = response.text
            else:
                llm_kql_query = "Error"
        elif mode == 'openai':
            llm_kql_query = response.choices[0].message.content
            
        df = pd.concat([df, pd.DataFrame([{'NLQ': query_prompt, 'LLM-KQL': llm_kql_query}])], ignore_index=True)
        query_count += 1
        
        print(llm_kql_query)
        print(f"Queries processed: {query_count}")
        time.sleep(20)
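
The prompt-assembly step inside the loop above can be looked at in isolation. The sketch below fills the same four placeholders (`{{SCHEMA_PLACEHOLDER}}`, `{{VALUES_PLACEHOLDER}}`, `{{EXAMPLES_PLACEHOLDER}}`, `{{USER_REQUEST_PLACEHOLDER}}`); the template text, the `build_prompt` name, and the simplified `NLQ:`/`KQL:` line format are assumptions for illustration and do not reproduce `prompt_template.txt` or the loop's exact strings.

```python
def build_prompt(template, tables, examples, values, user_query):
    """Fill the four placeholders of a prompt template and return the result."""
    schema = "".join(
        f"Table: {name}, Columns: {', '.join(cols)}\n" for name, cols in tables.items()
    )
    shots = "".join(f"NLQ: {ex['NLQ']}\nKQL: {ex['KQL']}\n" for ex in examples)
    filled = template.replace("{{SCHEMA_PLACEHOLDER}}", schema)
    filled = filled.replace("{{VALUES_PLACEHOLDER}}", str(values))
    filled = filled.replace("{{EXAMPLES_PLACEHOLDER}}", shots)
    return filled.replace("{{USER_REQUEST_PLACEHOLDER}}", f"NLQ: {user_query}\nKQL:")


# Hypothetical template; the real one lives in prompt_template.txt.
toy_template = (
    "Schema:\n{{SCHEMA_PLACEHOLDER}}"
    "Values: {{VALUES_PLACEHOLDER}}\n"
    "Examples:\n{{EXAMPLES_PLACEHOLDER}}"
    "{{USER_REQUEST_PLACEHOLDER}}"
)

prompt = build_prompt(
    toy_template,
    tables={"EmailEvents": ["Timestamp", "SenderFromDomain"]},
    examples=[{"NLQ": "Count emails", "KQL": "EmailEvents | count"}],
    values=["example-value"],
    user_query="Show emails from the last hour",
)
print(prompt)
```

Keeping the assembly in one function makes it easy to check that no placeholder is left unfilled before the prompt is sent to the model.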

### Step #6: Latency Output

The cell below prints the average latency per query, in seconds.

In [None]:
sum(elapsed_time)/len(elapsed_time)
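
The mean alone can hide slow outliers, so it may help to look at the median and maximum as well. A small sketch with toy numbers; `summarize_latency` is a hypothetical helper, and in the notebook it would be called on `elapsed_time`.

```python
import statistics


def summarize_latency(samples):
    """Return mean, median, and maximum latency in seconds."""
    if not samples:
        raise ValueError("no latency samples recorded")
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


toy_times = [1.0, 2.0, 6.0]
summary = summarize_latency(toy_times)
print(summary)
```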

### Step #7: Cost Analysis

The following cell estimates the total cost of running the model on all queries, averaged over the number of iterations. You will need to change the input and output token costs to match the model you are testing.

In [None]:
# NOTE: You need to update the costs here:
llm_input_cost_per_million = 0.10
llm_output_cost_per_million = 0.40

input_tokens = 0
output_tokens = 0

if mode == 'openai':
    encoding = tiktoken.encoding_for_model(model_name)

    for entry in inputs:
        input_tokens = input_tokens + len(encoding.encode(entry))
    
    for entry in df['LLM-KQL']:
        output_tokens = output_tokens + len(encoding.encode(entry))

    cost = ((llm_input_cost_per_million * input_tokens)/1000000) + ((llm_output_cost_per_million * output_tokens)/1000000)
    avg_cost = cost/iterations
    
    print(f"Average Total Cost: ${avg_cost}")

elif mode == 'genai-client':

    for entry in inputs:
        input_tokens = input_tokens + client.models.count_tokens(model=model_name, contents=str(entry)).total_tokens
    
    for entry in df['LLM-KQL']:
        output_tokens = output_tokens + client.models.count_tokens(model=model_name, contents=str(entry)).total_tokens

    cost = ((llm_input_cost_per_million * input_tokens)/1000000) + ((llm_output_cost_per_million * output_tokens)/1000000)
    avg_cost = cost/iterations
    
    print(f"Average Total Cost: ${avg_cost}")
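
The cost arithmetic in the cell above is the same in both modes: token counts are multiplied by the per-million-token prices and divided by one million. As a standalone sketch with made-up token counts (`estimate_cost` is a hypothetical helper name):

```python
def estimate_cost(input_tokens, output_tokens, input_per_million, output_per_million):
    """Cost in dollars for the given token counts and per-million-token prices."""
    total = input_tokens * input_per_million + output_tokens * output_per_million
    return total / 1_000_000


# Worked example: 2M input tokens and 0.5M output tokens at the placeholder prices above.
cost_example = estimate_cost(2_000_000, 500_000, 0.10, 0.40)
print(f"${cost_example:.2f}")
```

Dividing the result by `iterations` gives the average cost of one full pass over the queries, as in the cell above.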

### Step #8: Cleaning + Query Refiner

Next we must feed the results through the Query Refiner. In order to do this, we must clean up the results to **only include** the first provided KQL query. Please note that manual revision to remove all extra results might also be needed, and additional regex logic may need to be edited/added depending on the model that you are testing.

In [None]:
extracted_results = []

for idx, row in df.iterrows():
    # Non-greedy match so that only the first fenced KQL block is captured:
    match = re.search(r'(?:~~~|```)(?:kusto|kql)(.*?)(?:~~~|```)', row['LLM-KQL'], flags=re.DOTALL)
    # Fall back to the raw response when no fenced block is found:
    extracted_results.append(match.group(1) if match else row['LLM-KQL'])

df['LLM-KQL-Extracted'] = extracted_results

# Save results to .csv in "temp" folder, you can rename this folder as you wish:
revised_model_name = model_name
os.makedirs('temp', exist_ok = True)

baseline = []
for i in range(0,iterations):
    baseline = baseline + baselines

df['baseline'] = baseline
df.to_csv(f'temp/{revised_model_name}-cleaned.csv')

Before running the query parser, you will need to manually revise the KQL queries so that each entry contains only one KQL query. This is necessary because LLMs/SLMs tend to hallucinate their own Natural Language Queries (NLQs) together with KQL responses to them. Although we have tried to mitigate this to the best of our ability, the mitigation is not perfect. **Please** review the results in the .csv before going forward to ensure that only one KQL query is provided for each entry.
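
To help prioritize that manual review, one option is to count how many fenced blocks each response contains and look first at rows with more than one. The sketch below is an illustration under the assumption that queries arrive in fenced blocks optionally tagged `kql` or `kusto`; `first_kql_block` is a hypothetical helper, not part of the Query Refiner.

```python
import re

BACKTICKS = "`" * 3  # built this way to avoid a literal fence inside this example
FENCE = rf"(?:~~~|{BACKTICKS})"
BLOCK_PATTERN = re.compile(FENCE + r"(?:kusto|kql)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def first_kql_block(text):
    """Return (first fenced query or the raw text, number of fenced blocks found)."""
    blocks = BLOCK_PATTERN.findall(text)
    if not blocks:
        return text.strip(), 0
    return blocks[0].strip(), len(blocks)


# A response in which the model invented a second question and answer.
sample = (
    f"{BACKTICKS}kql\nEmailEvents | count\n{BACKTICKS}\n"
    "NLQ: another question\n"
    f"{BACKTICKS}kql\nDeviceEvents | take 5\n{BACKTICKS}"
)
query, n_blocks = first_kql_block(sample)
print(query, n_blocks)
```

Rows where the block count is greater than one are the most likely to need manual cleanup in the .csv.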

In [None]:
# DO NOT CHANGE THE FOLLOWING VARIABLE:
runner = "../Query_Refiner/query_parser_runner.py"

# CHANGE THE FOLLOWING VARIABLES:

# This should point to where your cleaned .csv is currently stored (entire path must be specified):
file_of_interest = f'temp/{revised_model_name}-cleaned.csv'

# This should point to where you would like to store your results (entire path must be specified):
stored_folder = "temp"

!python {runner} {file_of_interest} {revised_model_name} {stored_folder}

### Step #9: Metrics

The following cell computes all metrics for the results that you have just generated. Please note that the full path **must** be specified for ```file_of_interest``` and ```folder```. If you need help finding the full path, use the ```!pwd``` command in a notebook cell.
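
As an alternative to copying the output of ```!pwd``` by hand, a relative path can be resolved to a full path with `pathlib`. A minimal sketch; `to_absolute` is a hypothetical helper and `"temp"` is the results folder used earlier in this notebook.

```python
from pathlib import Path


def to_absolute(path_str):
    """Resolve a relative path against the current working directory."""
    return str(Path(path_str).expanduser().resolve())


abs_results = to_absolute("temp")
print(abs_results)
```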

In [None]:
# DO NOT CHANGE THE FOLLOWING VARIABLE:
runner = "../offline_metrics_pipeline/offline-metrics-pipeline/offline_metrics_runner.py"

# CHANGE THE FOLLOWING VARIABLES:

# This should point to where your .yaml is currently stored (entire path must be specified):
file_of_interest = ""

# These should point to where you would like to store your results (entire path must be specified):
folder = ""
results_file = "testing.csv"

!python {runner} {file_of_interest} {folder} {results_file}