# Refactored code for
* Setting up and running Ollama in Kaggle
* Downloading THUIAR dataset
* Zero-Shot Prompt
* Use LLM to classify intent from an input 'question' dataset
* To configure your file/folder paths, LLM, dataset, start_index and end_index for each run, please update the config.py file

This notebook will also be used as the base to test any fixes to the LLM intent classification pipeline.
* 2025.05.26: Changed the results output file from JSON to Pickle to store a list of dictionaries, where each dictionary holds the results for one record. Lists of dictionaries can be downloaded from multiple notebooks, then concatenated for analysis
* 2025.05.30: Update prompt and bulletpts_intent.
  * Check if dataset contains 'oos' (out of scope) category
  * If dataset has no 'oos' (out of scope) category, turn 1 category into 'oos'. Use updated categories in bulletpts_intent. Also update prompt instructions on when to classify an example as 'oos'
  * **This force_oos fix is implemented in [notebook 01E](https://www.kaggle.com/code/kaiquanmah/01e-kaggle-ollama-llama3-2-w-force-oos?scriptVersionId=242648764)**
* 2025.05.30: Add pydantic schema with enums
  * From an analysis of errors, the model previously had a 45% average accuracy rate across categories. The model predicted a set of categories outside of what we gave it in 'bulletpts_intent'
  * To fix this, we will try to implement a pydantic schema solution for the model to only predict categories from the allowed list of categories ('bulletpts_intent')
* 2025.05.30: Set Ollama chat temperature to 0
  * Previously, we used the default temperature of 0.8, which might have caused the model to predict categories we did not provide to it ([Reading](https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html))
  * **The pydantic schema and temperature fixes are implemented in [notebook 01F](https://www.kaggle.com/code/kaiquanmah/01f-kaggle-ollama-llama3-2-w-pydantic-schema)**
* 2025.06.03:
  1. **Remove 'oos' from `bulletpts_intent` input into prompt**, to be consistent with the team's approach when exploring embedding approaches to classify 'oos' examples. **Keep 'oos' in pydantic enums/Literal (for LLM to output 'oos' as an allowed class value)**
  2. **Remove 0.99 when defining the prompt format - to avoid anchoring LLM on outputting confidence of 0.99**
  3. **Added ability for user to define which classes are 'oos'**
  * **These 3 fixes are in [notebook 01G](https://www.kaggle.com/code/kaiquanmah/01g-kaggle-ollama-llama3-2-oos-update)**
* 2025.06.10:
  * From an error analysis earlier, **models can get confused between similar intent classes**
  * Therefore **we will analyse similar intent classes/labels -> get their indexes -> put them into 'oos' in [notebook 01H](https://www.kaggle.com/code/kaiquanmah/01h1-openintent-ollama-llama3-2-3b-banking77)**
  * **Going from zero-shot prompt previously, to few-shot prompt (with 5 examples) from known intents**. These 5 examples were **non-oos, and misclassified previously**. This 'fix' is in **[notebook 01i](https://www.kaggle.com/code/kaiquanmah/01i1-openintent-ollama-llama3-2-3b-banking77)**
* 2025.06.16:
  * For known intents (ie not in the 'oos' class), give 5 examples each in the few-shot prompt **[notebook 01J](https://www.kaggle.com/code/kaiquanmah/01j1-openintent-ollama-llama3-2-3b-banking77)**
* 2025.06.17:
  * Now we explore how changing the number of known intent classes affects the recall of oos in **[notebook 01K](https://www.kaggle.com/code/kaiquanmah/01k1-openintent-ollama-llama3-2-3b-banking77)**
  * For quick experimentation, we implement (1) fewshot prompt with 1 example per known intent class, (2) changing number of known intent classes in various notebook runs, (3) 100 oos sentences for the model to classify (taking from first class for banking77 and stackoverflow dataset, or the oos class for CLINC150 oos dataset)
    * For (3) - Added 'first_class' variable for each dataset to Config
    * For (3) - Created new fn to filter and keep 100 records from 'first/oos class' to input to the model to classify
* 2025.07.07:
  * Explore free, rate-limited API model (such as Gemini) in **[notebook 01L](https://www.kaggle.com/code/kaiquanmah/01l1-openintent-gemini-banking77-explore)**
  * Added retry for when we exhaust API limits per minute
  * Updated end_index tracking so that it works with both Ollama and Gemini when generating the JSON results file
  * **Explore Qwen model from the Nebius platform**
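
The schema fix from 2025.05.30 and the 'oos' handling from 2025.06.03 can be sketched without any third-party package. The block below is a minimal illustration, not the notebook's implementation: the intent names are hypothetical, and the pipeline itself wraps the same `Literal` in a pydantic model. It shows 'oos' being kept as an allowed output value while being left out of the bullet list that goes into the prompt.

```python
from typing import Literal, get_args

# Hypothetical known intents; the notebook derives the real list from the dataset.
known_intents = ["card_arrival", "exchange_rate", "lost_or_stolen_card"]

# 'oos' stays an allowed output value even though the prompt never lists it.
allowed_categories = tuple(sorted(set(known_intents) | {"oos"}))

# Subscripting Literal with a tuple is the same as listing the values one by one.
CategoryLiteral = Literal[allowed_categories]


def is_allowed(category: str) -> bool:
    """Return True when the predicted category is one of the Literal's values."""
    return category in get_args(CategoryLiteral)


# Bullet list for the prompt, with 'oos' filtered out.
bulletpts_intent = "\n".join(f"- {c}" for c in allowed_categories if c != "oos")

print(get_args(CategoryLiteral))
print(bulletpts_intent)
```

A prediction outside `allowed_categories` fails `is_allowed`, which is the behaviour the schema is meant to enforce on the model output.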

In [15]:
# 1. create dirs if they do not exist
import os
os.makedirs('/kaggle/working/src', exist_ok=True)
os.makedirs('/kaggle/working/prediction', exist_ok=True)

In [16]:
%%writefile /kaggle/working/src/setup_ollama.py
import os
import subprocess
import time
from src.config import Config # absolute import

# 1. Install Ollama (if not already installed)
try:
    # Check if Ollama is already installed
    subprocess.run(["ollama", "--version"], capture_output=True, check=True)
    print("Ollama is already installed.")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("Installing Ollama...")
    subprocess.run("curl -fsSL https://ollama.com/install.sh  | sh", shell=True, check=True)

# 2. Start Ollama server in the background
print("Starting Ollama server...")
process = subprocess.Popen("ollama serve", shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

# Wait for the server to initialize
time.sleep(5)


# 3. Pull the model
model_name = Config.model_name
print(f"Pulling {model_name} model...")
subprocess.run(["ollama", "pull", model_name], check=True)

# 4. Install Python client
subprocess.run(["pip", "install", "ollama"], check=True)

print("Ollama setup complete!")

Overwriting /kaggle/working/src/setup_ollama.py
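
The setup script waits a fixed 5 seconds for the server to start. A possible alternative, sketched below, is to poll the server until it answers. `wait_for_server` is a hypothetical helper that is not part of the notebook, and the address in the usage note assumes Ollama's default host and port.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll `url` until it answers an HTTP request or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet: wait a little and try again.
            time.sleep(interval_s)
    return False
```

Assuming the default address, `wait_for_server("http://127.0.0.1:11434")` could replace the `time.sleep(5)` call, with the model pull running only when it returns `True`.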


In [17]:
%%writefile requirements.txt
pandas
pydantic
# typing # part of the Python standard library, no install needed
huggingface-hub
# google-genai # only used for gemini model
openai # OpenAI-compatible client, used here for the Nebius endpoint
tenacity # for API call retries
# numpy
# enum

Overwriting requirements.txt


In [18]:
%%writefile /kaggle/working/src/__init__.py
# folder for config

Overwriting /kaggle/working/src/__init__.py


In [19]:
%%writefile /kaggle/working/src/config.py
class Config:
    #######################################################
    # working directory for files
    #######################################################
    target_dir = '/kaggle/working/data' # data directory to clone into
    cloned_data_dir = target_dir + '/data'
    prediction_dir = target_dir + '/prediction'
    #######################################################
    # dataset and model
    #######################################################
    dataset_name = 'stackoverflow' # UPDATE options: 'banking', 'stackoverflow', 'oos'
    idx2label_target_dir = '/kaggle/working/idx2label'
    idx2label_filename_hf = 'stackoverflow_idx2label.csv' # UPDATE options: banking77_idx2label.csv, stackoverflow_idx2label.csv, clinc150_oos_idx2label.csv
    fewshot_examples_dir = '/kaggle/working/fewshot'
    fewshot_subdir = '/fewshot-5examples-per-nonoos/'
    fewshot_examples_filename = 'stackoverflow_25perc_oos.txt' # UPDATE options: banking_25perc_oos.txt, stackoverflow_25perc_oos.txt, oos_25perc_oos.txt
    list_oos_idx = [0, 3, 10, 12, 14] # UPDATE gathered from within the team - for reproducible, comparable results with other open intent classification approaches
    model_name = 'Qwen3-30B-A3B' # 'gemma-2-9b-it-fast'
    start_index=0 # eg: 0, 10001, 11851
    end_index=None # eg: 10, 10000, 11850 or None (use end_index=None to process the full dataset)
    log_every_n_examples=10 # 2
    force_oos = True  # NEW: Add flag to force dataset to contain 'oos' class for the last class value (sorted alphabetically), if 'oos' class does not exist in the original dataset
    #######################################################
    # evaluate threshold when 'oos' recall drops
    #######################################################
    filter_oos_qns_only = False # True (when you are testing 'oos' recall threshold), False
    n_oos_qns = 100
    first_class_banking = 'activate_my_card' # following idx2label
    first_class_stackoverflow = 'wordpress' # following idx2label
    first_class_oos = 'oos'
    #######################################################

Overwriting /kaggle/working/src/config.py
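
`start_index` and `end_index` are both inclusive, because the prediction script slices with `df.iloc[start_index:end_index+1]`. The sketch below uses two hypothetical helpers (neither is in the notebook) to show how consecutive runs could be planned without overlapping or skipping rows.

```python
from typing import Iterator, Optional, Tuple


def batch_ranges(n_rows: int, batch_size: int, start: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (start_index, end_index) pairs where end_index is inclusive."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    while start < n_rows:
        end = min(start + batch_size, n_rows) - 1
        yield start, end
        start = end + 1


def rows_in_run(start_index: int, end_index: Optional[int], n_rows: int) -> int:
    """Count the rows selected by iloc[start_index:end_index + 1]; None means 'to the end'."""
    stop = n_rows if end_index is None else min(end_index + 1, n_rows)
    return max(stop - start_index, 0)


for start_index, end_index in batch_ranges(n_rows=25, batch_size=10):
    print(start_index, end_index, rows_in_run(start_index, end_index, 25))
```

Each printed pair can be copied into `Config.start_index` and `Config.end_index` for one run; the next run starts at the previous `end_index + 1`.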


In [20]:
%%writefile download_dataset.py
from src.config import Config
import os
import subprocess
target_dir = Config.target_dir # data directory to clone into
cloned_data_dir = Config.cloned_data_dir

# Create target directory if it doesn't exist
os.makedirs(target_dir, exist_ok=True)

# do not clone dataset repo if cloned data folder exists
if os.path.exists(cloned_data_dir):
    print("Dataset has already been downloaded. If this is incorrect, please delete the Adaptive-Decision-Boundary 'data' folder.")
else:
    # Clone the repository
    # check=True raises an error if the clone fails, instead of continuing silently
    subprocess.run(["git",
                    "clone",
                    "https://github.com/thuiar/Adaptive-Decision-Boundary.git",
                    target_dir
                   ], check=True)

Overwriting download_dataset.py


In [21]:
%%writefile predict_class.py
from src.config import Config
import pandas as pd
import os
# import ollama
import json
import pickle
import time
from pydantic import BaseModel
from typing import Literal
# from enum import Enum
from huggingface_hub import snapshot_download
    
###################
# Gemini API
###################
# from google import genai
# from google.genai.types import ThinkingConfig
# from google.api_core import retry
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_fixed
from kaggle_secrets import UserSecretsClient


###################


# Config.target_dir
# Config.cloned_data_dir
# Config.dataset_name
# Config.model_name
# Config.start_index
# Config.end_index
# Config.log_every_n_examples


#######################
# load data
#######################
def load_data(data_dir):
    """Loads train, dev, and test datasets from a specified directory."""

    main_df = pd.DataFrame()
    for split in ['train', 'dev', 'test']:
        file_path = os.path.join(data_dir, f'{split}.tsv')
        if os.path.exists(file_path):
          try:
            df = pd.read_csv(file_path, sep='\t')
            df['dataset'] = os.path.basename(data_dir)
            df['split'] = split
            main_df = pd.concat([main_df, df], ignore_index=True)
          except pd.errors.ParserError as e:
            print(f"Error parsing {file_path}: {e}")
            # Handle the error appropriately, e.g., skip the file, log the error, etc.
        else:
            print(f"Warning: {split}.tsv not found in {data_dir}")
    return main_df


def filter100examples_oos(dataset_name, df):
    # do not restrict the model input to 'oos' questions only
    if not Config.filter_oos_qns_only:
        filtered_df = df
    # vs
    # input 'only oos qns to model'
    else:
        if dataset_name == 'banking':
            first_class = Config.first_class_banking
        elif dataset_name == 'stackoverflow':
            first_class = Config.first_class_stackoverflow
        else:
            first_class = Config.first_class_oos
    
        filtered_df = df.copy()
        filtered_df = filtered_df.loc[filtered_df["label"] == first_class]
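        # cap n at the number of available rows so sample() cannot raise on small classes
        filtered_df = filtered_df.sample(n=min(Config.n_oos_qns, len(filtered_df)), random_state=38)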
    return filtered_df


df = pd.DataFrame()

data_dir = os.path.join(Config.cloned_data_dir, Config.dataset_name)
if os.path.exists(data_dir):
  df = load_data(data_dir)
  print(f"Loaded dataset into dataframe: {Config.dataset_name}")
  print(f"Dimensions: {df.shape}")
  print(f"Col names: {df.columns}")
else:
  print(f"Warning: Directory {data_dir} not found.")
#######################



#######################
# unique intents
#######################
sorted_intent = list(sorted(df.label.unique()))
print("="*80)
print(f"Original dataset intents: {sorted_intent}")
print(f"Number of original intents: {len(sorted_intent)}\n")


# 2025.06.03
# New OOS approach - get 25/50/75% of class indexes for each dataset within the team (for reproducibility and comparable results)
# Change their class labels to 'oos'
snapshot_download(repo_id="KaiquanMah/open-intent-query-classification", repo_type="space", allow_patterns="*_idx2label.csv", local_dir=Config.idx2label_target_dir)
idx2label_filepath = Config.idx2label_target_dir + '/dataset_idx2label/' + Config.idx2label_filename_hf
idx2label = pd.read_csv(idx2label_filepath)
idx2label_oos = idx2label[idx2label.index.isin(Config.list_oos_idx)]
idx2label_oos.reset_index(drop=True, inplace=True)

# 2025.06.17 keep track of non-oos labels, to use in IntentSchema
# list_oos_idx holds row indexes, so filter on the index (not on the label column)
nonoos_labels = idx2label[~idx2label.index.isin(Config.list_oos_idx)]['label'].values
print("="*80)
print("Original intents to convert to OOS class")
print(idx2label_oos)
print(f"Percentage of original intents to convert to OOS class: {len(idx2label_oos)/len(idx2label)}\n")

oos_labels = idx2label_oos['label'].values
list_sorted_intent_aft_conversion = ['oos' if intent.lower() in oos_labels else intent for intent in sorted_intent]
list_sorted_intent_aft_conversion_deduped = sorted(set(list_sorted_intent_aft_conversion))
print("="*80)
print("Unique intents after converting some to OOS class")
print(list_sorted_intent_aft_conversion_deduped)
print(f"Number of unique intents after converting some to OOS class: {len(list_sorted_intent_aft_conversion_deduped)}\n")



# unique intents - from set to bullet points (to use in prompts)
# bulletpts_intent = "\n".join(f"- {category}" for category in set_intent)
# 2025.06.03: do not show 'oos' in the prompt (to avoid leakage of 'oos' class)
bulletpts_intent = "\n".join(f"- {category}" for category in list_sorted_intent_aft_conversion_deduped if category and (category!='oos'))

# 2025.06.04: fix adjustment if 'oos' is already in the original dataset
int_oos_in_orig_dataset = int('oos' in idx2label.label.values)
adjust_if_oos_not_in_orig_dataset = 1 - int_oos_in_orig_dataset  # 0 if 'oos' already exists, else 1

print("="*80)
print("sanity check")
print(f"Number of original intents: {len(sorted_intent)}")
print(f"Number of original intents + 1 OOS class (if doesnt exist in original dataset): {len(sorted_intent) + adjust_if_oos_not_in_orig_dataset}")
print(f"Number of original intents to convert to OOS class: {len(idx2label_oos)}")
print(f"Percentage of original intents to convert to OOS class: {len(idx2label_oos)/len(idx2label)}")
print(f"Number of unique intents after converting some to OOS class: {len(list_sorted_intent_aft_conversion_deduped)}")
print(f"Number of original intents + 1 OOS class (if doesnt exist in original dataset) - converted classes: {len(sorted_intent) + adjust_if_oos_not_in_orig_dataset - len(idx2label_oos)}")
print(f"Numbers match: {(len(sorted_intent) + adjust_if_oos_not_in_orig_dataset - len(idx2label_oos)) == len(list_sorted_intent_aft_conversion_deduped)}")
print("Prepared unique intents")
#######################




#######################
# Enforce schema on the model (e.g. allowed list of predicted categories)
#######################

class IntentSchema(BaseModel):
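    # dynamically build the Literal from the list of categories for different dataset(s)
    # (subscripting with a tuple avoids the star-unpacking syntax, which needs Python 3.11+)
    category: Literal[tuple(list_sorted_intent_aft_conversion_deduped)]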
    confidence: float
    
#######################




#######################
# filter after preparing intents
#######################
df = filter100examples_oos(Config.dataset_name, df)
print("Filtered dataset")
print(f"Dimensions: {df.shape}")
print(f"Col names: {df.columns}")
#######################



#######################
# Prompt
#######################
# prompt 2 with less information/compute, improve efficiency
# 2025.06.10 prompt 3 with 5 few shot examples only - notebook 01H1, 01i1
# 2025.06.16 prompt 4 with 5 examples per each known intent (ie non-oos intent) - notebook 01J1
snapshot_download(repo_id="KaiquanMah/open-intent-query-classification", repo_type="space", allow_patterns="*.txt", local_dir=Config.fewshot_examples_dir)
with open(Config.fewshot_examples_dir + Config.fewshot_subdir + Config.fewshot_examples_filename, 'r') as file:
    fewshot_examples = file.read()

def get_prompt(dataset_name, split, question, categories, fewshot_examples):
    
    prompt = f'''
You are an expert in understanding and identifying what users are asking you.

Your task is to analyze an input query from a user and assign the most appropriate category from the following list:
{categories}

Only classify as "oos" (out of scope category) if none of the other categories apply.

Below are several examples to guide your classification:

---
{fewshot_examples}
---

===============================

New Question: {question}

===============================

Provide your final classification in **valid JSON format** with the following structure:
{{
  "category": "your_chosen_category_name",
  "confidence": confidence_level_rounded_to_the_nearest_2_decimal_places
}}


Ensure the JSON has:
- Opening and closing curly braces
- Double quotes around keys and string values
- Confidence as a number (not a string), with maximum 2 decimal places

Do not include any explanations or extra text.
            '''
    return prompt



#######################


#######################
# Model on 1 Dataset
#######################
# Save a list of dictionaries 
# containing a dictionary for each record's
# - predicted category
# - confidence level and
# - original dataframe values


# Nebius API client (OpenAI-compatible endpoint)
user_secrets = UserSecretsClient()
NEBIUS_API_KEY = user_secrets.get_secret("NEBIUS_API_KEY")
client = OpenAI(base_url="https://api.studio.nebius.com/v1/",
                api_key = NEBIUS_API_KEY)

@retry(stop=stop_after_attempt(3), wait=wait_fixed(30))
def api_llm(client, prompt):
    try:
        print("CHECKPOINT_3A")
        # gemini_config = {"temperature": 0,
        #                  "response_mime_type": "application/json",
        #                  "response_schema": IntentSchema.model_json_schema(),
        #                  "seed": 38,
        #                  # # added for "gemini-2.5-flash-lite-preview-06-17" model
        #                  # "thinking_config": ThinkingConfig(thinking_budget=-1, 
        #                  #                    include_thoughts=True)
        #                 }
        response = client.beta.chat.completions.parse(model = 'Qwen/'+Config.model_name,
                                                      messages = [{"role": "user",
                                                                  "content": prompt}],
                                                      response_format = IntentSchema,
                                                      seed = 38,
                                                      temperature = 0
                                                      )
        # print(response)
        # msg = response.parsed
        response = response.choices[0].message.content
        print("CHECKPOINT_3B")
        return response
    except Exception as e:  # bind the exception so the logging below can use it
        print(f"CHECKPOINT_4A: Exception Type: {type(e).__name__}")
        print(f"CHECKPOINT_4A: Exception Message: {str(e)}")
        
        # Gemini-specific errors
        if hasattr(e, 'code'):
            print(f"CHECKPOINT_4A: Status Code: {e.code}")
        if hasattr(e, 'details'):
            print(f"CHECKPOINT_4A: Details: {e.details}")
        
        # raise the exception again so retry can work
        raise

    

def predict_intent(model_name, df, categories, start_index=0, end_index=None, log_every_n_examples=100):
    start_time = time.time()
    results = []  # Store processed results
    
    # Slice DataFrame based on start/end indices
    if end_index is None:
        subset_df = df.iloc[start_index:]
    else:
        subset_df = df.iloc[start_index:end_index+1]
    
    total_rows = len(subset_df)
    subset_row_count = 0

    

    
    
    for row in subset_df.itertuples():
        subset_row_count+=1
        prompt = get_prompt(row.dataset, row.split, row.text, categories, fewshot_examples)
        if subset_row_count == 1:
            print("Example of how prompt looks, for the 1st example in this subset of data")
            print(prompt)

            print("Example of how IntentSchema looks")
            print(IntentSchema.model_json_schema())
        
        
        try:
            print("CHECKPOINT_1A")
            
            # response = ollama.chat(model=model_name, 
            #                        messages=[
            #                                     {'role': 'user', 'content': prompt}
            #                                 ],
            #                        format = IntentSchema.model_json_schema(),
            #                        options = {'temperature': 0},  # Set temperature to 0 for a more deterministic output
            #                       )
            # msg = response['message']['content']
            # parsed = json.loads(msg)
            
            response = api_llm(client, prompt)
            print("CHECKPOINT_1B")
            # api_llm returns the message content as a JSON string, so parse it directly
            parsed = json.loads(response)
            # parsed = response.parsed
            print("CHECKPOINT_1C")
                        
            # Safely extract keys with defaults - resolve parsing error
            # maybe LLM did not output a particular key-value pair
            category = parsed.get('category', 'error')
            confidence = parsed.get('confidence', 0.0)
            parsed = {'category': category, 'confidence': confidence}
        except Exception as e:  # also covers json.JSONDecodeError and KeyError
            print(f"CHECKPOINT_2A: Exception Type: {type(e).__name__}")
            print(f"CHECKPOINT_2A: Exception Message: {str(e)}")
            
            # Gemini-specific errors
            if hasattr(e, 'code'):
                print(f"CHECKPOINT_2A: Status Code: {e.code}")
            if hasattr(e, 'details'):
                print(f"CHECKPOINT_2A: Details: {e.details}")
                
            parsed = {'category': 'error', 'confidence': 0.0}
        
        # Combine original row data with predictions
        results.append({
            "Index": row.Index,
            "text": row.text,
            "label": row.label,
            "dataset": row.dataset,
            "split": row.split,
            "predicted": parsed['category'],
            "confidence": parsed['confidence']
        })

        
        # Log progress
        if subset_row_count % log_every_n_examples == 0:
            elapsed_time = time.time() - start_time
            
            avg_time_per_row = elapsed_time / subset_row_count
            remaining_rows = total_rows - subset_row_count
            eta = avg_time_per_row * remaining_rows
            
            print(f"Processed original df idx {row.Index} (subset row {subset_row_count}) | "
                  f"Elapsed: {elapsed_time:.2f}s | ETA: {eta:.2f}s")
    
    return results  # Return list of dictionaries
    

print(f"Starting intent classification using {Config.model_name}")
subset_results = predict_intent(Config.model_name, 
                                df, 
                                bulletpts_intent, 
                                start_index = Config.start_index, 
                                end_index = Config.end_index,
                                log_every_n_examples = Config.log_every_n_examples)



# # previously for Ollama
# # update end_index for filename (if None is used for the end of the df)
# # Get the last index of the DataFrame
# last_index = df.index[-1] 
# # Use last index if Config.end_index is None
# end_index = Config.end_index if Config.end_index is not None else last_index
# 2025.07.07
# now for Ollama AND Gemini
# Gemini - needs to track 'end_index' for API JSON exports (when daily limits are exhausted)
# Ollama - reuse this code
# default avoids a ValueError when no rows were processed
end_index = max((r['Index'] for r in subset_results), default=Config.start_index)



# 2025.05.23 changed from JSON to PKL
# because we are saving list of dictionaries
# Save to PKL
# 2025.06.04 explore changing back to JSON
# with open(f'results_{Config.model_name}_{Config.dataset_name}_{Config.start_index}_{end_index}.pkl', 'wb') as f:
#     pickle.dump(subset_results, f)
with open(f'results_{Config.model_name}_{Config.dataset_name}_{Config.start_index}_{end_index}.json', 'w') as f:
    json.dump(subset_results, f, indent=2)

print("Completed intent classification")


#######################


Overwriting predict_class.py
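
The script retries a failed API call a fixed number of times, then parses the returned JSON string and falls back to an 'error' record when parsing fails. The block below is a stdlib-only sketch of that flow under stated assumptions: `call_with_retry` and `parse_prediction` are hypothetical names, and `call_with_retry` is only a stand-in for the `tenacity` decorator used in the script, not a copy of its behaviour.

```python
import json
import time
from typing import Any, Callable, Dict


def call_with_retry(fn: Callable[[], Any], attempts: int = 3, wait_s: float = 30.0) -> Any:
    """Call fn(); on an exception wait `wait_s` seconds and retry, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            print(f"Attempt {attempt}/{attempts} failed: {type(exc).__name__}: {exc}")
            if attempt == attempts:
                raise  # out of attempts: let the caller record an 'error' row
            time.sleep(wait_s)


def parse_prediction(raw: Any) -> Dict[str, Any]:
    """Turn the model's JSON string into a record, falling back to an 'error' row."""
    fallback = {"category": "error", "confidence": 0.0}
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(parsed, dict):
        return fallback
    # The model may omit a key, so read both with defaults.
    return {
        "category": parsed.get("category", "error"),
        "confidence": parsed.get("confidence", 0.0),
    }
```

Keeping the parsing in its own function makes it easy to check that a plain JSON string, which is what `api_llm` returns, goes through `json.loads` without attribute access on the response.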


In [22]:
%%writefile /kaggle/working/main.py
import subprocess
import sys
from src.config import Config


# 1. Install libraries from requirements.txt
print("Installing dependencies...")
subprocess.run([sys.executable, "-m", "pip", "install", "-r", "/kaggle/working/requirements.txt"], check=True)


# # 2. Run setup_ollama.py
# if 'gemini' not in Config.model_name:
#     print("Starting Ollama setup...")
#     # subprocess.run(["python3", "/kaggle/working/src/setup_ollama.py"], check=True)
#     print("Starting Ollama setup...")
#     subprocess.run(
#         ["python3", "-m", "src.setup_ollama"],  # Run as a module
#         cwd="/kaggle/working",  # Set working directory to parent of 'src'
#         check=True
#     )
    

# 3. Run download_dataset.py
print("Downloading dataset...")
subprocess.run(["python3", "/kaggle/working/download_dataset.py"], check=True)

# 4. Run predict_class.py
print("Running prediction script...")
subprocess.run(["python3", "/kaggle/working/predict_class.py"], check=True)

Overwriting /kaggle/working/main.py


# Model on subset of examples

In [None]:
!python3 /kaggle/working/main.py

# Sanity check folders

In [None]:
!cd /kaggle/working/ && ls -la

In [None]:
!cd /kaggle/working/src && ls -la

In [None]:
!cd /kaggle/working/data/data && ls -la

# idx2label_oos examples

In [None]:
%pip install huggingface-hub

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="KaiquanMah/open-intent-query-classification", repo_type="space", allow_patterns="*_idx2label.csv", local_dir='/kaggle/working/idx2label')

In [None]:
import pandas as pd
idx2label = pd.read_csv('/kaggle/working/idx2label/dataset_idx2label/banking77_idx2label.csv')
idx2label

In [None]:
idx2label_oos = idx2label[idx2label.index.isin([31,32,33,36])]
idx2label_oos

In [None]:
print(idx2label_oos)

In [None]:
idx2label_oos.shape

In [None]:
# percentage of OOS classes over ALL classes in the dataset
len(idx2label_oos)/len(idx2label)
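
The index-based conversion explored in the cells above can be sketched in plain Python. The labels and indexes below are toy values chosen for illustration, not the team's agreed list, and `relabel_as_oos` is a hypothetical helper.

```python
from typing import List, Sequence


def relabel_as_oos(idx2label: Sequence[str], oos_idx: Sequence[int], labels: Sequence[str]) -> List[str]:
    """Replace every label whose idx2label position is in `oos_idx` with 'oos'."""
    oos_names = {idx2label[i] for i in oos_idx}
    return ["oos" if label in oos_names else label for label in labels]


# Toy data for illustration only.
idx2label = ["wordpress", "oracle", "svn", "apache", "excel"]
list_oos_idx = [0, 3]
labels = ["excel", "wordpress", "svn", "apache"]

relabelled = relabel_as_oos(idx2label, list_oos_idx, labels)
oos_share = len(list_oos_idx) / len(idx2label)

print(relabelled)
print(f"Share of classes converted to 'oos': {oos_share:.2f}")
```

The selection is done on positions in `idx2label`, not on label text, which is why the same index list gives comparable results across approaches that share the same idx2label file.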

# Stitch Nebius Qwen API Batch Results for Stackoverflow
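
One way to stitch the per-run files is sketched below. It assumes the `results_{model}_{dataset}_{start}_{end}.json` naming used by the prediction script. `stitch_results` is a hypothetical helper, and the rule of preferring a real prediction over an earlier 'error' row for the same `Index` is an assumption, not something the notebook specifies.

```python
import glob
import json
from typing import Any, Dict, List


def stitch_results(pattern: str) -> List[Dict[str, Any]]:
    """Merge per-run JSON result files into one list with one record per original row."""
    merged: Dict[int, Dict[str, Any]] = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            for record in json.load(f):
                previous = merged.get(record["Index"])
                # Keep the first real prediction; replace only missing or 'error' rows.
                if previous is None or previous["predicted"] == "error":
                    merged[record["Index"]] = record
    # Return the records ordered by their original dataframe index.
    return [merged[i] for i in sorted(merged)]
```

A call such as `stitch_results("results_Qwen3-30B-A3B_stackoverflow_*.json")` would then return a single list that can be dumped to one JSON file or loaded into a dataframe for analysis.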

In [23]:
import subprocess
import sys
from src.config import Config


# 1. Install libraries from requirements.txt
print("Installing dependencies...")
subprocess.run([sys.executable, "-m", "pip", "install", "-r", "/kaggle/working/requirements.txt"], check=True)


# # 2. Run setup_ollama.py
# if 'gemini' not in Config.model_name:
#     print("Starting Ollama setup...")
#     # subprocess.run(["python3", "/kaggle/working/src/setup_ollama.py"], check=True)
#     print("Starting Ollama setup...")
#     subprocess.run(
#         ["python3", "-m", "src.setup_ollama"],  # Run as a module
#         cwd="/kaggle/working",  # Set working directory to parent of 'src'
#         check=True
#     )
    

# 3. Run download_dataset.py
print("Downloading dataset...")
subprocess.run(["python3", "/kaggle/working/download_dataset.py"], check=True)

Installing dependencies...
Downloading dataset...
Dataset has already been downloaded. If this is incorrect, please delete the Adaptive-Decision-Boundary 'data' folder.


CompletedProcess(args=['python3', '/kaggle/working/download_dataset.py'], returncode=0)

In [24]:
from src.config import Config
import pandas as pd
import os
# import ollama
import json
import pickle
import time
from pydantic import BaseModel
from typing import Literal
# from enum import Enum
from huggingface_hub import snapshot_download
    
###################
# Gemini API
###################
# from google import genai
# from google.genai.types import ThinkingConfig
# from google.api_core import retry
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_fixed
from kaggle_secrets import UserSecretsClient


###################


# Config.target_dir
# Config.cloned_data_dir
# Config.dataset_name
# Config.model_name
# Config.start_index
# Config.end_index
# Config.log_every_n_examples


#######################
# load data
#######################
def load_data(data_dir):
    """Loads train, dev, and test datasets from a specified directory."""

    main_df = pd.DataFrame()
    for split in ['train', 'dev', 'test']:
        file_path = os.path.join(data_dir, f'{split}.tsv')
        if os.path.exists(file_path):
          try:
            df = pd.read_csv(file_path, sep='\t')
            df['dataset'] = os.path.basename(data_dir)
            df['split'] = split
            main_df = pd.concat([main_df, df], ignore_index=True)
          except pd.errors.ParserError as e:
            print(f"Error parsing {file_path}: {e}")
            # Handle the error appropriately, e.g., skip the file, log the error, etc.
        else:
            print(f"Warning: {split}.tsv not found in {data_dir}")
    return main_df


def filter100examples_oos(dataset_name, df):
    # do not restrict the model input to 'oos' questions only
    if not Config.filter_oos_qns_only:
        filtered_df = df
    # vs
    # input 'only oos qns to model'
    else:
        if dataset_name == 'banking':
            first_class = Config.first_class_banking
        elif dataset_name == 'stackoverflow':
            first_class = Config.first_class_stackoverflow
        else:
            first_class = Config.first_class_oos
    
        filtered_df = df.copy()
        filtered_df = filtered_df.loc[filtered_df["label"] == first_class]
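        # cap n at the number of available rows so sample() cannot raise on small classes
        filtered_df = filtered_df.sample(n=min(Config.n_oos_qns, len(filtered_df)), random_state=38)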
    return filtered_df


df = pd.DataFrame()

data_dir = os.path.join(Config.cloned_data_dir, Config.dataset_name)
if os.path.exists(data_dir):
  df = load_data(data_dir)
  print(f"Loaded dataset into dataframe: {Config.dataset_name}")
  print(f"Dimensions: {df.shape}")
  print(f"Col names: {df.columns}")
else:
  print(f"Warning: Directory {data_dir} not found.")
#######################



#######################
# unique intents
#######################
sorted_intent = list(sorted(df.label.unique()))
print("="*80)
print(f"Original dataset intents: {sorted_intent}")
print(f"Number of original intents: {len(sorted_intent)}\n")


# 2025.06.03
# New OOS approach - get 25/50/75% of class indexes for each dataset within the team (for reproducibility and comparable results)
# Change their class labels to 'oos'
snapshot_download(repo_id="KaiquanMah/open-intent-query-classification", repo_type="space", allow_patterns="*_idx2label.csv", local_dir=Config.idx2label_target_dir)
idx2label_filepath = Config.idx2label_target_dir + '/dataset_idx2label/' + Config.idx2label_filename_hf
idx2label = pd.read_csv(idx2label_filepath)
idx2label_oos = idx2label[idx2label.index.isin(Config.list_oos_idx)]
idx2label_oos.reset_index(drop=True, inplace=True)

# 2025.06.17 keep track of non-oos labels, to use in IntentSchema
# list_oos_idx holds row indexes, so filter on the index (not on the label column)
nonoos_labels = idx2label[~idx2label.index.isin(Config.list_oos_idx)]['label'].values
print("="*80)
print("Original intents to convert to OOS class")
print(idx2label_oos)
print(f"Percentage of original intents to convert to OOS class: {len(idx2label_oos)/len(idx2label)}\n")

oos_labels = idx2label_oos['label'].values
list_sorted_intent_aft_conversion = ['oos' if intent.lower() in oos_labels else intent for intent in sorted_intent]
list_sorted_intent_aft_conversion_deduped = sorted(set(list_sorted_intent_aft_conversion))
print("="*80)
print("Unique intents after converting some to OOS class")
print(list_sorted_intent_aft_conversion_deduped)
print(f"Number of unique intents after converting some to OOS class: {len(list_sorted_intent_aft_conversion_deduped)}\n")



# unique intents - from set to bullet points (to use in prompts)
# bulletpts_intent = "\n".join(f"- {category}" for category in set_intent)
# 2025.06.03: do not show 'oos' in the prompt (to avoid leakage of 'oos' class)
bulletpts_intent = "\n".join(f"- {category}" for category in list_sorted_intent_aft_conversion_deduped if category and (category!='oos'))

# 2025.06.04: fix adjustment if 'oos' is already in the original dataset
int_oos_in_orig_dataset = int('oos' in idx2label.label.values)
# add 1 class only when the original dataset has no 'oos' label
adjust_if_oos_not_in_orig_dataset = 1 - int_oos_in_orig_dataset

print("="*80)
print("sanity check")
print(f"Number of original intents: {len(sorted_intent)}")
print(f"Number of original intents + 1 OOS class (if doesnt exist in original dataset): {len(sorted_intent) + adjust_if_oos_not_in_orig_dataset}")
print(f"Number of original intents to convert to OOS class: {len(idx2label_oos)}")
print(f"Percentage of original intents to convert to OOS class: {len(idx2label_oos)/len(idx2label)}")
print(f"Number of unique intents after converting some to OOS class: {len(list_sorted_intent_aft_conversion_deduped)}")
print(f"Number of original intents + 1 OOS class (if doesnt exist in original dataset) - converted classes: {len(sorted_intent) + adjust_if_oos_not_in_orig_dataset - len(idx2label_oos)}")
print(f"Numbers match: {(len(sorted_intent) + adjust_if_oos_not_in_orig_dataset - len(idx2label_oos)) == len(list_sorted_intent_aft_conversion_deduped)}")
print("Prepared unique intents")
#######################




#######################
# Enforce schema on the model (e.g. allowed list of predicted categories)
#######################

class IntentSchema(BaseModel):
    # dynamically build the allowed categories for different dataset(s)
    # Literal[tuple(...)] is equivalent to Literal[*...], and also works before Python 3.11
    category: Literal[tuple(list_sorted_intent_aft_conversion_deduped)]
    confidence: float

#######################




#######################
# filter after preparing intents
#######################
df = filter100examples_oos(Config.dataset_name, df)
print("Filtered dataset")
print(f"Dimensions: {df.shape}")
print(f"Col names: {df.columns}")
#######################



#######################
# Prompt
#######################
# prompt 2 with less information/compute, improve efficiency
# 2025.06.10 prompt 3 with 5 few-shot examples only - notebooks 01H1, 01I1
# 2025.06.16 prompt 4 with 5 examples per known intent (i.e. non-oos intent) - notebook 01J1
snapshot_download(repo_id="KaiquanMah/open-intent-query-classification", repo_type="space", allow_patterns="*.txt", local_dir=Config.fewshot_examples_dir)
with open(Config.fewshot_examples_dir + Config.fewshot_subdir + Config.fewshot_examples_filename, 'r') as file:
    fewshot_examples = file.read()

def get_prompt(dataset_name, split, question, categories, fewshot_examples):
    
    prompt = f'''
You are an expert in understanding and identifying what users are asking you.

Your task is to analyze an input query from a user and assign the most appropriate category from the following list:
{categories}

Only classify as "oos" (out of scope category) if none of the other categories apply.

Below are several examples to guide your classification:

---
{fewshot_examples}
---

===============================

New Question: {question}

===============================

Provide your final classification in **valid JSON format** with the following structure:
{{
  "category": "your_chosen_category_name",
  "confidence": confidence_level_rounded_to_the_nearest_2_decimal_places
}}


Ensure the JSON has:
- Opening and closing curly braces
- Double quotes around keys and string values
- Confidence as a number (not a string), with maximum 2 decimal places

Do not include any explanations or extra text.
            '''
    return prompt



#######################



Loaded dataset into dataframe: stackoverflow
Dimensions: (20000, 4)
Col names: Index(['text', 'label', 'dataset', 'split'], dtype='object')
Original dataset intents: ['ajax', 'apache', 'bash', 'cocoa', 'drupal', 'excel', 'haskell', 'hibernate', 'linq', 'magento', 'matlab', 'oracle', 'osx', 'qt', 'scala', 'sharepoint', 'spring', 'svn', 'visual-studio', 'wordpress']
Number of original intents: 20




Original intents to convert to OOS class
   index    label
0      3      svn
1     11   spring
2     15     ajax
3     16       qt
4     17   drupal
5     18     linq
6     20  magento
Percentage of original intents to convert to OOS class: 0.35

Unique intents after converting some to OOS class
['apache', 'bash', 'cocoa', 'excel', 'haskell', 'hibernate', 'matlab', 'oos', 'oracle', 'osx', 'scala', 'sharepoint', 'visual-studio', 'wordpress']
Number of unique intents after converting some to OOS class: 14

sanity check
Number of original intents: 20
Number of original intents + 1 OOS class (if doesnt exist in original dataset): 21
Number of original intents to convert to OOS class: 7
Percentage of original intents to convert to OOS class: 0.35
Number of unique intents after converting some to OOS class: 14
Number of original intents + 1 OOS class (if doesnt exist in original dataset) - converted classes: 14
Numbers match: True
Prepared unique intents
Filtered dataset
Dimensions: (20000, 
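
The sanity check above amounts to 20 original intents + 1 'oos' class - 7 converted intents = 14 classes. The sketch below replays the same relabelling on a small, hypothetical list of intents (the names `original_intents`, `oos_intents`, `relabel_as_oos` and `AllowedCategory` are made up for illustration; the notebook reads its inputs from the idx2label CSV and `Config.list_oos_idx`). It uses only the standard library to show how the deduplicated labels feed both the prompt bullet points and the `Literal` type.

```python
from typing import Literal, get_args

# Hypothetical inputs for illustration only.
original_intents = ["ajax", "apache", "bash", "qt", "scala", "svn"]
oos_intents = {"ajax", "qt", "svn"}


def relabel_as_oos(intents, oos):
    """Map held-out intents to 'oos' and return the sorted, deduplicated labels."""
    return sorted({"oos" if intent.lower() in oos else intent for intent in intents})


labels = relabel_as_oos(original_intents, oos_intents)

# The prompt lists only the known intents; 'oos' is left out to avoid leakage.
bullet_points = "\n".join(f"- {label}" for label in labels if label != "oos")

# 'oos' stays in the Literal so the model is still allowed to output it.
AllowedCategory = Literal[tuple(labels)]

print(labels)
print(bullet_points)
print(get_args(AllowedCategory))
```

Keeping 'oos' out of the bullet points but inside the `Literal` mirrors the 2025.06.03 note: the class is hidden from the prompt's category list yet remains a valid output value.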


In [25]:
len(df)

20000

In [26]:
df

Unnamed: 0,text,label,dataset,split
0,Scala Regex Multiple Block Capturing,scala,stackoverflow,train
1,Use Oracle 6 from ASP.NET application,oracle,stackoverflow,train
2,HQL 1 to many count() question,hibernate,stackoverflow,train
3,scala syntax highlighting in bluefish,scala,stackoverflow,train
4,Weird bindings issue,cocoa,stackoverflow,train
...,...,...,...,...
19995,SharePoint remembering changed password,sharepoint,stackoverflow,test
19996,Magento - Find Out of Stock Products With Inve...,magento,stackoverflow,test
19997,Python OS X 10.5 development environment,osx,stackoverflow,test
19998,Crop & Resize Images in Wordpress,wordpress,stackoverflow,test


In [27]:
######################
# preprocess batch outputs from Nebius
# then export to intermediate.json
######################
import json

def jsonl_to_json_extract_content(jsonl_file_path, json_file_path):
    """
    Reads a JSONL file, extracts LLM message content, and saves to a JSON file.

    Args:
        jsonl_file_path (str): The path to the input JSONL file.
        json_file_path (str): The path to the output JSON file.
    """
    extracted_data = []
    with open(jsonl_file_path, 'r') as infile:
        for line in infile:
            if not line.strip():
                continue  # skip blank lines, e.g. a trailing newline at the end of the file

            data = json.loads(line)
            # custom_id is a string; it is used later as the lookup key for the DataFrame index
            record_dict = {"Index": data["custom_id"]}

            # The message content is itself a JSON string holding "category" and "confidence"
            content = json.loads(data["response"]["choices"][0]["message"]["content"])
            record_dict["predicted"] = content["category"]
            record_dict["confidence"] = content["confidence"]
            extracted_data.append(record_dict)

    with open(json_file_path, 'w') as outfile:
        json.dump(extracted_data, outfile, indent=2)

# Example usage
jsonl_file_path = '/kaggle/input/01l8-openintent-nebiusqwen-stackoverflow-batch/batch_outputs_stackoverflow_0_None.jsonl'
json_file_path = 'batch_outputs_stackoverflow_intermediate.json'
jsonl_to_json_extract_content(jsonl_file_path, json_file_path)

print(f"Extracted content saved to {json_file_path}")

Extracted content saved to batch_outputs_stackoverflow_intermediate.json
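
The cell above assumes every line holds a well-formed response. As a hedged sketch, a per-line parser could keep the record but leave `predicted` and `confidence` as `None` when the message content is missing or is not valid JSON. The helper name `parse_batch_line` and the two sample lines are hypothetical; only the field names (`custom_id`, `response`, `choices`, `message`, `content`, `category`, `confidence`) come from the cell above.

```python
import json


def parse_batch_line(line):
    """Parse one JSONL line into a record dict; return None for a blank line."""
    line = line.strip()
    if not line:
        return None
    data = json.loads(line)
    record = {"Index": data["custom_id"], "predicted": None, "confidence": None}
    try:
        raw = data["response"]["choices"][0]["message"]["content"]
        content = json.loads(raw)
        record["predicted"] = content["category"]
        record["confidence"] = content["confidence"]
    except (KeyError, IndexError, TypeError, json.JSONDecodeError):
        # Keep the record so the failed example can still be traced by its Index.
        pass
    return record


# Hypothetical sample lines: one valid response, one blank line, one non-JSON answer.
good = json.dumps({"custom_id": "0", "response": {"choices": [
    {"message": {"content": '{"category": "scala", "confidence": 0.9}'}}]}})
bad = json.dumps({"custom_id": "1", "response": {"choices": [
    {"message": {"content": "not json"}}]}})

records = [r for r in map(parse_batch_line, [good, "", bad]) if r is not None]
print(records)
```

Whether to keep or drop failed records is a design choice; keeping them makes it easier to count how many examples the model failed to answer in the expected format.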


In [28]:
######################
# import intermediate.json
# then stitch data with original df 
# to trace and retrieve original text qn and label/class
######################
import pandas as pd
import json

# Load JSON data
with open('/kaggle/working/batch_outputs_stackoverflow_intermediate.json', 'r') as f:
    json_data = json.load(f)  # List of dictionaries

# Create lookup dictionary: {Index: {predicted, confidence}}
json_lookup = {item["Index"]: item for item in json_data}  # Keys are strings

# Initialize results list
results = []

# Iterate through DataFrame rows
for row in df.itertuples():
    # JSON keys are strings, so convert the DataFrame index before the lookup
    pred_entry = json_lookup.get(str(row.Index))

    # Build result dictionary; rows without a prediction keep None
    results.append({
        "Index": row.Index,
        "text": row.text,
        "label": row.label,
        "dataset": row.dataset,
        "split": row.split,
        "predicted": pred_entry["predicted"] if pred_entry else None,
        "confidence": pred_entry["confidence"] if pred_entry else None
    })

# Now `results` contains your combined data
subset_results = results

#################################################
# export to JSON in the format we expect
#################################################
model_name = Config.model_name
categories = bulletpts_intent
start_index = Config.start_index
log_every_n_examples = Config.log_every_n_examples
# Use the last index actually present in the results for the output filename
end_index = max(r['Index'] for r in subset_results)

with open(f'results_{Config.model_name}_{Config.dataset_name}_{Config.start_index}_{end_index}.json', 'w') as f:
    json.dump(subset_results, f, indent=2)

print("Completed intent classification")



Completed intent classification
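
Before analysing the exported results, it may help to check how many rows received no prediction. The sketch below repeats the string-keyed lookup from the cell above on plain Python data and also collects the indexes with no match. The function `join_predictions` and the sample rows are hypothetical and use only the standard library, so this is an illustration of the idea, not the notebook's pandas code.

```python
def join_predictions(rows, predictions):
    """Join (index, text, label) rows with predictions keyed by a string 'Index'.

    Returns the joined records and the list of indexes that had no prediction.
    """
    lookup = {p["Index"]: p for p in predictions}
    joined, missing = [], []
    for index, text, label in rows:
        pred = lookup.get(str(index))
        if pred is None:
            missing.append(index)
            pred = {}
        joined.append({
            "Index": index,
            "text": text,
            "label": label,
            "predicted": pred.get("predicted"),
            "confidence": pred.get("confidence"),
        })
    return joined, missing


# Hypothetical sample data: row 1 has no matching prediction.
rows = [(0, "Scala Regex Multiple Block Capturing", "scala"),
        (1, "Weird bindings issue", "cocoa")]
predictions = [{"Index": "0", "predicted": "scala", "confidence": 0.95}]

joined, missing = join_predictions(rows, predictions)
print(f"{len(missing)} of {len(rows)} rows have no prediction: {missing}")
```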
