# The Task

In goal-oriented dialogue systems, understanding the user's current topic or domain is fundamental to providing relevant assistance. This notebook benchmarks Small Language Models (SLMs) on domain detection using the MultiWOZ dataset. For each SLM, we want to answer the question **_To what extent can this model detect the domain(s) being discussed every time the customer speaks?_**


By SLMs here I mean _small large language models_: instruction fine-tuned, openly available language models with 7 billion parameters or fewer. I've limited myself to Qwen2.5, Gemma3 and Llama3.2 models.


**A Note on Prompting:**
The selected models will be used out-of-the-box, without fine-tuning. While I've made every effort to ensure that the simple prompt I've written for this task works well across models, I note that the effect of prompt engineering on model performance is non-negligible.

# Installing the Latest Version of Transformers

Gemma3 requires it.

In [31]:
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-i43b5ecv
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-i43b5ecv
  Resolved https://github.com/huggingface/transformers.git to commit b54c2f46891149210dbbe118fca55b1357a47003
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers==4.52.0.dev0)
  Downloading huggingface_hub-0.30.1-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.30.1-py3-none-any.whl (481 kB)
Building wheels for collected packages: transformers
  Building wheel for transformers 

## Set Up Your Hugging Face Token

See [here](https://huggingface.co/docs/transformers.js/en/guides/private) for more details.

In [32]:
from huggingface_hub.hf_api import HfFolder
HfFolder.save_token("")  # paste your Hugging Face token between the quotes

# The Dataset

We'll use the MultiWOZ 2.2 validation dataset for this evaluation.

To create the input-output pairs for benchmarking, we process each dialogue turn-by-turn. Specifically, at every point where the **customer** speaks:

* **Input Prompt:** We construct an input prompt for the SLM. This prompt contains the conversation history starting from the first utterance up to and including the *current* customer utterance. This provides the necessary context for the model.
* **Target Output:** The expected output (or ground truth label) for this input is the set of domain(s) that are active or being discussed at that specific turn, according to the MultiWOZ 2.2 annotations.

In [33]:
from datasets import load_dataset

dataset = load_dataset("pfb30/multi_woz_v22", trust_remote_code=True)
dataset

DatasetDict({
    train: Dataset({
        features: ['dialogue_id', 'services', 'turns'],
        num_rows: 8437
    })
    validation: Dataset({
        features: ['dialogue_id', 'services', 'turns'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['dialogue_id', 'services', 'turns'],
        num_rows: 1000
    })
})

In [34]:
validation_data = dataset['validation']

In [35]:
speaker_names = ["Customer", "Agent"] # customer is encoded as speaker 0, and agent is speaker 1
eval_data = []

for conv_idx, conversation in enumerate(validation_data):

    dialogue_id = conversation['dialogue_id']
    
    speaker_ids = conversation['turns']['speaker']
    
    utterances = conversation['turns']['utterance']

    num_utterances = len(utterances)
    dialogue_acts = [conversation['turns']['dialogue_acts'][idx]['dialog_act'] for idx in range(num_utterances)]
    all_domains =  [conversation['turns']['frames'][idx]['service'] for idx in range(num_utterances)]

    # Each time the customer speaks, record the conversation so far,
    # up to and including the current customer utterance.
    utterances_string = ""
    for utterance_idx, (speaker_id, utterance, domains) in enumerate(zip(speaker_ids, utterances, all_domains)):

        speaker_name = speaker_names[speaker_id]
        utterances_string += f"{speaker_name}: {utterance}\n"

        if speaker_name == "Customer":
            eval_data.append([conv_idx, dialogue_id, utterance_idx, utterances_string, domains])

print(eval_data[13], len(eval_data))

[1, 'PMUL3233.json', 14, "Customer: My husband and I are celebrating our anniversary and want to find a great place to stay in town.\nAgent: Congratulations on your upcoming anniversary! Cambridge offers a variety of lodging options, what is your price range?\nCustomer: I would like a 4 star guesthouse that includes free parking.\nAgent: I have several options for you, is there a particular area you are interested in during your stay?\nCustomer: yes should be in the west\nAgent: I have one guesthouse that fits that criteria, Finches Bed and Breakfast. Would you like me to book for you?\nCustomer: Yes, please! We'll arrive on Monday and stay 2 nights. Just the two of us, of course!\nAgent: Ok, your hotel stay at Finches Bed and Breakfast is booked, Reference number FKRO2HOW . Will there be anything else?\nCustomer: I am wanting to know more about the Cambridge Museum of Technology.\nAgent: Sure, it's located in the centre area of town. The phone number is 01223368650. The entrance fee i

In [36]:
import pandas as pd
columns=["conversation_idx", "dialogue_id", "turn_idx", "turn_context", "domains"]
eval_df = pd.DataFrame(eval_data, columns=columns)
eval_df.head(2)

| | conversation_idx | dialogue_id | turn_idx | turn_context | domains |
|---|---|---|---|---|---|
| 0 | 0 | PMUL0698.json | 0 | Customer: I'm looking for a local place to din... | [restaurant] |
| 1 | 0 | PMUL0698.json | 2 | Customer: I'm looking for a local place to din... | [restaurant] |


## Sample from Validation Data

We'll sample 500 customer turns from the validation set for our evaluation.

In [8]:
from datasets import Dataset

SAMPLE_SIZE = 500
sampled_dataset = Dataset.from_pandas(eval_df.sample(SAMPLE_SIZE))
sampled_dataset = sampled_dataset.remove_columns('__index_level_0__')
sampled_dataset

Dataset({
    features: ['conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'domains'],
    num_rows: 500
})
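The sample above is drawn without a fixed seed, so re-running the cell selects a different set of 500 turns and the scores reported later will shift slightly. Below is a minimal sketch of a reproducible draw using only the standard library. The helper name `sample_row_indices`, the seed value and the row count of 7000 are assumptions made for illustration and are not part of the original notebook. (pandas' `DataFrame.sample` also accepts a `random_state` argument for the same purpose.)

```python
import random

def sample_row_indices(num_rows, sample_size, seed=0):
    """Return a sorted, reproducible sample of row positions (hypothetical helper)."""
    if sample_size > num_rows:
        raise ValueError("sample_size cannot exceed num_rows")
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    return sorted(rng.sample(range(num_rows), sample_size))

# The same seed always yields the same positions.
# 7000 is an arbitrary stand-in for len(eval_df).
first_draw = sample_row_indices(num_rows=7000, sample_size=500, seed=0)
second_draw = sample_row_indices(num_rows=7000, sample_size=500, seed=0)
print(first_draw == second_draw, len(first_draw))
```

The returned positions could then be passed to `eval_df.iloc[...]` before building the `Dataset`.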

# Formatting the Input

## Defining the Prompt

In [65]:
VALID_DOMAINS = set([dom for domains in sampled_dataset['domains'] for dom in domains])
VALID_DOMAINS

{'attraction', 'hotel', 'restaurant', 'taxi', 'train'}

In [10]:
prompt_template = """Your task is to identify the domains discussed by the customer in their most recent dialogue turn.

Instructions:
1.  Analyze the customer's last utterance.
2.  Determine which domains from the following list are being discussed: restaurant, hotel, attraction, taxi, train, hospital.
3.  Do not include any domains not mentioned in the above list.
4.  Return your answer as a list of lowercase strings representing the identified domains. Don't say anything else.
5.  Remember, more than one can apply but you'll be penalised for including things that don't belong.
6.  If no domains from the list are mentioned, return an empty list: []

Examples:

--- Example 1 ---
    ```
    Customer: I need a cheap place to eat
    Agent: We have several not expensive places available. What food are you interested in?
    Customer: Chinese food.
    ```
    Output: ["restaurant"]

--- Example 2 ---
    ```
    Customer: I'm looking for a museum near me. Any museum really.
    Agent: Ok! I've found exactly one museum near you.
    Customer: What is the address?
    Agent: It’s 123 Northfolk Road.
    Customer: What sort of museum is it? Before I forget, I also need a train back to Manchester.
    ```
    Output: ["attraction", "train"]

Now, analyse the following dialogue and provide all the domains being discussed in the customer's most recent utterance:
    ```
    {turn_context}
    ```
    Output:"""

## Putting the Input in the Prompt

In [11]:
def preprocessing_function(example):
    prompt = prompt_template.format(turn_context=example["turn_context"])
    messages = [
        {"role": "system", "content": "You are an expert at domain detection. You are listening to a call between a customer and an agent."},
        {"role": "user", "content": prompt}
    ]
    example["messages"] =  messages
    return example

In [12]:
sampled_dataset = sampled_dataset.map(preprocessing_function)
sampled_dataset

Dataset({
    features: ['conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'domains', 'messages'],
    num_rows: 500
})

# Making Predictions on the Dataset

## Function for Parsing Model Output

In [13]:
from ast import literal_eval

def parse_model_outputs(model_outputs):
        
    # model_outputs = [output[0]['generated_text'][-1]['content'] for output in model_outputs]

    num_failed, predicted_domains = 0, []
    
    for output in model_outputs:
        # Anticipating parsing errors
        try:
            output = output[0]['generated_text'][-1]['content']
            output = output.replace('`','').replace('‘', '').replace('’', '')
            prediction = literal_eval(output)
        except Exception as e:
            print(f'Failed to parse with exception {e}', output)
            prediction = []
            num_failed += 1
        predicted_domains.append(prediction)
        
    print('Failed to parse %d predictions out of %d predictions.'%(num_failed, len(model_outputs)))

    return predicted_domains
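The failure logs printed later in this notebook show replies such as `[hotel, train]` (unquoted) or lists wrapped in curly double quotes, both of which `literal_eval` rejects and which are therefore scored as empty predictions. Below is a rough sketch of a more forgiving fallback, assuming the label set is the six domains named in the prompt. `KNOWN_DOMAINS` and `lenient_parse` are hypothetical names, and this parser was not used for any of the numbers reported here.

```python
import re
from ast import literal_eval

# Assumed label set: the six domains named in the prompt.
KNOWN_DOMAINS = ("restaurant", "hotel", "attraction", "taxi", "train", "hospital")

def lenient_parse(raw_output):
    """Parse a model reply into a list of domains, tolerating sloppy quoting."""
    text = raw_output.strip()
    try:
        parsed = literal_eval(text)
        if isinstance(parsed, list):
            # Well-formed list: keep string items as-is (unknown labels are not filtered here).
            return [item for item in parsed if isinstance(item, str)]
    except (ValueError, SyntaxError):
        pass
    # Fallback: keep any known domain that appears as a whole word, in order of appearance.
    found = re.findall(r"\b(" + "|".join(KNOWN_DOMAINS) + r")\b", text.lower())
    return list(dict.fromkeys(found))  # de-duplicate while preserving order

print(lenient_parse('["hotel", "train"]'))
print(lenient_parse("[hotel, train]"))
print(lenient_parse("[“restaurant”, “hotel”, “place to stay”]"))
```

The keyword fallback trades strictness for coverage: it would also pick up a domain word from a chatty reply such as "this is not about a hotel", so it probably suits a secondary, clearly labelled score rather than a replacement for the strict one.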

## More Helpers

In [14]:
import gc
import time
import torch
from transformers.pipelines.pt_utils import KeyDataset
from transformers import pipeline, AutoTokenizer

## The Evaluation Pipeline

In [5]:
def evaluate_domain_predictions(targets, predictions, with_heuristics=False):
    """
    Evaluates domain prediction performance, rounding outputs to 3 decimal places.

    Args:
        targets: A list of lists, where each inner list contains the target domains for a sample.
        predictions: A list of lists, where each inner list contains the predicted domains for a sample.
        with_heuristics: If True, predicted domains outside the set of valid domains are dropped before scoring.

    Returns:
        A dictionary containing evaluation metrics (fractions between 0 and 1, rounded to 3dp):
            - 'exact_match_accuracy': The fraction of samples where the predicted domains exactly match the target domains.
            - 'partial_match_accuracy': The fraction of samples where at least one predicted domain matches a target domain.
            - 'precision': The average precision across all samples.
            - 'recall': The average recall across all samples.
            - 'f1': The average F1-score across all samples.
    """

    if len(targets) != len(predictions):
        raise ValueError("Targets and predictions lists must have the same length.")

    num_samples = len(targets)
    exact_match_count = 0
    partial_match_count = 0
    total_precision = 0.0
    total_recall = 0.0
    total_f1 = 0.0

    for i in range(num_samples):
        target_set = set(targets[i])
        if with_heuristics:
            prediction_set = set([pred for pred in predictions[i] if pred in VALID_DOMAINS])
        else:
            prediction_set = set(predictions[i])

        if target_set == prediction_set:
            exact_match_count += 1

        if len(target_set.intersection(prediction_set)) > 0:
            partial_match_count += 1

        if len(prediction_set) > 0:
            precision = len(target_set.intersection(prediction_set)) / len(prediction_set)
        else:
            precision = 0.0

        if len(target_set) > 0:
            recall = len(target_set.intersection(prediction_set)) / len(target_set)
        else:
            recall = 0.0

        if precision + recall > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0.0

        total_precision += precision
        total_recall += recall
        total_f1 += f1

    exact_match_accuracy = round(exact_match_count / num_samples, 3)
    partial_match_accuracy = round(partial_match_count / num_samples, 3)
    average_precision = round(total_precision / num_samples, 3)
    average_recall = round(total_recall / num_samples, 3)
    average_f1 = round(total_f1 / num_samples, 3)

    return {
        'exact_match_accuracy': exact_match_accuracy,
        'partial_match_accuracy': partial_match_accuracy,
        'precision': average_precision,
        'recall': average_recall,
        'f1': average_f1,
    }

# Example Usage
targets = [["domain1", "domain2"], ["domain3"], ["domain1", "domain4", "domain5"]]
predictions = [["domain1", "domain2"], ["domain3", "domain6"], ["domain1", "domain4"]]

results = evaluate_domain_predictions(targets, predictions)
print(results)

{'exact_match_accuracy': 0.333, 'partial_match_accuracy': 1.0, 'precision': 0.833, 'recall': 0.889, 'f1': 0.822}
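One subtlety of the function above: when a turn has no target domains and the model correctly returns an empty list, the turn counts as an exact match but still scores 0 for precision, recall and F1. The multi-label vectors printed later include all-zero target rows, so such turns do occur in the sample. Below is a sketch of per-sample scoring that makes this convention explicit and switchable. `sample_prf` and its `empty_match_score` parameter are hypothetical names introduced here, not part of the original evaluation.

```python
def sample_prf(target, prediction, empty_match_score=0.0):
    """Per-sample precision, recall and F1 over sets of domains.

    empty_match_score is the value given to all three metrics when both sets
    are empty: 0.0 mirrors the function above, 1.0 gives full credit.
    """
    target_set, prediction_set = set(target), set(prediction)
    if not target_set and not prediction_set:
        return (empty_match_score,) * 3
    overlap = len(target_set & prediction_set)
    precision = overlap / len(prediction_set) if prediction_set else 0.0
    recall = overlap / len(target_set) if target_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

print(sample_prf([], []))                         # (0.0, 0.0, 0.0)
print(sample_prf([], [], empty_match_score=1.0))  # (1.0, 1.0, 1.0)
print(sample_prf(["hotel"], ["hotel", "taxi"]))   # precision 0.5, recall 1.0
```

Whichever convention is chosen, it should be applied to every model so that the comparison stays fair.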


## Gemma3-1B

In [15]:
start_time = time.time()
model_name = "google/gemma-3-1b-it"
PIPE = pipeline("text-generation", model=model_name, temperature=0.01, device_map='cuda')
model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)
print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')

Device set to use cuda


Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb332369d20> [hotel, train]
Failed to parse with exception invalid character '“' (U+201C) (<unknown>, line 1) [“restaurant”, “hotel”, “place to stay”]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb33da3add0> [taxi]
Failed to parse with exception invalid character '“' (U+201C) (<unknown>, line 1) [“lodging accommodations”, “guesthouse”]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb33da3b310> [restaurant, hotel]
Failed to parse with exception invalid character '“' (U+201C) (<unknown>, line 1) [“university arms hotel”, “taxi”]
Failed to parse with exception invalid syntax. Perhaps you forgot a comma? (<unknown>, line 1) [restaurant, address, phone number]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb332348cd0> [attraction, museum]
Failed to parse with exception i

In [16]:
# encode the predictions as strings and save them
results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)
print(results)
sampled_dataset = sampled_dataset.add_column('gemma_3-1b', [str(pred) for pred in predictions])

{'exact_match_accuracy': 0.57, 'partial_match_accuracy': 0.656, 'precision': 0.615, 'recall': 0.61, 'f1': 0.599}


In [18]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

324

## Gemma3-4B

In [19]:
start_time = time.time()
model_name = "google/gemma-3-4b-it"
PIPE = pipeline("text-generation", model=model_name, temperature=0.01, torch_dtype=torch.bfloat16)
model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)
results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)
print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')
print(results)

Device set to use cuda:0


Failed to parse 0 predictions out of 500 predictions.
Finished running predictions on google/gemma-3-4b-it after 1421.12 seconds
{'exact_match_accuracy': 0.44, 'partial_match_accuracy': 0.772, 'precision': 0.629, 'recall': 0.743, 'f1': 0.658}


In [20]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.add_column('gemma_3-4b', [str(pred) for pred in predictions])

In [21]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

794

## Qwen2.5-0.5B

In [22]:
start_time = time.time()

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
PIPE = pipeline("text-generation", model=model_name, tokenizer=tokenizer, temperature=0.01)

model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)

results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)

print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')
print(results)

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Device set to use cuda:0


Failed to parse with exception malformed node or string on line 2: <ast.Name object at 0x7eb327b457b0> 
[restaurant]

Failed to parse 1 predictions out of 500 predictions.
Finished running predictions on Qwen/Qwen2.5-0.5B-Instruct after 113.75 seconds
{'exact_match_accuracy': 0.404, 'partial_match_accuracy': 0.576, 'precision': 0.523, 'recall': 0.534, 'f1': 0.515}


In [23]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.add_column('qwen2.5-0.5b', [str(pred) for pred in predictions])

In [24]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

479

## Qwen2.5-1.5B

In [25]:
start_time = time.time()

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
PIPE = pipeline("text-generation", model=model_name, tokenizer=tokenizer, temperature=0.01)

model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)

results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)

print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')
print(results)

Device set to use cuda:0


Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb327b47700> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb32799b310> [train]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb327647040> [train]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb327647cd0> [taxi]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb327998df0> [train]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb327999150> [transport]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb327999630> [train]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb327646e00> [train]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7eb327646bc0> [trai

In [26]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.add_column('qwen2.5-1.5b', [str(pred) for pred in predictions])

In [27]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

468

## Qwen2.5-3B

In [28]:
start_time = time.time()

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
PIPE = pipeline("text-generation", model=model_name, tokenizer=tokenizer, temperature=0.01, torch_dtype=torch.bfloat16)

model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)

results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)

print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')
print(results)

Device set to use cuda:0


Failed to parse 0 predictions out of 500 predictions.
Finished running predictions on Qwen/Qwen2.5-3B-Instruct after 1079.68 seconds
{'exact_match_accuracy': 0.434, 'partial_match_accuracy': 0.728, 'precision': 0.61, 'recall': 0.694, 'f1': 0.628}


In [29]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.add_column('qwen2.5-3b', [str(pred) for pred in predictions])

In [30]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

446

In [31]:
sampled_dataset.to_csv('/kaggle/working/domain_detection_on_500_sample.csv')

1531756

In [32]:
sampled_dataset

Dataset({
    features: ['conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'domains', 'messages', 'gemma_3-1b', 'gemma_3-4b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-3b'],
    num_rows: 500
})

# Comparing with Gemini 2.0 Flash

In [17]:
sampled_dataset = sampled_dataset.map(preprocessing_function)
sampled_dataset

Dataset({
    features: ['conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'domains', 'messages', 'gemma_3-1b', 'gemma_3-4b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-3b'],
    num_rows: 500
})

In [19]:
from openai import OpenAI

GEMINI_API_KEY = ""  # set your Gemini API key here

client = OpenAI(
    api_key=GEMINI_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

In [53]:
gemini_results = []

for row in sampled_dataset:
    messages = row['messages']
    try:
        response = client.chat.completions.create(
            model="gemini-2.0-flash",
            messages=messages,
            stream=False
        )
        prediction = literal_eval(response.choices[0].message.content)
    except Exception as e:
        # On failure, record an empty prediction rather than reusing the previous row's prediction
        print(f'Failed with exception {e}')
        prediction = []
    gemini_results.append(prediction)
        
len(gemini_results), gemini_results[:2]

(500, [['hotel'], ['train', 'attraction']])

In [67]:
sampled_dataset = sampled_dataset.add_column('gemini_flash20', [str(pred) for pred in gemini_results])

In [68]:
sampled_dataset.to_csv('/kaggle/working/domain_detection_on_500_sample.csv')

1538150

# Summary

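Qwen2.5-1.5B is oddly worse than the 0.5B version on every metric. Its output log shows many unquoted lists (for example `[train]`) that failed to parse and were scored as empty predictions, which may account for part of the gap. Other than that, the larger model in each family has higher recall and F1. Precision also rises with size for Qwen, but is essentially flat for Gemma (0.638 for 1B vs 0.634 for 4B), and Gemma3-1B has a higher exact-match accuracy than Gemma3-4B (0.61 vs 0.448). Gemini 2.0 Flash scores highest on exact match, precision and F1.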
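To make the comparison easier to scan, the `with_heuristics=True` scores printed in this section can be collected and ranked by F1. The sketch below only re-arranges numbers copied from that output. The `reported` dictionary and the column layout are illustrative choices, not part of the original notebook.

```python
# Scores copied from the with_heuristics=True output in this section.
reported = {
    "gemma_3-1b":     {"exact": 0.610, "precision": 0.638, "recall": 0.610, "f1": 0.613},
    "gemma_3-4b":     {"exact": 0.448, "precision": 0.634, "recall": 0.743, "f1": 0.661},
    "qwen2.5-0.5b":   {"exact": 0.416, "precision": 0.526, "recall": 0.534, "f1": 0.517},
    "qwen2.5-1.5b":   {"exact": 0.354, "precision": 0.439, "recall": 0.405, "f1": 0.413},
    "qwen2.5-3b":     {"exact": 0.440, "precision": 0.614, "recall": 0.694, "f1": 0.630},
    "gemini_flash20": {"exact": 0.714, "precision": 0.724, "recall": 0.735, "f1": 0.714},
}

# Rank models from highest to lowest F1.
ranked = sorted(reported.items(), key=lambda item: item[1]["f1"], reverse=True)

print(f"{'model':<16}{'exact':>8}{'precision':>11}{'recall':>8}{'f1':>8}")
for name, scores in ranked:
    print(
        f"{name:<16}{scores['exact']:>8.3f}{scores['precision']:>11.3f}"
        f"{scores['recall']:>8.3f}{scores['f1']:>8.3f}"
    )
```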

In [59]:
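# Every column that is not part of the input data holds one model's predictions
non_model_columns = {'conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'domains', 'messages'}
models_used = [col for col in sampled_dataset.features if col not in non_model_columns]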
models_used

['gemma_3-1b',
 'gemma_3-4b',
 'qwen2.5-0.5b',
 'qwen2.5-1.5b',
 'qwen2.5-3b',
 'gemini_flash20']

In [60]:
domains = sampled_dataset['domains']

for model_name in models_used:
    print(f'------------- {model_name.upper()} ----------')
    predictions = [literal_eval(pred) for pred in sampled_dataset[model_name]]
    print(evaluate_domain_predictions(domains, predictions, with_heuristics=True))

------------- GEMMA_3-1B ----------
{'exact_match_accuracy': 0.61, 'partial_match_accuracy': 0.656, 'precision': 0.638, 'recall': 0.61, 'f1': 0.613}
------------- GEMMA_3-4B ----------
{'exact_match_accuracy': 0.448, 'partial_match_accuracy': 0.772, 'precision': 0.634, 'recall': 0.743, 'f1': 0.661}
------------- QWEN2.5-0.5B ----------
{'exact_match_accuracy': 0.416, 'partial_match_accuracy': 0.576, 'precision': 0.526, 'recall': 0.534, 'f1': 0.517}
------------- QWEN2.5-1.5B ----------
{'exact_match_accuracy': 0.354, 'partial_match_accuracy': 0.45, 'precision': 0.439, 'recall': 0.405, 'f1': 0.413}
------------- QWEN2.5-3B ----------
{'exact_match_accuracy': 0.44, 'partial_match_accuracy': 0.728, 'precision': 0.614, 'recall': 0.694, 'f1': 0.63}
------------- GEMINI_FLASH20 ----------
{'exact_match_accuracy': 0.714, 'partial_match_accuracy': 0.774, 'precision': 0.724, 'recall': 0.735, 'f1': 0.714}


In [3]:
from datasets import Dataset

sampled_dataset = Dataset.from_csv('/kaggle/input/domain-detection-on-500-sample/domain_detection_on_500_sample.csv')
sampled_dataset

Dataset({
    features: ['conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'domains', 'messages', 'gemma_3-1b', 'gemma_3-4b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-3b', 'gemini_flash20'],
    num_rows: 500
})

# Analysing Confusion

In [80]:
from sklearn.metrics import multilabel_confusion_matrix
import numpy as np


def list_to_multilabel(domains):
    """Converts a set of domains to a binary multi-label vector."""
    domain_set = set(domains)
    multilabel_vector = [0] * len(VALID_DOMAINS)
    for i, domain in enumerate(VALID_DOMAINS):
        if domain in domain_set:
            multilabel_vector[i] = 1
    return np.array(multilabel_vector)
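`VALID_DOMAINS` is a set, so the column order of these vectors follows set iteration order, which for strings can differ from one Python session to the next. Below is a small sketch that pins the order by sorting, assuming the five domain names shown in the `VALID_DOMAINS` output. `DOMAIN_ORDER` and `to_multilabel` are hypothetical names, and the function returns a plain list rather than a NumPy array to stay dependency-free.

```python
# Domain names copied from the VALID_DOMAINS output; sorting fixes the column order.
DOMAIN_ORDER = sorted({"attraction", "hotel", "restaurant", "taxi", "train"})

def to_multilabel(domains, domain_order=DOMAIN_ORDER):
    """Binary vector with one column per domain, in a fixed alphabetical order."""
    present = set(domains)
    return [1 if domain in present else 0 for domain in domain_order]

print(DOMAIN_ORDER)
print(to_multilabel(["train", "attraction"]))  # [1, 0, 0, 0, 1]
print(to_multilabel([]))                       # [0, 0, 0, 0, 0]
```

With this approach, the same `DOMAIN_ORDER` list would also need to be used when labelling the per-domain confusion matrices.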

In [82]:
predictions = [literal_eval(pred) for pred in sampled_dataset['gemini_flash20']]
true_labels = [literal_eval(pred) for pred in sampled_dataset['domains']]

In [83]:
# Convert to multi-label format
true_labels_multilabel = np.array([list_to_multilabel(s) for s in true_labels])
predicted_labels_multilabel = np.array([list_to_multilabel(s) for s in predictions])

print("True Labels (Multi-label):\n", true_labels_multilabel)
print("Predicted Labels (Multi-label):\n", predicted_labels_multilabel)

# Now you can use these multi-label arrays with sklearn's confusion matrix function
confusion_matrices = multilabel_confusion_matrix(true_labels_multilabel, predicted_labels_multilabel)

True Labels (Multi-label):
 [[0 1 0 0 0]
 [1 0 0 0 0]
 [0 0 1 0 0]
 ...
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 1 0 0 0]]
Predicted Labels (Multi-label):
 [[0 1 0 0 0]
 [1 0 1 0 0]
 [0 0 1 0 0]
 ...
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]


In [84]:
# Print the confusion matrices for each domain
for i, domain in enumerate(VALID_DOMAINS):
    print(f"\nConfusion Matrix for {domain}:")
    print(confusion_matrices[i])
    tn, fp, fn, tp = confusion_matrices[i].ravel()
    print(f"True Negative (TN): {tn}")
    print(f"False Positive (FP): {fp}")
    print(f"False Negative (FN): {fn}")
    print(f"True Positive (TP): {tp}")


Confusion Matrix for attraction:
[[402  17]
 [  9  72]]
True Negative (TN): 402
False Positive (FP): 17
False Negative (FN): 9
True Positive (TP): 72

Confusion Matrix for hotel:
[[377  16]
 [ 23  84]]
True Negative (TN): 377
False Positive (FP): 16
False Negative (FN): 23
True Positive (TP): 84

Confusion Matrix for train:
[[358  18]
 [  9 115]]
True Negative (TN): 358
False Positive (FP): 18
False Negative (FN): 9
True Positive (TP): 115

Confusion Matrix for taxi:
[[459  11]
 [  4  26]]
True Negative (TN): 459
False Positive (FP): 11
False Negative (FN): 4
True Positive (TP): 26

Confusion Matrix for restaurant:
[[342  30]
 [ 20 108]]
True Negative (TN): 342
False Positive (FP): 30
False Negative (FN): 20
True Positive (TP): 108
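The raw counts above are easier to compare once converted to per-domain precision and recall. The sketch below does that arithmetic on the counts copied from the Gemini 2.0 Flash confusion matrices. The `confusion` dictionary and `precision_recall` helper are names introduced here for illustration.

```python
# (tn, fp, fn, tp) copied from the confusion matrices printed above.
confusion = {
    "attraction": (402, 17, 9, 72),
    "hotel": (377, 16, 23, 84),
    "train": (358, 18, 9, 115),
    "taxi": (459, 11, 4, 26),
    "restaurant": (342, 30, 20, 108),
}

def precision_recall(tn, fp, fn, tp):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN); 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

per_domain = {domain: precision_recall(*counts) for domain, counts in confusion.items()}
for domain, (precision, recall) in per_domain.items():
    print(f"{domain:<12} precision={precision:.3f} recall={recall:.3f}")
```

On these counts, hotel has the lowest recall (84 of 107 hotel turns detected) and taxi the lowest precision (26 of 37 taxi predictions correct).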


In [77]:
domains = [str(dom) for dom in sampled_dataset['domains']]

sampled_dataset = sampled_dataset.remove_columns('domains')
sampled_dataset = sampled_dataset.add_column('domains', domains)
sampled_dataset.to_csv('/kaggle/working/domain_detection_on_500_sample.csv')

1538319