# The Task

In part 2, we evaluated several small language models (SLMs) with fewer than 7 billion parameters on their ability to "detect the domain(s) being discussed every time the customer speaks". We did this using 500 utterances collected from the MultiWOZ dataset. The input-output pairs were as follows:


* **Input Prompt:** We construct an input prompt for the SLM. This prompt contains the conversation history starting from the first utterance up to and including the *current* customer utterance. This provides the necessary context for the model.
* **Target Output:** The expected output (or ground truth label) for this input is the set of domain(s) that are active or being discussed at that specific turn, according to the MultiWOZ 2.2 annotations.

In this notebook, we simplify the input. Rather than using the conversation history as context, we only use the particular utterance we want to detect domains for. We want to see if the simpler input has a higher signal-to-noise ratio than the conversation history.


**A Note on Prompting:**
The selected models are used out of the box, without fine-tuning. While I've made every effort to ensure that the simple prompt I wrote for this task works well across models, the effect of prompt engineering on model performance is non-negligible.

# Installing the Latest Version of Transformers

Gemma 3 requires a recent version of Transformers, so we install it from source.

In [1]:
!pip install git+https://github.com/huggingface/transformers.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-51wurrqa
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-51wurrqa
  Resolved https://github.com/huggingface/transformers.git to commit 08e3217bafddc5d11ce0e7369bcfaaabe5501ba5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.30.0 (from transformers==4.52.0.dev0)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.30.2-py3-none-any.whl (481 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 481.4/481.4 kB 8.9 MB/s eta 0:00:00
Building wheels for collected packages: transformers
  Building wheel for transformers 

## Set Up Your Hugging Face Token

See [here](https://huggingface.co/docs/transformers.js/en/guides/private) for more details.

In [2]:
from huggingface_hub.hf_api import HfFolder
HfFolder.save_token("")

# The Dataset

We'll use the part 1 dataset, modified by taking only the last utterance from each sample's collected utterances.

## Get Modified Input

In [3]:
from datasets import Dataset

sampled_dataset = Dataset.from_csv('../input/domain_detection_on_500_sample.csv')
sampled_dataset

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'messages', 'gemma_3-1b', 'gemma_3-4b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-3b', 'gemini_flash2.0', 'domains'],
    num_rows: 500
})

In [4]:
def get_last_utterance(example):
    return {'turn_utterance': example['turn_context'].strip().split('\n')[-1]}

sampled_dataset = sampled_dataset.map(get_last_utterance)

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [5]:
sampled_dataset[0]

{'conversation_idx': 98,
 'dialogue_id': 'PMUL3891.json',
 'turn_idx': 4,
 'turn_context': "Customer: I need a place to stay in the east that includes free wifi.\nAgent: We have 6 guesthouses and 1 hotel. Is there a specific price range you want to stay in?\nCustomer: I am fine with any price range, please pick a guesthouse.\nAgent: I'd recommend the allenbell in the east. Would you like a room?\nCustomer: Yes, please. I'll need it booked on Sunday, for 5 nights, and it will only be 1 person.\n",
 'messages': '[{\'content\': \'You are an expert at domain detection. You are listening to a call between a customer and an agent.\', \'role\': \'system\'}\n {\'content\': \'Your task is to identify the domains discussed by the customer in their most recent dialogue turn.\\n\\nInstructions:\\n1.  Analyze the customer\\\'s last utterance.\\n2.  Determine which domains from the following list are being discussed: restaurant, hotel, attraction, taxi, train, hospital.\\n3.  Do not include any doma
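To make the effect of `get_last_utterance` concrete, here is a minimal sketch that applies the same string operations to a hand-written context. The sample text and the name `last_utterance` are illustrative and not taken from the notebook. Note that the extracted utterance keeps its `Customer:` speaker prefix.

```python
# Illustrative context; not a row from the dataset.
sample_context = (
    "Customer: I need a place to stay in the east.\n"
    "Agent: We have 6 guesthouses and 1 hotel.\n"
    "Customer: Please pick a guesthouse.\n"
)

def last_utterance(turn_context: str) -> str:
    """Return the last line of a newline-separated context."""
    # strip() removes the trailing newline so split() does not end in ''.
    return turn_context.strip().split("\n")[-1]

print(last_utterance(sample_context))
# Customer: Please pick a guesthouse.
```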

## Prepare the Domains

When saving the dataset, we encode the lists as strings. We need to decode them again.

In [6]:
from ast import literal_eval

domains = sampled_dataset['domains']
sampled_dataset = sampled_dataset.remove_columns('domains')
sampled_dataset = sampled_dataset.add_column('domains', [literal_eval(domain) for domain in domains])
sampled_dataset['domains'][0]

['hotel']
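As a small illustration of the encode/decode round trip described above, the sketch below turns a list into its string form (what ends up in a CSV cell) and recovers it with `literal_eval`. The variable names are illustrative only.

```python
from ast import literal_eval

original = ["hotel", "train"]
encoded = str(original)          # the string stored in the CSV cell
decoded = literal_eval(encoded)  # parses Python literals only; it does not run arbitrary code

print(type(encoded).__name__, type(decoded).__name__)
# str list
print(decoded == original)
# True
```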

# Formatting the Input

## Defining the Prompt

In [7]:
VALID_DOMAINS = set([domain for domains in sampled_dataset['domains'] for domain in domains])
VALID_DOMAINS

{'attraction', 'hotel', 'restaurant', 'taxi', 'train'}

In [8]:
prompt_template = """Your task is to identify the domains discussed by the customer in their most recent dialogue turn.

Instructions:
1.  Analyze the given utterance.
2.  Determine which domains from the following list are being discussed: {all_domains}
3.  Do not include any domains not mentioned in the above list.
4.  Return your answer as a list of lowercase strings representing the identified domains. Don't say anything else.
5.  Remember, more than one domain can apply but you'll be penalised for including things that don't belong.
6.  If no domains from the list are mentioned, return an empty list: []

Examples:

--- Example 1 ---
    ```
    Customer: I need a cheap place to eat
    ```
    Output: ["restaurant"]

--- Example 2 ---
    ```
    Customer: What sort of museum is it? Before I forget, I also need a train back to Manchester.
    ```
    Output: ["attraction", "train"]

Now, analyse the following dialogue and provide all the domains being discussed in the customer's most recent utterance:
    ```
    {utterance}
    ```
    Output:"""

## Putting the Input in the Prompt

In [9]:
def preprocessing_function(example):
    prompt = prompt_template.format(utterance=example["turn_utterance"], all_domains=VALID_DOMAINS)
    messages = [
        {"role": "system", "content": "You are an expert at domain detection. You are listening to a call between a customer and an agent."},
        {"role": "user", "content": prompt}
    ]
    example["messages"] =  messages
    return example

In [10]:
sampled_dataset = sampled_dataset.remove_columns('messages')
sampled_dataset

Dataset({
    features: ['conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'gemma_3-1b', 'gemma_3-4b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-3b', 'gemini_flash2.0', 'turn_utterance', 'domains'],
    num_rows: 500
})

In [11]:
sampled_dataset = sampled_dataset.map(preprocessing_function)
sampled_dataset

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Dataset({
    features: ['conversation_idx', 'dialogue_id', 'turn_idx', 'turn_context', 'gemma_3-1b', 'gemma_3-4b', 'qwen2.5-0.5b', 'qwen2.5-1.5b', 'qwen2.5-3b', 'gemini_flash2.0', 'turn_utterance', 'domains', 'messages'],
    num_rows: 500
})

In [12]:
sampled_dataset['messages'][0]

[{'content': 'You are an expert at domain detection. You are listening to a call between a customer and an agent.',
  'role': 'system'},
 {'content': 'Your task is to identify the domains discussed by the customer in their most recent dialogue turn.\n\nInstructions:\n1.  Analyze the given utterance.\n2.  Determine which domains from the following list are being discussed: {\'train\', \'restaurant\', \'taxi\', \'hotel\', \'attraction\'}\n3.  Do not include any domains not mentioned in the above list.\n4.  Return your answer as a list of lowercase strings representing the identified domains. Don\'t say anything else.\n5.  Remember, more than one domain can apply but you\'ll be penalised for including things that don\'t belong.\n6.  If no domains from the list are mentioned, return an empty list: []\n\nExamples:\n\n--- Example 1 ---\n    ```\n    Customer: I need a cheap place to eat\n    ```\n    Output: ["restaurant"]\n\n--- Example 2 ---\n    ```\n    Customer: What sort of museum is i
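The rendered prompt shows the domain list as a Python set literal, and the iteration order of a set of strings can differ from one Python process to the next. If a stable prompt is wanted, one option is to sort and join the domains before formatting. The sketch below assumes a set equal to `VALID_DOMAINS`; `format_domains` is a hypothetical helper, and switching to it would change the prompt and therefore possibly the results reported in this notebook.

```python
# Assumed to mirror the VALID_DOMAINS set built earlier in the notebook.
valid_domains = {"attraction", "hotel", "restaurant", "taxi", "train"}

def format_domains(domains):
    """Hypothetical helper: render domains in a stable, sorted order."""
    return ", ".join(sorted(domains))

instruction = (
    "Determine which domains from the following list "
    "are being discussed: {all_domains}"
)
print(instruction.format(all_domains=format_domains(valid_domains)))
```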

# Making Predictions on the Dataset

## Function for Parsing Model Output

In [13]:
from ast import literal_eval

def parse_model_outputs(model_outputs):
    """Parse each pipeline output into a Python object, falling back to [] on failure."""
    num_failed, predicted_domains = 0, []

    for output in model_outputs:
        # Anticipating parsing errors
        try:
            # The last chat message holds the model's reply
            output = output[0]['generated_text'][-1]['content']
            # Strip backticks and curly quotes before evaluating the literal
            output = output.replace('`','').replace('‘', '').replace('’', '')
            prediction = literal_eval(output)
        except Exception as e:
            print(f'Failed to parse with exception {e}', output)
            prediction = []
            num_failed += 1
        predicted_domains.append(prediction)

    print('Failed to parse %d predictions out of %d predictions.'%(num_failed, len(model_outputs)))

    return predicted_domains
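The parser above accepts whatever `literal_eval` returns, so replies such as `[['hotel']]`, `['']` or `[travel]` either pass through in an unexpected shape or fail outright. A more forgiving variant is sketched below. It is not the parser used for the results in this notebook, `parse_domain_list` is a hypothetical name, and using it would change the reported numbers. It works on one raw reply string at a time.

```python
import re
from ast import literal_eval

def parse_domain_list(raw: str) -> list:
    """Best-effort parse of one raw model reply into a flat list of strings."""
    text = raw.replace("`", "")
    # Map curly quotes to straight ones instead of deleting them.
    for curly in "‘’":
        text = text.replace(curly, "'")
    for curly in "“”":
        text = text.replace(curly, '"')
    # Keep only the outermost bracketed span, if there is one.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        return []
    try:
        parsed = literal_eval(match.group(0))
    except (ValueError, SyntaxError, TypeError):
        # Fall back to bare words such as [travel].
        return re.findall(r"[A-Za-z_]+", match.group(0))
    # Flatten nested lists such as [['hotel']] and drop empty strings.
    flat, queue = [], [parsed]
    while queue:
        item = queue.pop(0)
        if isinstance(item, (list, tuple, set)):
            queue = list(item) + queue
        elif isinstance(item, str) and item:
            flat.append(item)
    return flat

print(parse_domain_list("[['hotel']]"))
# ['hotel']
print(parse_domain_list("[travel]"))
# ['travel']
```

A recovered label such as `travel` is still not a valid domain, so a filter against the valid domain set remains useful after parsing.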

## More Helpers

In [14]:
import gc
import time
import torch
from transformers.pipelines.pt_utils import KeyDataset
from transformers import pipeline, AutoTokenizer

## The Evaluation Pipeline

In [78]:
def evaluate_domain_predictions(targets, predictions, with_heuristics=False, print_error=True):
    """
    Evaluates domain prediction performance, rounding outputs to 3 decimal places.

    Args:
        targets: A list of lists, where each inner list contains the target domains for a sample.
        predictions: A list of lists, where each inner list contains the predicted domains for a sample.
        with_heuristics: If True, drop predicted domains that are not in VALID_DOMAINS before scoring.
        print_error: If True, print a message when a prediction cannot be converted to a set.

    Returns:
        A dictionary containing evaluation metrics (rounded to 3dp), each a fraction between 0 and 1:
            - 'exact_match_accuracy': The fraction of samples where the predicted domains exactly match the target domains.
            - 'partial_match_accuracy': The fraction of samples where at least one predicted domain matches a target domain.
            - 'precision': The average precision across all samples (0 when nothing is predicted).
            - 'recall': The average recall across all samples (0 when the target is empty).
            - 'f1': The average F1-score across all samples.
    """

    if len(targets) != len(predictions):
        raise ValueError("Targets and predictions lists must have the same length.")

    num_samples = len(targets)
    exact_match_count = 0
    partial_match_count = 0
    total_precision = 0.0
    total_recall = 0.0
    total_f1 = 0.0

    for i in range(num_samples):
        target_set = set(targets[i])

        try:
            if with_heuristics:
                prediction_set = set([pred for pred in predictions[i] if pred in VALID_DOMAINS])
            else:
                prediction_set = set(predictions[i])
        except Exception as e:
            prediction_set = set()
            if print_error:
                print(f"Something's wrong here {e}. Predicted domain {predictions[i]}. Correct domain {target_set}")
        
        if target_set == prediction_set:
            exact_match_count += 1

        if len(target_set.intersection(prediction_set)) > 0:
            partial_match_count += 1

        if len(prediction_set) > 0:
            precision = len(target_set.intersection(prediction_set)) / len(prediction_set)
        else:
            precision = 0.0

        if len(target_set) > 0:
            recall = len(target_set.intersection(prediction_set)) / len(target_set)
        else:
            recall = 0.0

        if precision + recall > 0:
            f1 = 2 * (precision * recall) / (precision + recall)
        else:
            f1 = 0.0

        total_precision += precision
        total_recall += recall
        total_f1 += f1

    exact_match_accuracy = round(exact_match_count / num_samples, 3)
    partial_match_accuracy = round(partial_match_count / num_samples, 3)
    average_precision = round(total_precision / num_samples, 3)
    average_recall = round(total_recall / num_samples, 3)
    average_f1 = round(total_f1 / num_samples, 3)

    return {
        'exact_match_accuracy': exact_match_accuracy,
        'partial_match_accuracy': partial_match_accuracy,
        'precision': average_precision,
        'recall': average_recall,
        'f1': average_f1,
    }

# Example Usage
targets = [["domain1", "domain2"], ["domain3"], ["domain1", "domain4", "domain5"]]
predictions = [["domain1", "domain2"], ["domain3", "domain6"], ["domain1", "domain4"]]

results = evaluate_domain_predictions(targets, predictions)
print(results)

{'exact_match_accuracy': 0.333, 'partial_match_accuracy': 1.0, 'precision': 0.833, 'recall': 0.889, 'f1': 0.822}
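One property of these metrics is worth spelling out: a sample whose target and prediction are both empty counts as an exact match, but contributes 0 to partial-match accuracy, precision, recall and F1. This is one way exact-match accuracy can end up higher than partial-match accuracy. The sketch below recomputes the per-sample quantities with the same rules as `evaluate_domain_predictions`; `per_sample_scores` is an illustrative name, not part of the notebook.

```python
def per_sample_scores(target, prediction):
    """Recompute the per-sample quantities with the same rules as the evaluator."""
    target_set, prediction_set = set(target), set(prediction)
    overlap = len(target_set & prediction_set)
    precision = overlap / len(prediction_set) if prediction_set else 0.0
    recall = overlap / len(target_set) if target_set else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "exact": target_set == prediction_set,
        "partial": overlap > 0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# A turn with no active domain, correctly predicted as empty.
print(per_sample_scores([], []))
# {'exact': True, 'partial': False, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```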


## Gemma3-1B

In [16]:
start_time = time.time()
model_name = "google/gemma-3-1b-it"
PIPE = pipeline("text-generation", model=model_name, temperature=0.01, device_map='cuda')
model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)
print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')

Device set to use cuda


Failed to parse 0 predictions out of 500 predictions.
Finished running predictions on google/gemma-3-1b-it after 127.15 seconds


In [18]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.remove_columns('gemma_3-1b')
results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)
print(results)
sampled_dataset = sampled_dataset.add_column('gemma_3-1b', [str(pred) for pred in predictions])

{'exact_match_accuracy': 0.552, 'partial_match_accuracy': 0.464, 'precision': 0.456, 'recall': 0.425, 'f1': 0.433}


In [19]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

88

## Gemma3-4B

In [20]:
start_time = time.time()
model_name = "google/gemma-3-4b-it"
PIPE = pipeline("text-generation", model=model_name, temperature=0.01, torch_dtype=torch.bfloat16)
model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4,)
predictions = parse_model_outputs(model_outputs)
results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)
print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')
print(results)

Device set to use cuda:0


Failed to parse 0 predictions out of 500 predictions.
Finished running predictions on google/gemma-3-4b-it after 777.91 seconds
{'exact_match_accuracy': 0.584, 'partial_match_accuracy': 0.496, 'precision': 0.488, 'recall': 0.461, 'f1': 0.468}


In [21]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.remove_columns('gemma_3-4b')
results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)
print(results)
sampled_dataset = sampled_dataset.add_column('gemma_3-4b', [str(pred) for pred in predictions])

{'exact_match_accuracy': 0.584, 'partial_match_accuracy': 0.496, 'precision': 0.488, 'recall': 0.461, 'f1': 0.468}


In [22]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

399

## Qwen2.5-0.5B

In [35]:
start_time = time.time()

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
PIPE = pipeline("text-generation", model=model_name, tokenizer=tokenizer, temperature=0.01)

model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)
print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')

Device set to use cuda:0


Failed to parse 0 predictions out of 500 predictions.
Finished running predictions on Qwen/Qwen2.5-0.5B-Instruct after 66.64 seconds


In [51]:
results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)

Something's wrong here [['hotel']]. Correct domain {'hotel'}
Something's wrong here [['hotel']]. Correct domain {'train'}
Something's wrong here [['hotel']]. Correct domain set()
Something's wrong here [['attraction']]. Correct domain set()
Something's wrong here [['attraction']]. Correct domain {'restaurant'}
Something's wrong here [['hotel']]. Correct domain {'train'}
Something's wrong here [['hotel']]. Correct domain set()
Something's wrong here [['']]. Correct domain set()
Something's wrong here [['phone']]. Correct domain {'attraction'}
Something's wrong here [['']]. Correct domain set()
Something's wrong here [['']]. Correct domain set()
Something's wrong here [['']]. Correct domain set()
Something's wrong here [['hotel']]. Correct domain set()
Something's wrong here [['']]. Correct domain set()
Something's wrong here [['hotel']]. Correct domain set()
Something's wrong here [['hotel']]. Correct domain set()
Something's wrong here [['hotel']]. Correct domain set()
Something's wron

In [52]:
print(results)

{'exact_match_accuracy': 0.51, 'partial_match_accuracy': 0.504, 'precision': 0.488, 'recall': 0.461, 'f1': 0.467}


In [53]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.remove_columns('qwen2.5-0.5b')
sampled_dataset = sampled_dataset.add_column('qwen2.5-0.5b', [str(pred) for pred in predictions])

In [54]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

3180

## Qwen2.5-1.5B

In [55]:
start_time = time.time()

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
PIPE = pipeline("text-generation", model=model_name, tokenizer=tokenizer, temperature=0.01)

model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)

results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)

print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')
print(results)

Device set to use cuda:0


Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f7210440e50> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f72057d43a0> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f72057d43d0> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f7205ed6aa0> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f72057d7c70> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f72057d4310> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f72057d5510> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f7205b06530> [travel]
Failed to parse with exception malformed node or string on line 1: <ast.Name object at 0x7f72057d5300> [

In [56]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.remove_columns('qwen2.5-1.5b')
sampled_dataset = sampled_dataset.add_column('qwen2.5-1.5b', [str(pred) for pred in predictions])

In [57]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

66

## Qwen2.5-3B

In [58]:
start_time = time.time()

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
PIPE = pipeline("text-generation", model=model_name, tokenizer=tokenizer, temperature=0.01, torch_dtype=torch.bfloat16)

model_outputs = PIPE(KeyDataset(sampled_dataset, 'messages'), batch_size=4)
predictions = parse_model_outputs(model_outputs)

results = evaluate_domain_predictions(sampled_dataset['domains'], predictions)

print(f'Finished running predictions on {model_name} after {round(time.time() - start_time, 2)} seconds')
print(results)

Device set to use cuda:0


Failed to parse 0 predictions out of 500 predictions.
Finished running predictions on Qwen/Qwen2.5-3B-Instruct after 597.73 seconds
{'exact_match_accuracy': 0.642, 'partial_match_accuracy': 0.56, 'precision': 0.554, 'recall': 0.518, 'f1': 0.529}


In [60]:
# encode the predictions as strings and save them
sampled_dataset = sampled_dataset.remove_columns('qwen2.5-3b')
sampled_dataset = sampled_dataset.add_column('qwen2.5-3b', [str(pred) for pred in predictions])

In [61]:
del PIPE
torch.cuda.empty_cache()
gc.collect()

51

# Comparing with Gemini 2.0 Flash

In [59]:
from openai import OpenAI

GEMINI_API_KEY = ""  # set your Gemini API key here

client = OpenAI(
    api_key=GEMINI_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

In [62]:
gemini_results = []

for row in sampled_dataset:
    messages = row['messages']
    try:
        response = client.chat.completions.create(
            model="gemini-2.0-flash",
            messages=messages,
            stream=False
        )
        prediction = literal_eval(response.choices[0].message.content)
        gemini_results.append(prediction)
    except Exception as e:
        # Record an empty prediction rather than reusing the previous row's value
        print(f'Failed with exception {e}')
        gemini_results.append([])
        
len(gemini_results), gemini_results[:2]

(500, [['hotel'], ['attraction']])

In [64]:
results = evaluate_domain_predictions(sampled_dataset['domains'], gemini_results)
print(results)

{'exact_match_accuracy': 0.682, 'partial_match_accuracy': 0.59, 'precision': 0.586, 'recall': 0.552, 'f1': 0.562}


In [65]:
sampled_dataset = sampled_dataset.remove_columns('gemini_flash2.0')
sampled_dataset = sampled_dataset.add_column('gemini_flash2.0', [str(pred) for pred in gemini_results])

In [66]:
sampled_dataset.to_csv('/kaggle/working/domain_detection_on_500_sample_p2.csv')

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

1095456

# Summary

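Limiting the input to the current utterance does not help. Even Gemini 2.0 Flash reaches only an exact-match accuracy of 0.682 and an F1 of 0.562, and Qwen2.5-3B reaches 0.642 and 0.529.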

In [88]:
domains = sampled_dataset['domains']
sampled_dataset = sampled_dataset.remove_columns('domains')
sampled_dataset = sampled_dataset.add_column('domains', [str(domain) for domain in domains])

In [89]:
sampled_dataset.to_csv('/kaggle/working/domain_detection_on_500_sample_pt2.csv')

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

1095624

In [92]:
models_used = list(sampled_dataset.features)[6:12]
models_used

['gemma_3-1b',
 'gemma_3-4b',
 'qwen2.5-0.5b',
 'qwen2.5-1.5b',
 'qwen2.5-3b',
 'gemini_flash2.0']

In [93]:
domains = [literal_eval(d) for d in sampled_dataset['domains']]

for model_name in models_used:
    print(f'------------- {model_name.upper()} ----------')
    predictions = [literal_eval(pred) for pred in sampled_dataset[model_name]]
    print(evaluate_domain_predictions(domains, predictions, with_heuristics=True))

------------- GEMMA_3-1B ----------
{'exact_match_accuracy': 0.556, 'partial_match_accuracy': 0.464, 'precision': 0.46, 'recall': 0.425, 'f1': 0.435}
------------- GEMMA_3-4B ----------
{'exact_match_accuracy': 0.584, 'partial_match_accuracy': 0.496, 'precision': 0.488, 'recall': 0.461, 'f1': 0.468}
------------- QWEN2.5-0.5B ----------
Something's wrong here unhashable type: 'list'. Predicted domain [['hotel']]. Correct domain {'hotel'}
Something's wrong here unhashable type: 'list'. Predicted domain [['hotel']]. Correct domain {'train'}
Something's wrong here unhashable type: 'list'. Predicted domain [['hotel']]. Correct domain set()
Something's wrong here unhashable type: 'list'. Predicted domain [['attraction']]. Correct domain set()
Something's wrong here unhashable type: 'list'. Predicted domain [['attraction']]. Correct domain {'restaurant'}
Something's wrong here unhashable type: 'list'. Predicted domain [['hotel']]. Correct domain {'train'}
Something's wrong here unhashable ty