# Overview

**Project Overview**: I've decided to start by creating a "smart" gateway, or LLM router, that routes each user prompt to a specific model based on prompt characteristics, token count, and creativity requirements.

**Technical Implementation Decisions:** Since I'm building a gateway, I don't want to rely on frameworks or other middleware. To keep flexibility and control over the entire process, I'll call the model APIs directly. To keep outputs stable and easy to parse, I'll use Pydantic for structured data handling.
For demonstration purposes, I'll use OpenAI models, though in production this could include both open-source and closed-source models of various types.

# 1 Setup

First, we need to import the necessary libraries, load the datasets, and connect to the APIs. To run this notebook, you need your own Hugging Face access token and OpenAI API key. The datasets used can be found in the Git repository.
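The setup cell below reads the OpenAI key through Colab's `userdata`. For running the notebook outside Colab, a fallback along these lines might help. It is only a sketch: the helper name `get_secret` is hypothetical (not part of the notebook), and it assumes the secret is exported as an environment variable under the same name.

```python
import os
from typing import Optional


def get_secret(name: str) -> Optional[str]:
    """Return a secret from Colab userdata if available, else from the environment.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata

        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is not configured there.
        pass
    return os.environ.get(name)


# Possible usage: client = OpenAI(api_key=get_secret("OPENAI_API_KEY"))
```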

In [None]:
!pip install -U datasets

In [4]:
# AI
import tiktoken
from openai import OpenAI

# Pydantic and Formatting
from pydantic import BaseModel, Field, ConfigDict
from typing import Optional, Literal, Tuple, Dict, Any

# Data Manipulations
import pandas as pd
import numpy as np
from datasets import load_dataset

# ML
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
)
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.metrics import classification_report
from huggingface_hub import notebook_login

# Other
import os
from google.colab import userdata
import base64
import time
import re

client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))

# Load the training dataset
file_path = "/content/Train DataFrame.csv"
model_train_df = pd.read_csv(file_path)


# Load the Chatbot Arena dataset (requires a Hugging Face login)
notebook_login()

chat_bot_arena_df = pd.read_parquet(
    "hf://datasets/lmsys/chatbot_arena_conversations/data/train-00000-of-00001-cced8514c7ed782a.parquet"
)


# 2 Initial Approach



This section has two parts. The first is a user-like interaction, where you can pass different queries to the smart router and receive results. In the second, I run tests on the MMLU benchmark to see how the router performs, which models it selects, and what results we get.

## 2.1 Smart Router

**Current Implementation**: I've added analysis functionality to show why a specific model was assigned, which helps us understand the model output and routing decisions. The current code has many components; in the next steps I'll remove the unnecessary ones to streamline the implementation.


In [None]:
class Analysis(BaseModel):
    """Analysis of the user query to determine routing requirements."""

    model_config = ConfigDict(str_strip_whitespace=True)

    task_type: str
    complexity_score: int = Field(ge=1, le=5)
    domain: str
    performance_needs: Literal["speed", "quality"]
    estimated_tokens: int = Field(gt=0)
    creativity: Literal["low", "medium", "high"]


class RoutingDecision(BaseModel):
    """Decision about which LLM to use for the query."""

    model_config = ConfigDict(str_strip_whitespace=True)

    recommended_llm: Literal["gpt-4.1-nano", "gpt-4.1", "o4-mini"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str


class EvaluationResponse(BaseModel):
    """Complete evaluation response containing analysis and routing decision."""

    analysis: Analysis
    routing_decision: RoutingDecision


class RouteLLM:

    def __init__(self, query: str, temperature: float, client: OpenAI) -> None:
        self.query = query
        self.temperature = temperature
        self.client = client

    def format_evaluation_prompt(self) -> str:
        """
        Format evaluation prompt which takes into account temperature and number of tokens
        """

        enc = tiktoken.get_encoding("o200k_base")
        num_tokens = len(enc.encode(self.query))

        system_prompt = f"""
        ### Context
        Evaluate the user query with specific attention to parameters: temperature (`{self.temperature}`) and number of tokens (`{num_tokens}`). These parameters influence creativity and complexity of tasks, affecting model selection.

        ### Role
        You are an evaluator responsible for choosing the most suitable model for the user query, ensuring alignment with user-provided constraints on temperature and token count.

        ### Task
        Analyze the User Query Using These Criteria:
        1. Task Type: Determine the user's specific goal or need.
        2. Complexity: Score between 1 and 5, reflecting reasoning depth needed; token count {num_tokens} should guide complexity assessment.
        3. Domain: Assess whether specialized knowledge is required.
        4. Performance Needs: Consider if speed or quality is more important, influenced by the balance between high token count and low count settings.
        5. Token Count Estimation: Predict token usage based on the complexity and detail expected from the user's specified token count.
        6. Creativity Needs: Ascertain the creativity level needed based on temperature {self.temperature}.

        Model Selection:
        - Choose the model that best fits the task, respecting specified creativity needs and token constraints.

        ### Available Models
        1. gpt-4.1-nano
           - Strengths: Fast, cost-effective
           - Best for: Simple Q&A, basic tasks, low token and creativity needs
           - Cost: $0.2/1M tokens

        2. gpt-4.1
           - Strengths: Complex reasoning, problem-solving
           - Best for: Detailed coding, nuanced tasks with medium to high token and creativity needs
           - Cost: $2.00/1M tokens

        3. o4-mini
           - Strengths: Superior reasoning, creativity
           - Best for: Complex problem-solving, detailed planning with high creativity and token demands
           - Cost: $1.10/1M tokens

        ### Examples
        #### Example 1
        Prompt: 'What is 2+2?'
        Output:
        ```json
        {{
            "analysis": {{
                "task_type": "calculation",
                "complexity_score": 1,
                "domain": "math",
                "performance_needs": "speed",
                "estimated_tokens": 10,
                "creativity": "low"
            }},
            "routing_decision": {{
                "recommended_llm": "gpt-4.1-nano",
                "confidence": 0.8,
                "reasoning": "The task is simple with low token and creativity needs, suitable for gpt-4.1-nano."
            }}
        }}
        ```

        #### Example 2
        Prompt: 'Write Python code for a basic analysis tool using langchain.'
        Output:
        ```json
        {{
            "analysis": {{
                "task_type": "coding",
                "complexity_score": 3,
                "domain": "coding",
                "performance_needs": "quality",
                "estimated_tokens": 1642,
                "creativity": "medium"
            }},
            "routing_decision": {{
                "recommended_llm": "gpt-4.1",
                "confidence": 0.9,
                "reasoning": "Given medium creativity and high token count required, gpt-4.1 fits the needs."
            }}
        }}
        ```

        #### Example 3
        Prompt: 'Implement a classification model for my binary problem.'
        Output:
        ```json
        {{
            "analysis": {{
                "task_type": "data science",
                "complexity_score": 5,
                "domain": "data science",
                "performance_needs": "quality",
                "estimated_tokens": 3600,
                "creativity": "medium"
            }},
            "routing_decision": {{
                "recommended_llm": "o4-mini",
                "confidence": 0.9,
                "reasoning": "Given task difficulty, o4-mini fits the needs."
            }}
        }}
        ```

        ### Constraints
        - Perform detailed, step-by-step reasoning, considering the user-defined temperature and token constraints.
       """

        return system_prompt

    def get_evaluation_response(self) -> str:
        """
        Ask the evaluator model for a routing decision and return the recommended model name
        """

        response = self.client.beta.chat.completions.parse(
            model="gpt-4.1-nano",
            messages=[
                {"role": "system", "content": self.format_evaluation_prompt()},
                {"role": "user", "content": self.query},
            ],
            temperature=0.7,
            response_format=EvaluationResponse,
        )

        validated_response = response.choices[0].message.parsed
        print(validated_response)

        return validated_response.routing_decision.recommended_llm

    def get_gpt_response(self, model_name: str) -> str:
        """
        Get the response from routed model
        """

        response = self.client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": self.query}],
            temperature=self.temperature,
        )
        print(response.choices[0].message.content)

        return response.choices[0].message.content

    def orchestrate_response(self) -> Tuple[str, str]:
        """
        Route the query and return the routed model's answer and the model name
        """

        recommended_llm = self.get_evaluation_response()
        print(recommended_llm)

        if recommended_llm == "gpt-4.1-nano":
            return self.get_gpt_response("gpt-4.1-nano"), "gpt-4.1-nano"
        elif recommended_llm == "gpt-4.1":
            return self.get_gpt_response("gpt-4.1"), "gpt-4.1"
        elif recommended_llm == "o4-mini":
            return self.get_gpt_response("o4-mini"), "o4-mini"


query = input("Enter your query: ")
default_temperature = 0.7

temp_input = input(
    f"Enter the temperature or leave blank for default, default is [{default_temperature}]: "
)

if temp_input.strip() == "":
    temperature = default_temperature
else:
    temperature = float(temp_input)

llm = RouteLLM(query, temperature, client)
llm.orchestrate_response()

Enter your query: what is capital of france?
Enter the temperature or leave blank for default, default is [0.7]: 
analysis=Analysis(task_type='factual question', complexity_score=1, domain='geography', performance_needs='speed', estimated_tokens=6, creativity='low') routing_decision=RoutingDecision(recommended_llm='gpt-4.1-nano', confidence=0.8, reasoning='The question is a simple factual query with low complexity and knowledge needs, requiring quick response and minimal creativity, making gpt-4.1-nano suitable.')
gpt-4.1-nano
The capital of France is Paris.


The initial approach works, but it has several issues that should be addressed:

* The initial prompt is too large and uses too many tokens, so we need to shrink it
* We don't take files into account, even though users can pass them too, and this will impact our performance
* The analysis part is not necessary in production; however, it suggests adding a classifier for domain or prompt difficulty, which could help us achieve better results with our own models
* Lastly, this approach doesn't check the structure of the prompt, for example whether it contains XML. Such patterns can indicate that the user needs a more sophisticated model with more complex prompting
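For the last point, one possible way to detect XML-style structure in a prompt is a small regular-expression check. This is a minimal sketch, not a full XML parser: the function name `looks_like_xml` and the rule "at least one matching open/close tag pair" are assumptions made for illustration, and it is stricter than simply checking whether `<` and `>` both appear in the query.

```python
import re

# Matches an opening tag such as <context> or <doc id="1">, capturing the tag name.
_OPEN_TAG = re.compile(r"<([A-Za-z_][\w\-.]*)(?:\s[^<>]*)?>")


def looks_like_xml(prompt: str) -> bool:
    """Return True if the prompt contains at least one <tag>...</tag> pair.

    Heuristic only (hypothetical helper): comparisons such as "a<b and c>d"
    are not treated as markup because no matching closing tag is present.
    """
    for match in _OPEN_TAG.finditer(prompt):
        closing = f"</{match.group(1)}>"
        if closing in prompt[match.end():]:
            return True
    return False
```

A plain question containing comparison operators is then not flagged as structured, while a prompt that wraps its context in tags is.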

## 2.2 Smart Router Evaluation

Next, I wanted to test how this router performs. For this purpose I use several rows from different MMLU subsets. The reasoning is that with a router we should be able to save money while maintaining accuracy, so we compare sending all questions to a single GPT-4.1 model against distributing them among the models.
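One caveat when scoring: the test below compares the raw model output with the correct letter by exact string match, and the routed model's reply is not capped at a single token, so a reply such as "B. Product" would count as wrong. A normalization step like the following could reduce that risk. It is a sketch only; `extract_choice` is a hypothetical helper and the regular expression is an assumption about typical reply formats.

```python
import re
from typing import Optional

# A standalone capital letter A-D, e.g. "B", "B.", "(C)", "Answer: D".
_CHOICE = re.compile(r"\b([ABCD])\b")


def extract_choice(reply: str) -> Optional[str]:
    """Return the first standalone A/B/C/D letter in a model reply, or None.

    Heuristic: a reply that starts with the article "A" would be read as choice A.
    """
    match = _CHOICE.search(reply.strip())
    return match.group(1) if match else None
```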

In [None]:
class MMLURouterTest:
    def __init__(self, temperature: float = 0.2):
        self.temperature = temperature
        self.client = client
        self.choices = ["A", "B", "C", "D"]
        self.domains = ["marketing", "machine_learning"]
        self.results_df = pd.DataFrame()

    def get_mmlu_data(self, domain: str) -> pd.DataFrame:
        """
        Get the MMLU data set for marketing and machine learning
        """

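        # Note: the "dev" split is small (5 examples per subject, as far as I know),
        # so head(20) below returns fewer than 20 rows.
        mmlu_df = pd.DataFrame(load_dataset("cais/mmlu", domain)["dev"])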
        samples_df = mmlu_df.head(20)
        return samples_df

    def get_query(self, row: pd.Series) -> str:
        """
        Create a query for the MMLU data set test
        """

        subject = row["subject"]
        question = row["question"]
        choices = row["choices"]

        formatted_choices = ""
        choice_letters = ["A", "B", "C", "D"]
        for i, choice in enumerate(choices):
            formatted_choices += f"{choice_letters[i]}. {choice}\n"

        formatted_query = f"""The following are questions (with answers) about {subject}.

        {question}
        {formatted_choices.strip()}
        Answer with only the letter (A, B, C, or D):"""

        return formatted_query

    def get_simple_gpt41_response(
        self, query: str, temperature: Optional[float] = None
    ) -> Tuple[str, float]:
        """
        Pass test query to gpt 4.1 without routing and get response
        """

        if temperature is None:
            temperature = self.temperature

        start_time = time.time()
        response = self.client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": query}],
            temperature=temperature,
            max_tokens=1,
        )
        end_time = time.time()
        latency = end_time - start_time

        return response.choices[0].message.content.strip(), latency

    def calculate_accuracy(
        self, predictions: pd.Series, true_answers: pd.Series
    ) -> float:
        """Helper method to calculate accuracy"""

        return (predictions == true_answers).mean() * 100

    def get_router_model_stats(self, router_models: pd.Series) -> dict:
        """Helper method to get router model usage statistics"""

        return router_models.value_counts(normalize=True).mul(100).to_dict()

    def get_accuracy_and_stats(self) -> pd.DataFrame:
        """Return accuracy and model usage stats as DataFrame including latency"""

        if self.results_df.empty:
            return pd.DataFrame()

        # Calculate accuracies
        router_correct = (
            self.results_df["router_answer"] == self.results_df["true_answer"]
        ).sum()
        simple_correct = (
            self.results_df["simple_gpt41_answer"] == self.results_df["true_answer"]
        ).sum()
        total = len(self.results_df)

        router_accuracy = (router_correct / total) * 100
        simple_accuracy = (simple_correct / total) * 100

        # Calculate model usage percentages
        model_counts = self.results_df["router_model"].value_counts()
        model_percentages = (model_counts / total * 100).to_dict()

        # Calculate average latencies per model
        avg_latencies = {}
        for model in self.results_df["router_model"].unique():
            model_latencies = self.results_df[self.results_df["router_model"] == model][
                "router_latency"
            ]
            avg_latencies[f"{model}_avg_latency"] = model_latencies.mean()
        simple_avg_latency = self.results_df["simple_gpt41_latency"].mean()

        metrics = [
            "router_accuracy",
            "simple_gpt41_accuracy",
            "simple_gpt41_avg_latency",
        ]
        values = [router_accuracy, simple_accuracy, simple_avg_latency]

        metrics.extend(
            [f"{model}_usage_percentage" for model in model_percentages.keys()]
        )
        values.extend(list(model_percentages.values()))

        metrics.extend(list(avg_latencies.keys()))
        values.extend(list(avg_latencies.values()))

        stats_data = {"metric": metrics, "value": values}

        return pd.DataFrame(stats_data)

    def get_detailed_stats(self) -> dict:
        """Return detailed statistics as a dictionary including latency stats"""

        if self.results_df.empty:
            return {}

        router_accuracy = self.calculate_accuracy(
            self.results_df["router_answer"], self.results_df["true_answer"]
        )
        simple_gpt41_accuracy = self.calculate_accuracy(
            self.results_df["simple_gpt41_answer"], self.results_df["true_answer"]
        )

        latency_stats = {}
        for model in self.results_df["router_model"].unique():
            model_latencies = self.results_df[self.results_df["router_model"] == model][
                "router_latency"
            ]
            latency_stats[model] = {
                "avg_latency": model_latencies.mean(),
            }

        simple_latencies = self.results_df["simple_gpt41_latency"]
        latency_stats["simple_gpt41"] = {
            "avg_latency": simple_latencies.mean(),
        }

        return {
            "overall_accuracy": {
                "router": router_accuracy,
                "simple_gpt41": simple_gpt41_accuracy,
                "difference": router_accuracy - simple_gpt41_accuracy,
            },
            "router_model_usage": self.get_router_model_stats(
                self.results_df["router_model"]
            ),
            "latency_statistics": latency_stats,
            "total_questions": len(self.results_df),
            "correct_answers": {
                "router": sum(self.results_df["router_correct"]),
                "simple_gpt41": sum(self.results_df["simple_gpt41_correct"]),
            },
        }

    def pass_questions(self) -> pd.DataFrame:
        """
        Pass questions to the router and get the results
        """

        all_results = []

        for domain in self.domains:
            print(f"Processing domain: {domain}")
            df = self.get_mmlu_data(domain)

            for index, row in df.iterrows():
                query = self.get_query(row)

                start_time = time.time()
                router = RouteLLM(query, self.temperature, self.client)
                router_answer, router_model = router.orchestrate_response()
                end_time = time.time()
                router_latency = end_time - start_time

                simple_gpt41_answer, simple_gpt41_latency = (
                    self.get_simple_gpt41_response(query, self.temperature)
                )

                true_answer = self.choices[row["answer"]]
                router_correct = router_answer == true_answer
                simple_gpt41_correct = simple_gpt41_answer == true_answer

                result = {
                    "subject": row["subject"],
                    "question": row["question"],
                    "choices": row["choices"],
                    "true_answer": true_answer,
                    "router_model": router_model,
                    "router_answer": router_answer,
                    "router_latency": router_latency,
                    "router_correct": router_correct,
                    "simple_gpt41_answer": simple_gpt41_answer,
                    "simple_gpt41_latency": simple_gpt41_latency,
                    "simple_gpt41_correct": simple_gpt41_correct,
                }
                all_results.append(result)

        self.results_df = pd.DataFrame(all_results)
        return self.results_df


tester = MMLURouterTest()
results = tester.pass_questions()

stats_df = tester.get_accuracy_and_stats()

Processing domain: marketing
analysis=Analysis(task_type='multiple-choice question about marketing concepts', complexity_score=2, domain='marketing', performance_needs='speed', estimated_tokens=30, creativity='low') routing_decision=RoutingDecision(recommended_llm='gpt-4.1-nano', confidence=0.9, reasoning='The question tests basic marketing knowledge, requiring a straightforward answer with low complexity and minimal creativity, suitable for the faster, cost-effective gpt-4.1-nano model.')
gpt-4.1-nano
A
analysis=Analysis(task_type='multiple-choice question about organizational terminology in marketing', complexity_score=2, domain='marketing', performance_needs='speed', estimated_tokens=40, creativity='low') routing_decision=RoutingDecision(recommended_llm='gpt-4.1-nano', confidence=0.8, reasoning='The question is straightforward and factual, requiring low complexity and minimal creativity, suitable for the faster, cost-effective gpt-4.1-nano model.')
gpt-4.1-nano
D
analysis=Analysis(t

In [None]:
results.head()

| | subject | question | choices | true_answer | router_model | router_answer | router_latency | router_correct | simple_gpt41_answer | simple_gpt41_latency | simple_gpt41_correct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | marketing | _____________ is a natural outcome when combi... | [Geodemographics, Product differentiation., AN... | A | gpt-4.1-nano | A | 3.957749 | True | A | 0.977545 | True |
| 1 | marketing | In an organization, the group of people tasked... | [Outsourcing unit., Procurement centre., Chief... | D | gpt-4.1-nano | D | 6.238024 | True | D | 0.292945 | True |
| 2 | marketing | Which of the following is an assumption in Ma... | [Needs are dependent on culture and also on so... | B | gpt-4.1-nano | B | 1.859863 | True | B | 0.855934 | True |
| 3 | marketing | The single group within society that is most v... | [The older consumer who feels somewhat left ou... | D | gpt-4.1-nano | D | 1.675521 | True | D | 0.530231 | True |
| 4 | marketing | Although the content and quality can be as con... | [Care lines., Direct mail., Inserts., Door to ... | D | gpt-4.1-nano | A | 2.445542 | False | C | 0.289061 | False |


In [None]:
stats_df

| | metric | value |
|---|---|---|
| 0 | router_accuracy | 80.0 |
| 1 | simple_gpt41_accuracy | 90.0 |
| 2 | simple_gpt41_avg_latency | 0.586632 |
| 3 | gpt-4.1-nano_usage_percentage | 70.0 |
| 4 | gpt-4.1_usage_percentage | 30.0 |
| 5 | gpt-4.1-nano_avg_latency | 3.99118 |
| 6 | gpt-4.1_avg_latency | 2.056517 |


From the results:

* Router accuracy (80%) is lower than sending requests directly to GPT-4.1 (90%)
* On the other hand, the router chose GPT-4.1 Nano 70% of the time, which would reduce costs, since GPT-4.1 costs $2.00 per 1M tokens versus $0.2 for GPT-4.1 Nano
* Because we introduced an additional layer, latency increases substantially (about 2 to 4 seconds per routed request versus about 0.6 seconds for a direct call), so a production-ready gateway should account for this and implement latency reduction strategies
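To make the cost point concrete, the blended price can be estimated from the usage shares in the table above and the per-token prices quoted in the routing prompt. This is a rough sketch under two assumptions that the notebook does not verify: every question uses about the same number of tokens, and the extra evaluation call made by the router itself is ignored.

```python
# Prices in USD per 1M tokens, as quoted in the routing prompt.
PRICE_PER_1M = {"gpt-4.1-nano": 0.20, "gpt-4.1": 2.00, "o4-mini": 1.10}

# Share of questions sent to each model in the run above.
usage_share = {"gpt-4.1-nano": 0.70, "gpt-4.1": 0.30}


def blended_price(shares: dict, prices: dict) -> float:
    """Usage-weighted average price per 1M tokens."""
    return sum(share * prices[model] for model, share in shares.items())


routed = blended_price(usage_share, PRICE_PER_1M)
baseline = PRICE_PER_1M["gpt-4.1"]
savings_pct = (1 - routed / baseline) * 100

print(f"routed: ${routed:.2f}/1M tokens, baseline: ${baseline:.2f}/1M tokens")
print(f"estimated saving: {savings_pct:.0f}%")
```

Under these assumptions the routed setup costs roughly $0.74 per 1M tokens versus $2.00, about a 63% saving, at the price of the 10-point accuracy drop. The router's own evaluation call would reduce that saving somewhat.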

# 3 Router Optimisation

The next step was to address all the issues I found during the first iteration. In this step, I implemented the following changes:

* Reduced the size of the prompt and improved it using additional tools
* Removed the analysis part
* Added the ability to handle files, which would also impact model selection
* Added prompt checking to identify more sophisticated prompts
* Handled the error caused by passing temperature to o4-mini

## 3.1 Improved Router

Again, I first developed a version that users can interact with, and then tested it on the MMLU benchmark.

In [None]:
class RoutingDecision(BaseModel):
    """Decision about which LLM to use for the query."""

    model_config = ConfigDict(str_strip_whitespace=True)

    recommended_llm: Literal["gpt-4.1-nano", "gpt-4.1", "o4-mini"]


class EvaluationResponse(BaseModel):
    """Evaluation response containing only the routing decision."""

    routing_decision: RoutingDecision


class RouteLLM:

    def __init__(
        self,
        query: str,
        temperature: float,
        client: OpenAI,
        file_path: Optional[str] = None,
    ) -> None:
        self.query = query
        self.temperature = temperature
        self.client = client
        self.file_path = file_path
        self.file_added = bool(file_path)
        self.file_type = self._get_file_type() if self.file_added else None
        self.xml_true = "<" in query and ">" in query
        self.uploaded_file_id = None

    def _get_file_type(self) -> str:
        """
        Checks the file type and returns the type
        """
        if not self.file_path:
            return ""
        ext = os.path.splitext(self.file_path)[1].lower()
        if ext in [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"]:
            return "image"
        elif ext == ".pdf":
            return "pdf"
        return "file"

    def _encode_file(self) -> Optional[str]:
        """
        Encodes the file and returns the base64 string
        """
        if not self.file_path or not os.path.exists(self.file_path):
            return None

        with open(self.file_path, "rb") as file:
            return base64.b64encode(file.read()).decode("utf-8")

    def _upload_file(self) -> Optional[str]:
        """
        Upload file using Files API and return file ID
        """
        with open(self.file_path, "rb") as file:
            file_upload = self.client.files.create(file=file, purpose="assistants")
            return file_upload.id

    def format_evaluation_prompt(self) -> str:
        enc = tiktoken.get_encoding("o200k_base")
        num_tokens = len(enc.encode(self.query))

        system_prompt = f"""
        ### Context
        Evaluate the user query by emphasizing key parameters: task complexity, creativity level (`{self.temperature}`), required token count (`{num_tokens}`), file details (`{self.file_added}`, `{self.file_type}`)
        and complex queries (`{self.xml_true}`). These factors and cost are essential in guiding the model selection process effectively.

        ### Task
        Assign the Suitable AI Model for the Query:
        - **gpt-4.1-nano**: Ideal for tasks requiring minimal reasoning and creativity, perfect for straightforward, concise responses and summaries. Extremely cost-effective at $0.2/1M tokens. Not good for file interaction (pdf, images).
        - **gpt-4.1**: Best for moderately complex tasks with a need for balanced creativity and reasoning. Handles coding, basic file interactions, and image processing efficiently. Priced at $2.00/1M tokens.
        - **o4-mini**: Suited for highly complex tasks involving deep reasoning and significant creativity. Excels in nuanced, structured analysis and comprehensive file and image processing. Available at $1.10/1M tokens.
        """

        return system_prompt

    def get_evaluation_response(self) -> str:
        """
        Evaluate the user query and return the recommended model
        """
        response = self.client.beta.chat.completions.parse(
            model="gpt-4.1-nano",
            messages=[
                {"role": "system", "content": self.format_evaluation_prompt()},
                {"role": "user", "content": self.query},
            ],
            temperature=0.2,
            response_format=EvaluationResponse,
        )

        validated_response = response.choices[0].message.parsed
        return validated_response.routing_decision.recommended_llm

    def get_gpt_response(self, model_name: str) -> str:
        """
        Gets response from the routed model
        """

        messages = []

        if self.file_path and os.path.exists(self.file_path):
            if self.file_type == "image":
                base64_image = self._encode_file()
                if base64_image:
                    ext = os.path.splitext(self.file_path)[1][1:]
                    messages.append(
                        {
                            "role": "user",
                            "content": [
                                {"type": "text", "text": self.query},
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        "url": f"data:image/{ext};base64,{base64_image}"
                                    },
                                },
                            ],
                        }
                    )
                else:
                    messages.append({"role": "user", "content": self.query})
            elif self.file_type == "pdf":
                if not self.uploaded_file_id:
                    self.uploaded_file_id = self._upload_file()

                messages.append(
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": self.query},
                            {
                                "type": "file",
                                "file": {"file_id": self.uploaded_file_id},
                            },
                        ],
                    }
                )
            else:
                messages.append({"role": "user", "content": self.query})
        else:
            messages.append({"role": "user", "content": self.query})

        params = {
            "model": model_name,
            "messages": messages,
        }

        if not model_name.startswith("o4"):
            params["temperature"] = self.temperature

        response = self.client.chat.completions.create(**params)
        return response.choices[0].message.content

    def orchestrate_response(self) -> str:
        """
        Based on the evaluation response, get the response from the routed model
        """

        recommended_llm = self.get_evaluation_response()
        print(recommended_llm)
        if recommended_llm == "gpt-4.1-nano":
            return self.get_gpt_response("gpt-4.1-nano")
        elif recommended_llm == "gpt-4.1":
            return self.get_gpt_response("gpt-4.1")
        elif recommended_llm == "o4-mini":
            return self.get_gpt_response("o4-mini")


query = input("Enter your query: ")
default_temperature = 0.7

temp_input = input(
    f"Enter the temperature or leave blank for default, default is [{default_temperature}]: "
)

if temp_input.strip() == "":
    temperature = default_temperature
else:
    temperature = float(temp_input)

file_path = input("Enter file path (image/PDF/other, leave blank for none): ")
file_path = file_path.strip() if file_path.strip() else None

llm = RouteLLM(query, temperature, client, file_path)
llm.orchestrate_response()

Enter your query: how to solve this task in best possible way?
Enter the temperature or leave blank for default, default is [0.7]: 0.7
Enter file path (image/PDF/other, leave blank for none): /content/AI Engineer Homework.pdf
gpt-4.1


'Here\'s a step-by-step plan to solve the assignment in the **best possible way**, based on the requirements you provided:\n\n---\n\n## 1. **Understand the Problem and Goal**\n\nYou need to design a proof-of-concept **smart routing feature** for an LLM Gateway. This router will decide which LLM to route a prompt to, based on prompt properties and LLM attributes (cost, speed, feature set, etc.).\n\n---\n\n## 2. **Break Down the Assignment**\n\n### **A. Evaluate Prompt Properties**\n- Identify the key properties of prompts that could affect routing (e.g., prompt length, type of task, required creativity, expected output length, etc.).\n- Map these to LLM attributes (e.g., some LLMs are faster, some are cheaper, some are more capable at reasoning or creativity).\n\n### **B. Create Dataset & Experiment with Fine-Tuning**\n- Build a **small dataset** of prompts labeled with the "best" LLM for each, based on your mapping/criteria.\n- Experiment with a routing model—this could be a simple cla

## 3.2 Improved Router Evaluation

Since I won't pass any files during testing, I simplified the code to only collect model outputs and evaluate the improved prompt.

In [None]:
class RoutingDecision(BaseModel):
    """Decision about which LLM to use for the query."""

    model_config = ConfigDict(str_strip_whitespace=True)

    recommended_llm: Literal["gpt-4.1-nano", "gpt-4.1", "o4-mini"]


class EvaluationResponse(BaseModel):
    """Complete evaluation response containing analysis and routing decision."""

    routing_decision: RoutingDecision


class RouteLLM:

    def __init__(self, query: str, temperature: float, client: OpenAI) -> None:
        self.query = query
        self.temperature = temperature
        self.client = client
        self.xml_true = "<" in query and ">" in query

    def format_evaluation_prompt(self) -> str:
        """
        Format the evaluation prompt based on passed temperature, number of tokens and if the query is complex
        """

        enc = tiktoken.get_encoding("o200k_base")
        num_tokens = len(enc.encode(self.query))

        system_prompt = f"""
        ### Context
        Evaluate the user query by emphasizing key parameters: task complexity, creativity level (`{self.temperature}`), required token count (`{num_tokens}`) and complex queries (`{self.xml_true}`). These factors and cost are essential in guiding the model selection process effectively.

        ### Task
        Assign the Suitable AI Model for the Query:
        - **gpt-4.1-nano**: Ideal for tasks requiring minimal reasoning and creativity, perfect for straightforward, concise responses and summaries. Extremely cost-effective at $0.2/1M tokens. Never use with files.
        - **gpt-4.1**: Best for moderately complex tasks with a need for balanced creativity and reasoning. Handles coding, basic file interactions, and image processing efficiently. Priced at $2.00/1M tokens.
        - **o4-mini**: Suited for highly complex tasks involving deep reasoning and significant creativity. Excels in nuanced, structured analysis and comprehensive file and image processing. Available at $1.10/1M tokens.
       """

        return system_prompt

    def get_evaluation_response(self) -> str:
        """
        Get the routing decision from the model
        """

        response = self.client.beta.chat.completions.parse(
            model="gpt-4.1-nano",
            messages=[
                {"role": "system", "content": self.format_evaluation_prompt()},
                {"role": "user", "content": self.query},
            ],
            temperature=0.2,
            response_format=EvaluationResponse,
        )

        validated_response = response.choices[0].message.parsed
        return validated_response.routing_decision.recommended_llm

    def get_gpt_response(self, model_name: str) -> str:
        """
        Get the response from routed model
        """

        messages = [{"role": "user", "content": self.query}]

        params = {
            "model": model_name,
            "messages": messages,
        }

        if not model_name.startswith("o4"):
            params["temperature"] = self.temperature

        response = self.client.chat.completions.create(**params)
        return response.choices[0].message.content

    def orchestrate_response(self) -> tuple[str, str]:
        """
        Orchestrate the response from the routed model
        """

        recommended_llm = self.get_evaluation_response()
        print(recommended_llm)

        # The recommended name is already one of the supported models,
        # so it can be passed through directly
        response = self.get_gpt_response(recommended_llm)

        return response, recommended_llm

In [None]:
tester = MMLURouterTest()
results = tester.pass_questions()

stats_df = tester.get_accuracy_and_stats()

Processing domain: marketing
gpt-4.1-nano
gpt-4.1-nano
gpt-4.1-nano
gpt-4.1-nano
gpt-4.1-nano
Processing domain: machine_learning
gpt-4.1
gpt-4.1-nano
gpt-4.1
gpt-4.1
gpt-4.1


In [None]:
results.head()

Unnamed: 0,subject,question,choices,true_answer,router_model,router_answer,router_latency,router_correct,simple_gpt41_answer,simple_gpt41_latency,simple_gpt41_correct
0,marketing,_____________ is a natural outcome when combi...,"[Geodemographics, Product differentiation., AN...",A,gpt-4.1-nano,A,1.389133,True,A,0.43357,True
1,marketing,"In an organization, the group of people tasked...","[Outsourcing unit., Procurement centre., Chief...",D,gpt-4.1-nano,D,0.817189,True,D,0.293068,True
2,marketing,Which of the following is an assumption in Ma...,[Needs are dependent on culture and also on so...,B,gpt-4.1-nano,B,0.728709,True,B,0.408871,True
3,marketing,The single group within society that is most v...,[The older consumer who feels somewhat left ou...,D,gpt-4.1-nano,D,1.034269,True,D,0.921034,True
4,marketing,Although the content and quality can be as con...,"[Care lines., Direct mail., Inserts., Door to ...",D,gpt-4.1-nano,A,0.827034,False,C,0.545155,False


In [None]:
stats_df

Unnamed: 0,metric,value
0,router_accuracy,90.0
1,simple_gpt41_accuracy,90.0
2,simple_gpt41_avg_latency,0.488528
3,gpt-4.1-nano_usage_percentage,60.0
4,gpt-4.1_usage_percentage,40.0
5,gpt-4.1-nano_avg_latency,1.020981
6,gpt-4.1_avg_latency,3.36633


From the results:
* Accuracy has improved and is now on par with simply passing everything to GPT-4.1 (90% for both), while 60% of the queries are handled by GPT-4.1 Nano
* Another important thing to evaluate is why o4-mini wasn't selected: whether it's an issue with the prompt, or the tasks simply don't match this model's description
* Latency is still worse than calling GPT-4.1 directly, but usage costs should still be lower since GPT-4.1 Nano was used more often
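
To put a rough number on the cost claim, here is a back-of-the-envelope sketch. It assumes every query consumes the same number of tokens, uses the per-1M-token prices quoted in the routing prompt, and ignores the extra GPT-4.1 Nano call made by the router itself, so treat it as an estimate rather than a measurement:

```python
# Prices per 1M tokens, as quoted in the routing prompt
PRICE_PER_1M = {"gpt-4.1-nano": 0.20, "gpt-4.1": 2.00}

# Share of queries routed to each model, taken from stats_df above
USAGE_SHARE = {"gpt-4.1-nano": 0.60, "gpt-4.1": 0.40}


def blended_price(prices: dict, shares: dict) -> float:
    """Usage-weighted average price per 1M tokens."""
    return sum(prices[model] * share for model, share in shares.items())


routed = blended_price(PRICE_PER_1M, USAGE_SHARE)
baseline = PRICE_PER_1M["gpt-4.1"]
savings = 1 - routed / baseline

print(f"Routed:   ${routed:.2f} per 1M tokens")
print(f"Baseline: ${baseline:.2f} per 1M tokens")
print(f"Estimated savings: {savings:.0%}")
```

Under these assumptions the blended price comes out at about $0.92 per 1M tokens versus $2.00 for sending everything to GPT-4.1.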

In [None]:
print(list(results["question"].unique()))

[' _____________ is a natural outcome when combining demographic and geographic variables.', 'In an organization, the group of people tasked with buying decisions is referred to as the _______________.', " Which of the following is an assumption in Maslow's hierarchy of needs?", 'The single group within society that is most vulnerable to reference group influence is:', 'Although the content and quality can be as controlled as direct mail, response rates of this medium are lower because of the lack of a personal address mechanism. This media format is known as:', 'A 6-sided die is rolled 15 times and the results are: side 1 comes up 0 times; side 2: 1 time; side 3: 2 times; side 4: 3 times; side 5: 4 times; side 6: 5 times. Based on these results, what is the probability of side 3 coming up when using Add-1 Smoothing?', 'Which image data augmentation is most common for natural images?', 'You are reviewing papers for the World’s Fanciest Machine Learning Conference, and you see submissio

Some of these questions could have been sent to the reasoning model; however, its description in the prompt may not be clear enough. Hopefully, fine-tuning will help.

# 4 Finetuning  

We got quite good results from the initial smart router, where I used an LLM to route user queries. Now I will try to fine-tune a small model to see how it performs. Since I don't have many resources, I will use DistilBERT. Before fine-tuning, I first need to create a dataset.

## 4.1 Data Preparation

To create a training set, I will use a Chatbot Arena dataset, which consists of conversations with different models. I will pass this dataset to GPT as a judge, which will decide which class to assign to each query (where the class represents the model). I will also drop violent queries, as well as hieroglyphics and other unrecognized values.
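
A note on the character filter: in Python 3, `\w` is Unicode-aware by default, so a pattern built on `[^\w\s...]` keeps letters from any script and removes only symbols such as emoji. Below is a minimal sketch of an ASCII-only variant, assuming the goal is to strip non-Latin characters as well. The function name `clean_text` and the `ascii_only` switch are hypothetical and not part of the notebook:

```python
import re

# Same character class as the notebook's cleaning step: keep word characters,
# whitespace and basic punctuation, remove everything else.
PATTERN = r"[^\w\s\.,!?;:()\-\'\"]+"


def clean_text(text: str, ascii_only: bool = False) -> str:
    """Remove disallowed characters and collapse whitespace."""
    # re.ASCII restricts \w to [a-zA-Z0-9_]; without it, \w also matches
    # letters from other scripts (CJK, Cyrillic, ...).
    flags = re.ASCII if ascii_only else 0
    cleaned = re.sub(PATTERN, "", text, flags=flags)
    return " ".join(cleaned.split())


sample = "\u4f60\u597d hello, world! \U0001F600"  # two CJK letters and an emoji
unicode_result = clean_text(sample)                 # CJK letters are kept
ascii_result = clean_text(sample, ascii_only=True)  # CJK letters are removed

print(unicode_result)
print(ascii_result)
```

In both modes the emoji is removed; only the ASCII mode also drops the CJK letters.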

In [None]:
class ClassName(BaseModel):
    """Class name for the user query."""

    model_config = ConfigDict(str_strip_whitespace=True)

    class_name: Literal["gpt-4.1-nano", "gpt-4.1", "o4-mini"]


class EvaluationResponse(BaseModel):
    """Complete evaluation response containing class name."""

    class_name: ClassName


class DataLabel:

    def __init__(self, client: OpenAI):
        self.client = client

    def split_query(self, df: pd.DataFrame) -> pd.DataFrame:
        """Extract user and assistant queries from conversation strings."""

        user_queries = []
        chatbot_queries = []

        for conv_str in df["conversation_a_str"]:
            user_query = ""
            chatbot_query = ""

            messages = re.findall(
                r"\{'content':\s*'(.*?)',\s*'role':\s*'(.*?)'\}", conv_str, re.DOTALL
            )

            for content, role in messages:
                if role == "user" and not user_query:
                    user_query = content
                elif role == "assistant" and not chatbot_query:
                    chatbot_query = content

            user_queries.append(user_query)
            chatbot_queries.append(chatbot_query)

        df.loc[:, "user_query"] = user_queries
        df.loc[:, "chatbot_query"] = chatbot_queries
        return df

    def convert_to_string(self, df: pd.DataFrame, column_list: list) -> pd.DataFrame:
        """Convert columns to string type."""

        for column in column_list:
            df.loc[:, column + "_str"] = df[column].astype(str)
        return df

    def clean_text_selective(self, text: str) -> str:
        """Clean text from special characters and new lines."""

        if pd.isna(text):
            return ""

        text = re.sub(r"[^\w\s\.,!?;:()\-\'\"]+", "", str(text))
        return " ".join(text.split()).strip()

    def prepare_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Prepare data for labeling."""

        df = self.convert_to_string(df, ["conversation_a"])

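        df = self.split_query(df)

        # Keep only non-empty user queries; copy so the assignment below
        # does not write into a view of the original frame
        df = df.loc[df["user_query"] != "", ["user_query"]].copy()

        df["user_query"] = df["user_query"].apply(self.clean_text_selective)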

        return df

    def prepare_prompt(self) -> str:
        """Prepare prompt for labeling."""

        system_prompt = """
            <prompt>

            ### Context
            Assess the complexity of the user’s question by considering its need for reasoning, technical detail, creativity, domain knowledge, and step-by-step processes. Select the least expensive model that meets the requirements.

            ### Task
            Determine the appropriate AI model to use:
            gpt-4.1-nano: For straightforward tasks, basic questions, and formatting without deep reasoning. Cost: $0.2/1M tokens.
            gpt-4.1: For advanced analysis, programming, domain expertise, creative outputs, and complex syntheses. Cost: $2.00/1M tokens.
            o4-mini: For multi-step reasoning, strategic planning, analytical tasks, and logic problems. Cost: $1.10/1M tokens.

            Output the chosen model's name only.

            ### Examples
            Query: "What is the capital of France?" → gpt-4.1-nano
            Query: "Format this date: 20240315" → gpt-4.1-nano
            Query: "Is 100 > 50?" → gpt-4.1-nano
            Query: "List 5 colors" → gpt-4.1-nano
            Query: "What is better BMW or Audi?" → gpt-4.1-nano

            Query: "Plan optimal route for 5 deliveries across town" → o4-mini
            Query: "Solve: If A→B and B→C, what about A→C?" → o4-mini
            Query: "Analyze pros/cons of remote work policy" → o4-mini
            Query: "How to cross river with fox, chicken, grain?" → o4-mini

            Query: "Write binary search tree in Python" → gpt-4.1
            Query: "Explain quantum computing impact on cryptography" → gpt-4.1
            Query: "Analyze Renaissance art influence on modern design" → gpt-4.1
            Query: "Design machine learning pipeline for fraud detection" → gpt-4.1
            </prompt>
        """

        return system_prompt

    def get_class(self, df: pd.DataFrame) -> list:
        """Get class name for the user query."""

        class_names = []
        for idx, row in df.iterrows():
            response = self.client.beta.chat.completions.parse(
                model="gpt-4.1",
                messages=[
                    {"role": "system", "content": self.prepare_prompt()},
                    {"role": "user", "content": row["user_query"]},
                ],
                temperature=0.2,
                response_format=EvaluationResponse,
            )

            if (
                response.choices[0].message.parsed
                and response.choices[0].message.parsed.class_name
            ):
                class_name = response.choices[0].message.parsed.class_name.class_name
            else:
                class_name = None

            class_names.append(class_name)

        df.loc[:, "class"] = class_names
        return df

    def run_class_label(self, df: pd.DataFrame) -> pd.DataFrame:
        """Run class label for the user query."""

        df = self.prepare_data(df)
        df = self.get_class(df)

        return df


data_label = DataLabel(client)
# Work on a copy so the helper columns are not written into a slice of the source frame
sample_df = chat_bot_arena_df.head(9000).copy()
result_df = data_label.run_class_label(sample_df)

print(result_df[["user_query", "class"]])

As a result, I got around 8k classified rows; the dataset is available in the Git repository. I will use this dataset to fine-tune the model.
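
One caveat on the extraction step: Python's `repr` switches to double quotes when a string contains an apostrophe, so a single-quote regex over the stringified conversation skips messages such as "What's the weather?". Below is a minimal sketch of an alternative that reads the first message of each role directly, assuming each conversation is an iterable of dicts with `content` and `role` keys. The helper name `first_message` and the sample conversation are hypothetical:

```python
import re
from typing import Iterable, Mapping


def first_message(conversation: Iterable[Mapping[str, str]], role: str) -> str:
    """Return the content of the first message with the given role, or ""."""
    for message in conversation:
        if message.get("role") == role:
            return message.get("content", "")
    return ""


# Hypothetical conversation in the assumed format
conversation = [
    {"content": "What's the difference between OpenCL and CUDA?", "role": "user"},
    {"content": "OpenCL is an open standard, CUDA is NVIDIA-specific.", "role": "assistant"},
]

user_query = first_message(conversation, "user")
chatbot_query = first_message(conversation, "assistant")

# For comparison: the single-quote regex applied to the stringified conversation
# misses the user turn, because its content is rendered with double quotes.
pattern = r"\{'content':\s*'(.*?)',\s*'role':\s*'(.*?)'\}"
regex_roles = [role for _, role in re.findall(pattern, str(conversation), re.DOTALL)]

print(user_query)
print(regex_roles)
```

Reading the structure directly also avoids the string conversion step, at the cost of depending on how the dataset column is stored.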

## 4.2 DistilBERT Finetuning

Main components:

1. RouterDataset class:
   * A custom dataset that holds the tokenized text queries and their labels (which model should handle them)
2. DataPreparation class:
   * Cleans the data by removing empty and duplicate queries and mapping model names to numbers (0, 1, 2)
   * Splits data into training and validation sets
   * Converts text queries into tokens that DistilBERT can understand
   * Calculates class weights to handle imbalanced data (if one model type appears much more often than others)
3. WeightedTrainer class:
   * A custom trainer that uses weighted loss to better handle cases where some classes (models) appear more frequently than others in the training data
4. Main training function:
   * Sets up DistilBERT with 3 output classes (one for each model)
   * Trains the model for 3 epochs with specific learning settings
   * Uses early stopping to prevent overfitting
   * Saves the trained model and evaluates its performance
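
The class-weighting step can be illustrated by hand. With the `"balanced"` option, scikit-learn's `compute_class_weight` uses `n_samples / (n_classes * count_per_class)`. Below is a minimal sketch in plain Python. It uses the validation-set supports from the classification report further down (957 / 436 / 236) as illustrative counts; since the split is stratified, the training proportions should be close to these. The function name `balanced_class_weights` is hypothetical:

```python
from collections import Counter


def balanced_class_weights(labels: list) -> dict:
    """Mirror the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())
    }


# Illustrative label counts: 0 = gpt-4.1-nano, 1 = gpt-4.1, 2 = o4-mini
labels = [0] * 957 + [1] * 436 + [2] * 236
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

The rarest class (o4-mini) receives the largest weight, so its misclassifications contribute more to the loss.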

In [11]:
class RouterDataset(Dataset):
    """Custom Dataset for model routing"""

    def __init__(self, tokenized_data):
        self.input_ids = tokenized_data["input_ids"]
        self.attention_mask = tokenized_data["attention_mask"]
        self.labels = tokenized_data["labels"]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.labels[idx],
        }


class DataPreparation:
    """Handles all data preparation steps for router training"""

    def __init__(
        self,
        df: pd.DataFrame,
        tokenizer_name: str = "distilbert-base-cased",
        max_length: int = 128,
    ):
        self.df = df.copy()
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length

        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

        self.train_df = None
        self.val_df = None
        self.train_dataset = None
        self.val_dataset = None
        self.class_weights = None
        self.label_mapping = {"gpt-4.1-nano": 0, "gpt-4.1": 1, "o4-mini": 2}

    def clean_data(self) -> pd.DataFrame:
        """Clean and prepare data for training"""

        self.df = self.df.dropna(subset=["class", "user_query"])

        self.df = self.df[self.df["user_query"].str.strip() != ""]
        self.df = self.df.drop_duplicates(subset=["user_query"], keep="first")

        self.df["class"] = self.df["class"].map(self.label_mapping)

        self.df = self.df.dropna(subset=["class"])
        self.df["class"] = self.df["class"].astype(int)

        return self.df

    def split_data(
        self, test_size: float = 0.2, random_state: int = 42
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        """Split data into train and validation sets"""

        self.train_df, self.val_df = train_test_split(
            self.df,
            test_size=test_size,
            random_state=random_state,
            stratify=self.df["class"],
        )

        self.train_df = self.train_df.reset_index(drop=True)
        self.val_df = self.val_df.reset_index(drop=True)

        return self.train_df, self.val_df

    def calculate_class_weights(self) -> dict:
        """Calculate class weights from training data only"""

        classes = np.unique(self.train_df["class"])
        weights = compute_class_weight(
            "balanced", classes=classes, y=self.train_df["class"]
        )
        self.class_weights = dict(zip(classes, weights))
        return self.class_weights

    def tokenize_data(self, df: pd.DataFrame) -> dict:
        """Tokenize text data"""

        queries = df["user_query"].tolist()

        encoded = self.tokenizer(
            queries,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
            add_special_tokens=True,
        )

        return {
            "input_ids": encoded["input_ids"],
            "attention_mask": encoded["attention_mask"],
            "labels": torch.tensor(df["class"].values, dtype=torch.long),
        }

    def prepare_data(self) -> tuple[RouterDataset, RouterDataset]:
        """Complete data preparation pipeline"""

        self.clean_data()
        self.split_data()
        self.calculate_class_weights()

        train_tokenized = self.tokenize_data(self.train_df)
        val_tokenized = self.tokenize_data(self.val_df)

        self.train_dataset = RouterDataset(train_tokenized)
        self.val_dataset = RouterDataset(val_tokenized)

        return self.train_dataset, self.val_dataset

    def create_dataloaders(
        self, batch_size: int = 16, num_workers: int = 0
    ) -> tuple[DataLoader, DataLoader]:
        """Create PyTorch dataloaders"""

        if self.train_dataset is None or self.val_dataset is None:
            self.prepare_data()

        self.train_loader = DataLoader(
            self.train_dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=num_workers,
            pin_memory=torch.cuda.is_available(),
        )

        self.val_loader = DataLoader(
            self.val_dataset,
            batch_size=batch_size,
            shuffle=False,
            num_workers=num_workers,
            pin_memory=torch.cuda.is_available(),
        )

        return self.train_loader, self.val_loader


class WeightedTrainer(Trainer):
    """Custom trainer with weighted loss for imbalanced classes"""

    def __init__(self, class_weights=None, *args: Any, **kwargs: Any) -> None:
        super().__init__(*args, **kwargs)

        if class_weights is not None:
            num_classes = len(class_weights)
            weight_list = [class_weights.get(i, 1.0) for i in range(num_classes)]
            self.class_weights = torch.tensor(weight_list, dtype=torch.float32)
        else:
            self.class_weights = None

    def compute_loss(
        self,
        model: torch.nn.Module,
        inputs: Dict[str, torch.Tensor],
        return_outputs: bool = False,
        **kwargs: Any,
    ) -> torch.Tensor:
        """Compute weighted cross-entropy loss"""

        labels = inputs.pop("labels")

        outputs = model(**inputs)
        logits = outputs.logits

        if self.class_weights is not None:
            weights = self.class_weights.to(model.device)
            loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        else:
            loss_fct = torch.nn.CrossEntropyLoss()

        loss = loss_fct(logits, labels)

        return (loss, outputs) if return_outputs else loss


def compute_metrics(eval_pred) -> dict:
    """Compute metrics for evaluation"""

    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted"
    )
    accuracy = accuracy_score(labels, predictions)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def train_router_model(
    df: pd.DataFrame, output_dir: str = "./router_model"
) -> tuple[WeightedTrainer, DataPreparation]:
    """Main training function"""

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    data_prep = DataPreparation(
        df=df, tokenizer_name="distilbert-base-cased", max_length=128
    )

    train_dataset, val_dataset = data_prep.prepare_data()
    class_weights = data_prep.class_weights

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-cased",
        num_labels=3,
        output_attentions=False,
        output_hidden_states=False,
        dropout=0.3,
        attention_dropout=0.1,
    )

    model = model.to(device)

    data_collator = DataCollatorWithPadding(tokenizer=data_prep.tokenizer)

    training_args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=32,
        weight_decay=0.1,
        warmup_steps=100,
        logging_dir=f"{output_dir}/logs",
        logging_steps=10,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        save_total_limit=2,
        fp16=torch.cuda.is_available(),
        report_to=["tensorboard"],
        push_to_hub=False,
    )

    trainer = WeightedTrainer(
        class_weights=class_weights,
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[
            EarlyStoppingCallback(
                early_stopping_patience=2, early_stopping_threshold=0.01
            )
        ],
    )

    trainer.train()

    trainer.save_model()
    data_prep.tokenizer.save_pretrained(output_dir)

    eval_results = trainer.evaluate()
    print("\nValidation Results:")
    for key, value in eval_results.items():
        print(f"{key}: {value:.4f}")

    predictions = trainer.predict(val_dataset)
    predicted_labels = np.argmax(predictions.predictions, axis=1)
    true_labels = predictions.label_ids

    reverse_label_mapping = {v: k for k, v in data_prep.label_mapping.items()}
    class_names = [
        reverse_label_mapping[i] for i in range(len(data_prep.label_mapping))
    ]

    print("\nClassification Report:")
    print(
        classification_report(
            true_labels, predicted_labels, target_names=class_names, digits=4
        )
    )

    return trainer, data_prep


print("Sample DataFrame:")
print(model_train_df)
print(f"\nShape: {model_train_df.shape}")

trainer, data_prep = train_router_model(model_train_df, output_dir="./router_model")

Sample DataFrame:
                                             user_query         class
0       What is the difference between OpenCL and CUDA?  gpt-4.1-nano
1     Why did my parent not invite me to their wedding?  gpt-4.1-nano
2                      Fuji vs. Nikon, which is better?  gpt-4.1-nano
3                   How to build an arena for chatbots?       gpt-4.1
4                                     When is it today?  gpt-4.1-nano
...                                                 ...           ...
8174  I want you to act as a linux terminal. I will ...  gpt-4.1-nano
8175  What is the funniest line from a book that you...       gpt-4.1
8176  When I was young, I thought I knew everything ...  gpt-4.1-nano
8177               why is Inter a better font than Stag       o4-mini
8178  Create a sales training complete with exercise...       gpt-4.1

[8179 rows x 2 columns]

Shape: (8179, 2)
Using device: cuda


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.6525,0.671749,0.780233,0.78077,0.780233,0.780422
2,0.4139,0.602251,0.790669,0.799948,0.790669,0.793987
3,0.4856,0.597643,0.790055,0.805427,0.790055,0.794775



Validation Results:
eval_loss: 0.5976
eval_accuracy: 0.7901
eval_precision: 0.8054
eval_recall: 0.7901
eval_f1: 0.7948
eval_runtime: 1.4037
eval_samples_per_second: 1160.4700
eval_steps_per_second: 36.3310
epoch: 3.0000

Classification Report:
              precision    recall  f1-score   support

gpt-4.1-nano     0.8976    0.8056    0.8491       957
     gpt-4.1     0.7233    0.7913    0.7558       436
     o4-mini     0.5836    0.7246    0.6465       236

    accuracy                         0.7901      1629
   macro avg     0.7348    0.7738    0.7505      1629
weighted avg     0.8054    0.7901    0.7948      1629



From the results:
* Overall, we got decent results with a small model and an imbalanced dataset, but there is a clear issue with the o4-mini class (precision 0.58), so we would need to collect more samples for it
* In addition, this approach to fine-tuning is not ideal, since the training data could differ from the queries of our target users. Also, as I wrote before, results could be improved by first predicting domain and complexity and only then moving to the model classification task
* Another possibility is to move in the direction of RouteLLM, where the class is predicted and the probability of the likeliest model is calculated, rather than predicting the exact model. However, their approach considers only two models, so it would need to be extended
* Due to an insufficient amount of data, the model overfits with a larger number of epochs. To address this, the data could be augmented or a bigger dataset could be created
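
To see why the weighted F1 (0.7948) looks better than the macro F1 (0.7505), both averages can be recomputed from the per-class rows of the report. A short sketch using only the numbers printed above:

```python
# (f1-score, support) per class, copied from the classification report
report = {
    "gpt-4.1-nano": (0.8491, 957),
    "gpt-4.1": (0.7558, 436),
    "o4-mini": (0.6465, 236),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

The weighted score is pulled up by the majority class, gpt-4.1-nano. Because the best checkpoint is selected on the weighted F1, the minority o4-mini class has limited influence on that choice.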

## 4.3 DistilBERT As Router Evaluation

Lastly, we need to evaluate the performance and see whether there is any improvement in terms of latency, model selection, accuracy, and overall results.

In [None]:
class RouteLLM:

    def __init__(
        self,
        model_path: str,
        query: str,
        temperature: float,
        model,
        tokenizer,
        client: OpenAI,
    ) -> None:
        self.model_path = model_path
        self.query = query
        self.temperature = temperature
        self.model = model
        self.tokenizer = tokenizer
        self.client = client
        self.xml_true = "<" in query and ">" in query
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def predict_single_query(self, query: str) -> tuple[str, float]:
        """Make prediction for a single query"""

        inputs = self.tokenizer(
            query, padding=True, truncation=True, max_length=128, return_tensors="pt"
        )

        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        self.model.eval()
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_class = torch.argmax(predictions, dim=-1)

        label_names = {0: "gpt-4.1-nano", 1: "gpt-4.1", 2: "o4-mini"}
        predicted_label = label_names[predicted_class.item()]
        confidence = predictions[0][predicted_class].item()

        return predicted_label, confidence

    def get_evaluation_response(self) -> str:
        """
        Get the predicted model
        """

        predicted_model, confidence = self.predict_single_query(self.query)
        return predicted_model

    def get_gpt_response(self, model_name: str) -> str:
        """
        Get the response from the routed model
        """

        messages = [{"role": "user", "content": self.query}]

        params = {
            "model": model_name,
            "messages": messages,
        }

        if not model_name.startswith("o4"):
            params["temperature"] = self.temperature

        response = self.client.chat.completions.create(**params)
        return response.choices[0].message.content

    def orchestrate_response(self) -> tuple[str, str]:
        """
        Orchestrate the response from the routed model
        """

        recommended_llm = self.get_evaluation_response()
        # The predicted label is already one of the supported model names,
        # so it can be passed through directly
        response = self.get_gpt_response(recommended_llm)

        return response, recommended_llm
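
`predict_single_query` returns a confidence score that the router currently discards. One way it could be used, in the spirit of the probability-based routing mentioned earlier, is to fall back to a safe default when the classifier is unsure. This is a sketch and not RouteLLM's actual method. The function name `route_with_fallback`, the `threshold` value, and the choice of `gpt-4.1` as fallback are assumptions that would need tuning:

```python
import math

LABELS = ("gpt-4.1-nano", "gpt-4.1", "o4-mini")


def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_with_fallback(
    logits: list, threshold: float = 0.6, fallback: str = "gpt-4.1"
) -> tuple:
    """Pick the most likely model, or the fallback if confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < threshold:
        return fallback, confidence
    return LABELS[best], confidence


confident_choice = route_with_fallback([4.0, 0.0, 0.0])  # clear winner
unsure_choice = route_with_fallback([0.2, 0.1, 0.0])     # nearly uniform scores

print(confident_choice)
print(unsure_choice)
```

A threshold like this trades some cost savings for fewer misroutes; a sensible value would have to be chosen on a held-out set.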

In [None]:
class MMLURouterTest:
    def __init__(self, model, tokenizer, temperature: float = 0.2):
        self.temperature = temperature
        self.client = client
        self.model = model
        self.tokenizer = tokenizer
        self.choices = ["A", "B", "C", "D"]
        self.domains = ["marketing", "machine_learning"]
        self.results_df = pd.DataFrame()

    def get_mmlu_data(self, domain: str) -> pd.DataFrame:
        """
        Get the MMLU data for marketing and machine learning domains
        """

        mmlu_df = pd.DataFrame(load_dataset("cais/mmlu", f"{domain}")["dev"])
        samples_df = mmlu_df.head(20)
        return samples_df

    def get_query(self, row: pd.Series) -> str:
        """
        Create a query for the MMLU data
        """

        subject = row["subject"]
        question = row["question"]
        choices = row["choices"]

        formatted_choices = ""
        choice_letters = ["A", "B", "C", "D"]
        for i, choice in enumerate(choices):
            formatted_choices += f"{choice_letters[i]}. {choice}\n"

        formatted_query = f"""The following are questions (with answers) about {subject}.

        {question}
        {formatted_choices.strip()}
        Answer with only the letter (A, B, C, or D):"""

        return formatted_query

    def get_simple_gpt41_response(
        self, query: str, temperature: Optional[float] = None
    ) -> Tuple[str, float]:
        """
        Get the response from the simple GPT-4.1 model
        """

        start_time = time.time()
        response = self.client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": query}],
            temperature=temperature,
            max_tokens=1,
        )
        end_time = time.time()
        latency = end_time - start_time

        return response.choices[0].message.content.strip(), latency

    def get_accuracy_and_stats(self):
        """Return accuracy and model usage stats as DataFrame including latency"""

        if self.results_df.empty:
            return pd.DataFrame()

        router_correct = (
            self.results_df["router_answer"] == self.results_df["true_answer"]
        ).sum()
        simple_correct = (
            self.results_df["simple_gpt41_answer"] == self.results_df["true_answer"]
        ).sum()
        total = len(self.results_df)

        router_accuracy = (router_correct / total) * 100
        simple_accuracy = (simple_correct / total) * 100

        model_counts = self.results_df["router_model"].value_counts()
        model_percentages = (model_counts / total * 100).to_dict()

        avg_latencies = {}
        for model in self.results_df["router_model"].unique():
            model_latencies = self.results_df[self.results_df["router_model"] == model][
                "router_latency"
            ]
            avg_latencies[f"{model}_avg_latency"] = model_latencies.mean()

        simple_avg_latency = self.results_df["simple_gpt41_latency"].mean()

        metrics = [
            "router_accuracy",
            "simple_gpt41_accuracy",
            "simple_gpt41_avg_latency",
        ]
        values = [router_accuracy, simple_accuracy, simple_avg_latency]

        metrics.extend(
            [f"{model}_usage_percentage" for model in model_percentages.keys()]
        )
        values.extend(list(model_percentages.values()))

        metrics.extend(list(avg_latencies.keys()))
        values.extend(list(avg_latencies.values()))

        stats_data = {"metric": metrics, "value": values}

        return pd.DataFrame(stats_data)

    def get_detailed_stats(self):
        """Return detailed statistics as a dictionary including latency stats"""

        if self.results_df.empty:
            return {}

        router_accuracy = self.calculate_accuracy(
            self.results_df["router_answer"], self.results_df["true_answer"]
        )
        simple_gpt41_accuracy = self.calculate_accuracy(
            self.results_df["simple_gpt41_answer"], self.results_df["true_answer"]
        )

        latency_stats = {}
        for model in self.results_df["router_model"].unique():
            model_latencies = self.results_df[self.results_df["router_model"] == model][
                "router_latency"
            ]
            latency_stats[model] = {
                "avg_latency": model_latencies.mean(),
            }

        simple_latencies = self.results_df["simple_gpt41_latency"]
        latency_stats["simple_gpt41"] = {
            "avg_latency": simple_latencies.mean(),
        }

        return {
            "overall_accuracy": {
                "router": router_accuracy,
                "simple_gpt41": simple_gpt41_accuracy,
                "difference": router_accuracy - simple_gpt41_accuracy,
            },
            "router_model_usage": self.get_router_model_stats(
                self.results_df["router_model"]
            ),
            "latency_statistics": latency_stats,
            "total_questions": len(self.results_df),
            "correct_answers": {
                "router": sum(self.results_df["router_correct"]),
                "simple_gpt41": sum(self.results_df["simple_gpt41_correct"]),
            },
        }

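    def pass_questions(self):
        """
        Send every question in each domain through the router and through
        the plain GPT-4.1 baseline, recording answers, correctness and latency
        """
        all_results = []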

        for domain in self.domains:
            print(f"Processing domain: {domain}")
            df = self.get_mmlu_data(domain)

            for index, row in df.iterrows():
                query = self.get_query(row)

                start_time = time.time()
                router = RouteLLM(
                    "", query, self.temperature, self.model, self.tokenizer, self.client
                )
                router_answer, router_model = router.orchestrate_response()
                end_time = time.time()
                router_latency = end_time - start_time

                simple_gpt41_answer, simple_gpt41_latency = (
                    self.get_simple_gpt41_response(query, self.temperature)
                )

                true_answer = self.choices[row["answer"]]
                router_correct = router_answer == true_answer
                simple_gpt41_correct = simple_gpt41_answer == true_answer

                result = {
                    "subject": row["subject"],
                    "question": row["question"],
                    "choices": row["choices"],
                    "true_answer": true_answer,
                    "router_model": router_model,
                    "router_answer": router_answer,
                    "router_latency": router_latency,
                    "router_correct": router_correct,
                    "simple_gpt41_answer": simple_gpt41_answer,
                    "simple_gpt41_latency": simple_gpt41_latency,
                    "simple_gpt41_correct": simple_gpt41_correct,
                }
                all_results.append(result)

        self.results_df = pd.DataFrame(all_results)
        return self.results_df

    def calculate_accuracy(self, predictions, true_answers):
        """Helper method to calculate accuracy"""

        return (predictions == true_answers).mean() * 100

    def get_router_model_stats(self, router_models):
        """Helper method to get router model usage statistics"""

        return router_models.value_counts(normalize=True).mul(100).to_dict()


tester = MMLURouterTest(trainer.model, data_prep.tokenizer)
results = tester.pass_questions()
stats_df = tester.get_accuracy_and_stats()

Processing domain: marketing
Processing domain: machine_learning


In [None]:
results.head()

Unnamed: 0,subject,question,choices,true_answer,router_model,router_answer,router_latency,router_correct,simple_gpt41_answer,simple_gpt41_latency,simple_gpt41_correct
0,marketing,_____________ is a natural outcome when combi...,"[Geodemographics, Product differentiation., AN...",A,gpt-4.1,A,0.522444,True,A,0.549294,True
1,marketing,"In an organization, the group of people tasked...","[Outsourcing unit., Procurement centre., Chief...",D,o4-mini,D,1.941609,True,D,0.360512,True
2,marketing,Which of the following is an assumption in Ma...,[Needs are dependent on culture and also on so...,B,o4-mini,B,1.895787,True,B,1.126806,True
3,marketing,The single group within society that is most v...,[The older consumer who feels somewhat left ou...,D,o4-mini,D,2.299963,True,D,1.210254,True
4,marketing,Although the content and quality can be as con...,"[Care lines., Direct mail., Inserts., Door to ...",D,gpt-4.1-nano,A,0.231174,False,C,0.303575,False


In [None]:
stats_df

Unnamed: 0,metric,value
0,router_accuracy,90.0
1,simple_gpt41_accuracy,80.0
2,simple_gpt41_avg_latency,0.688616
3,o4-mini_usage_percentage,80.0
4,gpt-4.1_usage_percentage,10.0
5,gpt-4.1-nano_usage_percentage,10.0
6,gpt-4.1_avg_latency,0.522444
7,o4-mini_avg_latency,3.384189
8,gpt-4.1-nano_avg_latency,0.231174
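
A note on sample size before reading the numbers: the 10% usage shares for gpt-4.1 and gpt-4.1-nano match single rows in `results.head()`, which suggests the run covered 10 questions. Under that assumption, 90% vs. 80% accuracy is a difference of one question. The sketch below is not part of the original evaluation; it uses a standard Wilson score interval (the helper name `wilson_interval` is hypothetical) to show how wide the uncertainty is at this sample size.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """95% Wilson score interval for a proportion (z=1.96)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Assumption: 10 questions in total, 9 correct for the router, 8 for plain GPT-4.1
router_lo, router_hi = wilson_interval(9, 10)
simple_lo, simple_hi = wilson_interval(8, 10)

print(f"router:       {router_lo:.2f} to {router_hi:.2f}")
print(f"simple_gpt41: {simple_lo:.2f} to {simple_hi:.2f}")
```

The two intervals overlap heavily, so a run of this size cannot separate the router from the baseline on accuracy alone.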


From Results:
* This router matched gpt-4.1-nano on accuracy and outscored plain gpt-4.1 (90% vs. 80%). Since plain gpt-4.1 also reached 90% accuracy in previous runs, I won't treat that gap as meaningful.
* Unlike the previous router variants, this one chose o4-mini most of the time (80%). That is surprising, because in the training data gpt-4.1-nano was the label for about 59% of queries (4,784/8,142), gpt-4.1 for 27% (2,179), and o4-mini for only 14% (1,179). The router essentially flipped the distribution that GPT-4.1 suggested during labeling. My assumption is that MMLU is not the best dataset for this evaluation: every question is academically heavy, so the router learned to prefer o4-mini over gpt-4.1-nano, which was deemed appropriate for most training queries.
* Cost effectiveness is still high and accuracy is good, but the main problem is latency, which has increased now that o4-mini handles most queries (about 3.4 s on average for o4-mini versus about 0.7 s for plain gpt-4.1). The usage pattern shifted from mostly gpt-4.1-nano in the training labels to 80% o4-mini in evaluation.
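
To put a number on the shift described in the bullets above, the sketch below compares the training label shares with the routing shares from `stats_df`. The counts and percentages are copied from this notebook; the helper names (`label_shares`, `total_variation`) are hypothetical, and total variation distance is just one convenient way to summarize the gap, not something the original evaluation computed.

```python
# Training label counts from the GPT-4.1 labeling step (quoted above)
train_counts = {"gpt-4.1-nano": 4784, "gpt-4.1": 2179, "o4-mini": 1179}

# Router usage on the MMLU sample, in percent (from stats_df)
eval_usage = {"gpt-4.1-nano": 10.0, "gpt-4.1": 10.0, "o4-mini": 80.0}


def label_shares(counts):
    """Convert raw counts into percentage shares."""
    total = sum(counts.values())
    return {model: 100 * n / total for model, n in counts.items()}


def total_variation(p, q):
    """Half the summed absolute difference between two share dicts (in points)."""
    return 0.5 * sum(abs(p[model] - q[model]) for model in p)


train_shares = label_shares(train_counts)
shift = total_variation(train_shares, eval_usage)

for model in train_counts:
    print(f"{model:>13}: train {train_shares[model]:5.1f}%  eval {eval_usage[model]:5.1f}%")
print(f"total variation distance: {shift:.1f} percentage points")
```

A distance this large means roughly two thirds of the routing mass moved to a different model between training labels and evaluation, which is consistent with the domain-mismatch explanation.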

# Conclusion

Final Conclusion:

* The improved LLM-based router using GPT-4.1-nano performs best overall, achieving high accuracy with good latency and cost.
* The fine-tuned DistilBERT approach shows promise but adds latency, because the domain mismatch between the training data and the MMLU evaluation pushes most queries to o4-mini.
* Training data must match the target use cases: the router preferred o4-mini for academic content even though it was trained on general queries where GPT-4.1-nano was the most common label.
* Next steps include reducing latency through caching, exploring hierarchical classification (domain → complexity → model), and fine-tuning larger 4-8B parameter models.
* Implementing RouterLLM techniques for multi-model scenarios and hybrid approaches could provide a better balance of accuracy, latency, and cost. Future iterations could also explore multi-task learning, training one model on query classification, domain detection, and complexity assessment, and have the router output probability scores across all three models alongside the classification for better routing decisions.
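
As an illustration of the last point, here is a minimal sketch of routing on probability scores instead of a hard class label. Everything in it is an assumption for illustration: the label order in `MODELS`, the 0.6 threshold, the choice of gpt-4.1 as the fallback, and the function name `route_with_confidence`. It takes raw classifier logits as a plain list so it runs without the fine-tuned model.

```python
import math

# Assumed label order of the classifier head (hypothetical)
MODELS = ["gpt-4.1-nano", "gpt-4.1", "o4-mini"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_with_confidence(logits, threshold=0.6, fallback="gpt-4.1"):
    """Return (chosen_model, per-model probabilities).

    The top-scoring model is used only when its probability reaches the
    threshold; otherwise the query goes to the fallback model.
    """
    scores = dict(zip(MODELS, softmax(logits)))
    best = max(scores, key=scores.get)
    chosen = best if scores[best] >= threshold else fallback
    return chosen, scores


confident_choice, confident_scores = route_with_confidence([2.0, 0.1, 0.0])
unsure_choice, unsure_scores = route_with_confidence([0.2, 0.1, 0.0])
print(confident_choice, unsure_choice)
```

Exposing the scores also makes it possible to log low-confidence queries and use them to extend the training set toward the target domain.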