## Choosing the right reasoning model and reasoning effort for your use case

Reasoning models, such as OpenAI’s o1 and o3-mini, are advanced language models trained with reinforcement learning to enhance complex reasoning. They generate a detailed internal thought process before responding, making them highly effective in problem-solving, coding, scientific reasoning, and multi-step planning for agentic workflows.

In this cookbook, we walk through an eval-based quantitative analysis to help you choose the right reasoning model and reasoning effort for your use case.

This is a three-step process:

1. Build Your Evaluation Dataset
2. Build a Pipeline to evaluate the reasoning model and capture metrics 
3. Choose the model/parameter based on cost/performance trade-off 

### Step 1: Build Your Evaluation Dataset 

For this example, we will use the AI2-ARC dataset (ARC-Challenge subset). Each record has these fields:

- `id`: a string feature.
- `question`: a string feature.
- `choices`: a dictionary feature containing:
  - `text`: a string feature.
  - `label`: a string feature.
- `answerKey`: a string feature.

In [5]:
import requests

url = "https://huggingface.co/datasets/allenai/ai2_arc/resolve/main/ARC-Challenge/test-00000-of-00001.parquet"
response = requests.get(url)
with open("test-00000-of-00001.parquet", "wb") as f:
    f.write(response.content)

In [7]:
import json
import pandas as pd

# Set Pandas options to display full text in cells
pd.set_option('display.max_colwidth', None)

# Reads the Parquet file into a DataFrame.
df = pd.read_parquet("test-00000-of-00001.parquet")

# Convert the first row to a dictionary.
row_dict = df.head(1).iloc[0].to_dict()

# Pretty-print the row as a JSON string with an indentation of 4 spaces.
# The default lambda converts non-serializable objects (like numpy arrays) to lists.
print(json.dumps(row_dict, indent=4, default=lambda o: o.tolist() if hasattr(o, 'tolist') else o))

{
    "id": "Mercury_7175875",
    "question": "An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?",
    "choices": {
        "text": [
            "Planetary density will decrease.",
            "Planetary years will become longer.",
            "Planetary days will become shorter.",
            "Planetary gravity will become stronger."
        ],
        "label": [
            "A",
            "B",
            "C",
            "D"
        ]
    },
    "answerKey": "C"
}
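A row like the one above still has to be turned into prompt text before it is sent to a model. As a minimal sketch (the helper name `format_question` is hypothetical and not part of the original notebook), one way to pair each label with its choice text is:

```python
def format_question(question: str, choices: dict) -> str:
    """Render an ARC row as a prompt with one labeled choice per line.

    `format_question` is an illustrative helper, not part of the original
    pipeline. `choices` is expected to hold parallel "label" and "text"
    sequences, as in the sample row above.
    """
    lines = [question.strip(), ""]
    for label, text in zip(choices["label"], choices["text"]):
        lines.append(f"{label}. {text}")
    return "\n".join(lines)


# Sample row copied from the output above.
sample = {
    "question": "An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?",
    "choices": {
        "text": [
            "Planetary density will decrease.",
            "Planetary years will become longer.",
            "Planetary days will become shorter.",
            "Planetary gravity will become stronger.",
        ],
        "label": ["A", "B", "C", "D"],
    },
}
print(format_question(sample["question"], sample["choices"]))
```

A layout like this keeps each label next to its text, which may be easier for a model to read than the raw dictionary representation.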


In [36]:
# Display the total number of rows in the dataset
total_rows = len(df)
print(f"Total number of rows in the dataset: {total_rows}")


Total number of rows in the dataset: 1172


### Step 2: Build a Pipeline to evaluate the reasoning model and capture metrics 

Let's write a Python function that sends a question to the reasoning model and captures the answer, token usage, and latency.


In [33]:
import time
from openai import OpenAI

# Initialize the OpenAI client
client = OpenAI()


def response_with_reasoning_effort(model: str, question: str, reasoning_effort: str):
    """
    Send a question to the OpenAI model with a given reasoning effort level.

    Parameters:
        model (str): The name of the model.
        question (str): The input prompt.
        reasoning_effort (str): The reasoning effort level ("low", "medium", or "high").

    Returns:
        answer (str): The model's answer.
        usage: The usage object containing token counts.
        duration (float): Time taken for the API call.
    """

    start_time = time.time()

    # API Call
    response = client.chat.completions.create(
        model=model,
        reasoning_effort=reasoning_effort,
        messages=[
            {"role": "system", "content": "You are a helpful assistant that provides answers to multiple choice questions. Reply only with the letter of the correct answer choice."},
            {"role": "user", "content": question}
        ]
    )

    end_time = time.time()

    # Extract answer from response.
    answer = response.choices[0].message.content.strip()
    usage = response.usage  # Token counts; reasoning tokens are under completion_tokens_details.reasoning_tokens.

    return answer, usage, (end_time - start_time)
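
The usage object nests the reasoning token count one level down, and that nested object may be missing on some responses. As a hedged sketch, a small flattening step can make the later bookkeeping more robust. The helper name `usage_to_counts` is hypothetical, and the stand-in objects below only mimic the attribute layout assumed here; they are not real API responses.

```python
from types import SimpleNamespace


def usage_to_counts(usage) -> dict:
    """Flatten a usage object into plain integers.

    Hypothetical helper. Assumes `usage` exposes prompt_tokens,
    completion_tokens, and total_tokens, plus an optional
    completion_tokens_details.reasoning_tokens; missing fields become 0.
    """
    details = getattr(usage, "completion_tokens_details", None)
    return {
        "prompt_tokens": getattr(usage, "prompt_tokens", 0) or 0,
        "completion_tokens": getattr(usage, "completion_tokens", 0) or 0,
        "total_tokens": getattr(usage, "total_tokens", 0) or 0,
        "reasoning_tokens": getattr(details, "reasoning_tokens", 0) or 0,
    }


# Stand-in objects with made-up numbers, for illustration only.
with_details = SimpleNamespace(
    prompt_tokens=127, completion_tokens=80, total_tokens=207,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=64),
)
without_details = SimpleNamespace(
    prompt_tokens=127, completion_tokens=10, total_tokens=137,
    completion_tokens_details=None,
)
print(usage_to_counts(with_details))
print(usage_to_counts(without_details))
```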


Run the pipeline for all the questions in the dataset. 

In [37]:
import time
import random
from tqdm import tqdm

models = ["o3-mini"]

results = []  # to accumulate results for each question, model, and reasoning level

# Assuming df is a DataFrame containing the questions, choices, answerKey, and id.
for item in tqdm(range(total_rows), desc="Processing Questions"):
    q_text = str(df.iloc[item].question) + "\n" + "choices: " + str(df.iloc[item].choices)
    expected = df.iloc[item].answerKey

    for model in models:
        for reasoning_effort in ["low", "medium", "high"]:
            try:
                answer, usage, duration = response_with_reasoning_effort(model, q_text, reasoning_effort)
                # Normalize both sides and require an exact match so that
                # empty or verbose answers are not counted as correct.
                ans_norm = answer.lower().strip()
                exp_norm = str(expected).lower().strip()
                correct = ans_norm == exp_norm

                results.append({
                    "id": df.iloc[item].id,
                    "model": model,
                    "level": reasoning_effort,
                    "model_answer": answer,
                    "correct": correct,
                    "prompt_tokens": usage.prompt_tokens,
                    "total_tokens": usage.total_tokens,
                    "reasoning_tokens": usage.completion_tokens_details.reasoning_tokens,
                    "duration": duration
                })

                # Add a random delay of 0.01 to 0.05 seconds between requests
                time.sleep(random.uniform(0.01, 0.05))

            except Exception as e:
                print(f"Error processing question: {df.iloc[item].id} with model {model} at reasoning level {reasoning_effort}: {e}")
                # Skip this combination

# Convert results to DataFrame for further analysis
df_results = pd.DataFrame(results)

Processing Questions:   0%|          | 4/1172 [00:48<3:57:42, 12.21s/it]


KeyboardInterrupt: 

In [28]:
print(df_results)


                   id    model level model_answer  correct  prompt_tokens  \
0     Mercury_7175875  o3-mini   low            C     True            127   
1     Mercury_7175875  o3-mini  high            C     True            127   
2   Mercury_SC_409171  o3-mini   low            B     True            142   
3   Mercury_SC_409171  o3-mini  high            B     True            142   
4   Mercury_SC_408547  o3-mini   low            C     True            133   
5   Mercury_SC_408547  o3-mini  high            C     True            133   
6      Mercury_407327  o3-mini   low            D     True            152   
7      Mercury_407327  o3-mini  high            D     True            152   
8      MCAS_2006_9_44  o3-mini   low            D     True            172   
9      MCAS_2006_9_44  o3-mini  high            D     True            172   
10    Mercury_7270393  o3-mini   low            B     True            174   
11    Mercury_7270393  o3-mini  high            D    False            174   

### Step 3: Choose the model/parameter based on cost/performance trade-off

Let's plot accuracy against latency for each model/reasoning effort pair.
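
One way to prepare that plot, sketched under assumptions: group the result rows by model and reasoning level, average the metrics, then draw one point per pair. The function names `summarize` and `plot_accuracy_vs_latency` are hypothetical, the demo rows are made up for illustration, and the plotting step assumes matplotlib is installed.

```python
from collections import defaultdict


def summarize(records):
    """Group result rows by (model, level) and average the key metrics."""
    groups = defaultdict(list)
    for row in records:
        groups[(row["model"], row["level"])].append(row)

    summary = []
    for (model, level), rows in sorted(groups.items()):
        n = len(rows)
        summary.append({
            "model": model,
            "level": level,
            "n": n,
            "accuracy": sum(bool(r["correct"]) for r in rows) / n,
            "mean_duration": sum(r["duration"] for r in rows) / n,
            "mean_total_tokens": sum(r["total_tokens"] for r in rows) / n,
        })
    return summary


def plot_accuracy_vs_latency(summary):
    """Scatter accuracy against mean latency, one labeled point per pair."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    for s in summary:
        ax.scatter(s["mean_duration"], s["accuracy"])
        ax.annotate(f'{s["model"]} ({s["level"]})', (s["mean_duration"], s["accuracy"]))
    ax.set_xlabel("Mean latency per question (s)")
    ax.set_ylabel("Accuracy")
    ax.set_title("Accuracy vs. latency by reasoning effort")
    return fig


# Illustrative rows only; in the notebook, use df_results.to_dict("records").
demo = [
    {"model": "o3-mini", "level": "low", "correct": True, "duration": 2.0, "total_tokens": 200},
    {"model": "o3-mini", "level": "low", "correct": False, "duration": 4.0, "total_tokens": 300},
    {"model": "o3-mini", "level": "high", "correct": True, "duration": 10.0, "total_tokens": 900},
]
for s in summarize(demo):
    print(s)
```

Mean total tokens is included as a rough proxy for cost. Converting it to a dollar figure depends on per-token pricing, which is not covered here.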

