### Use the AIW dataset to benchmark DeepSeek R1

https://github.com/LAION-AI/AIW


In [1]:
!pip install -qU  datasets tqdm huggingface_hub backoff

In [2]:
import getpass
hf_api_key = getpass.getpass("HuggingFace API key: ") 


In [3]:
from datasets import load_dataset

# Load the "Alice in Wonderland" dataset from Hugging Face
dataset = load_dataset("marianna13/aiw-prompts")
data = dataset['test'] if 'test' in dataset else dataset['train']

print(data[:5])  # Inspect the first 5 rows 



{'id': [1, 2, 3, 4, 5], 'text': ["Alice has 4 brothers and she also has 1 sister. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:", "Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:", "Alice has four brothers and she also has one sister. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:", "Alice has four sisters and she also has one brother. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:", "Alice has 4 brothers and she also has 1 sister. How many sisters does Alice's brother have?"], 'right_answer': ['2', '5', '2', '5', 

In [5]:
from huggingface_hub import AsyncInferenceClient
import backoff

client = AsyncInferenceClient(provider="together",
                            api_key=hf_api_key,
                            timeout=60*2 # extend timeout window to two minutes per request
                            )


# Requests to HF frequently fail with timeouts or gateway errors, so
# retry each call with exponential backoff (at most 3 attempts in total)
@backoff.on_exception(backoff.expo, Exception, max_tries=3)
async def _query_deepseek_internal(prompt: str):
    response = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip(), response.usage

async def query_deepseek(prompt: str):
    try:
        return await _query_deepseek_internal(prompt)
    except Exception as e:
        # If it fails after all retries, return error message and empty usage
        return str(e), {}

In [6]:
answer, usage = await query_deepseek("what is the capital of England")
print(answer)
print(usage)


<think>
Okay, the user is asking for the capital of England. Let me start by recalling basic geography. England is a part of the United Kingdom, right? So, the capital city... Hmm. London comes to mind immediately. But wait, I should make sure I'm not confusing it with other parts of the UK. Like, Scotland's capital is Edinburgh, Wales has Cardiff, and Northern Ireland's is Belfast. But England's capital is definitely London. Let me double-check. Yes, London is not only the capital of England but also the largest city in the UK. It's where the government is based, with places like the Houses of Parliament and Buckingham Palace. I don't think that's changed. Some people might get confused because the UK has multiple countries within it, each with their own capitals, but England's is London. So the answer should be straightforward.
</think>

The capital of England is **London**. 

London is not only the capital of England but also the largest city in the United Kingdom. It is a major glo
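
The output above shows that R1 emits its reasoning inside a `<think>...</think>` block before the final answer. For scoring it can help to separate the two. The following is a minimal sketch, not part of the original notebook; the helper name `split_reasoning` is hypothetical, and it assumes at most one reasoning block appears at the start of the completion.

```python
import re

# Matches the reasoning trace that R1 emits before its final answer.
_THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)

def split_reasoning(generated: str) -> tuple[str, str]:
    """Return (reasoning, final_answer) for a raw R1 completion."""
    match = _THINK_RE.search(generated)
    if match is None:
        # No complete <think> block: treat the whole text as the answer.
        return "", generated.strip()
    reasoning = match.group(1).strip()
    final_answer = generated[match.end():].strip()
    return reasoning, final_answer
```

Calling `split_reasoning(answer)` on a completion returns the trace and the visible answer as two strings, so the trace can be logged separately from the text that is scored.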

In [23]:
from tqdm.notebook import tqdm

# Validate DeepSeek's R1 model against the dataset
results = []
total_output_tokens = 0
max_single_answer_tokens = 0

r1_token_cost = 7.0 / 1e6  # USD per output token ($7 per 1M tokens), based on Together.ai's pricing

for row in tqdm(data):
    prompt = row["text"]
    expected = row["right_answer"]

    print(f"prompt: {prompt}\ncorrect answer: {expected}")

    generated, usage = await query_deepseek(prompt)
    print(f"generated answer: {generated}\n")

    # Record the results
    results.append({
        "prompt": prompt,
        "expected": expected,
        "generated": generated
    })

    # Keep track of output tokens and total cost
    if usage:  # empty dict if the request failed after all retries
        total_output_tokens += usage.completion_tokens
        max_single_answer_tokens = max(max_single_answer_tokens, usage.completion_tokens)
        print(f"total output tokens cost: ${total_output_tokens * r1_token_cost:.4f}\n")

print(f"maximum tokens for a single answer: {max_single_answer_tokens}")


prompt: Alice has 4 brothers and she also has 1 sister. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:
correct answer: 2
generated answer: <think>
Okay, let's see. Alice has 4 brothers and 1 sister. The question is asking how many sisters each of Alice's brothers has. Hmm.

First, let's figure out the family structure. Alice is a girl. She has 4 brothers, so those are four male siblings. She also has 1 sister. So that means there's Alice, her sister, and the four brothers. Wait, but wait—are there more sisters? Wait, Alice has 1 sister, so total sisters would be Alice plus that one sister, making two girls. But maybe I'm misunderstanding the setup.

Wait, the problem states: Alice has 4 brothers and 1 sister. So in total, the siblings are Alice, the 4 brothers, and 1 sister. So that's 6 siblings in total: 4 brothers, 2 sisters (Alice and her sister). Now, each brother's sisters w
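
The loop above awaits one request at a time even though the client is asynchronous. A possible way to overlap requests is to cap the number in flight with a semaphore. This is a sketch under assumptions, not the notebook's method: `run_bounded` and its `max_concurrency` default of 4 are hypothetical, and the limit the provider tolerates is not known from the source.

```python
import asyncio

async def run_bounded(prompts, query_fn, max_concurrency=4):
    """Run query_fn over prompts with at most max_concurrency calls in flight.

    Results are returned in the same order as prompts.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _one(prompt):
        # Wait for a free slot before issuing the request.
        async with semaphore:
            return await query_fn(prompt)

    return await asyncio.gather(*(_one(p) for p in prompts))
```

In a notebook cell this could be used as `pairs = await run_bounded(data["text"], query_deepseek)`, which yields one `(generated, usage)` pair per prompt. Because `query_deepseek` catches its own exceptions, a failed request does not cancel the others.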

In [25]:
import json

def write_r1_results(filename="r1_results.json"):
    """Serialize the results and token statistics to a JSON file."""
    with open(filename, "w") as f:
        json.dump({
            "results": results,
            "total_output_tokens": total_output_tokens,
            "max_single_answer_tokens": max_single_answer_tokens,
            "total_cost": total_output_tokens * r1_token_cost
        }, f, indent=4)

    print(f"Results saved to {filename}")

write_r1_results()


Results saved to r1_results.json


In [26]:
len(results)

274

In [7]:
import re

debug = False

def print_results(results):

    total_timeouts = 0
    total_correct = 0
    total_incorrect = 0
    total_no_answer = 0

    for result in results:

        if debug: print(f"generated: {result['generated']}\nexpected: {result['expected']}")

        if "504" in result["generated"]:
            total_timeouts += 1
        else:
            match = re.search(r"### Answer:\s*(\d+)", result["generated"])
            if match:
                result["numeric_answer"] = match.group(1)  # The captured digits
                if result["numeric_answer"] == result["expected"]:
                    total_correct += 1
                else:
                    total_incorrect += 1
            else:
                result["numeric_answer"] = None
                if debug: print(f"malformatted answer: {result['generated']}")
                total_no_answer +=1

    print(f"Results:\n\ttimeout: {total_timeouts}\n\tcorrect: {total_correct}\n\tincorrect: {total_incorrect}\n\tmalformatted answer: {total_no_answer}")
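
`re.search` returns the first `### Answer:` match, which may sit inside the reasoning trace if the model drafts an answer before committing to one. A possible alternative, sketched here with the hypothetical helper `extract_final_answer`, is to take the last match in the completion. It assumes the final answer is the last occurrence of the pattern.

```python
import re

# Same pattern as in print_results: the answer marker followed by digits.
_ANSWER_RE = re.compile(r"### Answer:\s*(\d+)")

def extract_final_answer(generated: str):
    """Return the digits of the last '### Answer:' match, or None if absent."""
    matches = _ANSWER_RE.findall(generated)
    return matches[-1] if matches else None
```

The return value is a string of digits, so it can be compared directly with `result["expected"]`, which the dataset also stores as a string.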


In [8]:
import os
import json

def load_r1_results(filename="r1_results.json"):
    """
    Reads and returns the parsed JSON content of `filename` if it exists.
    Prints a notice and returns None if the file does not exist.
    """
    if os.path.exists(filename):
        with open(filename, "r") as f:
            return json.load(f)
    else:
        print(f"{filename} does not exist.")
        return None

In [None]:
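# Re-run only the requests that timed out. Are these timing out because they are harder, or randomly?
saved = load_r1_results()
results = saved["results"]
print(f"Loaded {len(results)} results")

# Restore the running totals from the saved run so they keep accumulating
total_output_tokens = saved["total_output_tokens"]
max_single_answer_tokens = saved["max_single_answer_tokens"]

r1_token_cost = 7.0 / 1e6  # USD per output token, based on Together.ai's pricing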

for result in results:
    if "504" in result["generated"]:
        prompt = result["prompt"]
        expected = result["expected"]
        print(f"prompt: {prompt}\ncorrect answer: {expected}")

        generated, usage = await query_deepseek(prompt)
        print(f"generated answer: {generated}\n")

        # Check the new response, not the stored one, so a successful retry replaces the error
        if "504" not in generated:
            # Record the results
            result["generated"] = generated

            # Keep track of output tokens and total costs
            if usage:  # empty dict if the request failed after all retries
                total_output_tokens += usage.completion_tokens
                max_single_answer_tokens = max(max_single_answer_tokens, usage.completion_tokens)
                print(f"total output tokens cost: ${total_output_tokens*r1_token_cost}\n")

print(f"maximum tokens for a single answer: {max_single_answer_tokens}")

Loaded 274 results
prompt: Alice has 4 brothers and she also has 1 sister. How many sisters does Alice's brother have?
correct answer: 2
