### Use the AIW dataset to benchmark DeepSeek R1

https://github.com/LAION-AI/AIW


In [19]:
!pip install -qU  datasets tqdm huggingface_hub backoff

In [4]:
import getpass
hf_api_key = getpass.getpass("HuggingFace API key: ") 


In [3]:
from datasets import load_dataset

# Load the "Alice in Wonderland" dataset from Hugging Face
dataset = load_dataset("marianna13/aiw-prompts")
data = dataset['test'] if 'test' in dataset else dataset['train']

print(data[:5])  # Inspect the first 5 rows 




{'id': [1, 2, 3, 4, 5], 'text': ["Alice has 4 brothers and she also has 1 sister. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:", "Alice has 4 sisters and she also has 1 brother. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:", "Alice has four brothers and she also has one sister. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:", "Alice has four sisters and she also has one brother. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:", "Alice has 4 brothers and she also has 1 sister. How many sisters does Alice's brother have?"], 'right_answer': ['2', '5', '2', '5', 

In [22]:
from huggingface_hub import AsyncInferenceClient
import backoff

client = AsyncInferenceClient(provider="together",
                            api_key=hf_api_key,
                            timeout=60*5 # extend timeout window to five minutes per request
                            )


# Requests routed through HF frequently time out or return gateway errors,
# so retry with exponential backoff (up to 5 attempts)
@backoff.on_exception(backoff.expo, Exception, max_tries=5, factor=10)
async def _query_deepseek_internal(prompt: str):
    response = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip(), response.usage

async def query_deepseek(prompt: str):
    try:
        return await _query_deepseek_internal(prompt)
    except Exception as e:
        # If it fails after all retries, return error message and empty usage
        return str(e), {}

In [21]:
answer, usage = await query_deepseek("what is the capital of England")
print(answer)
print(usage)


<think>
Okay, the user is asking, "What is the capital of England?" Let's start by confirming that England is part of the United Kingdom, which also includes Scotland, Wales, and Northern Ireland. The capital city of the UK as a whole is London. However, since the question specifically mentions England, I need to make sure that London is indeed the capital of England itself and not just the UK.

Wait, sometimes people might confuse the capitals within the UK. For instance, Scotland's capital is Edinburgh, Wales' is Cardiff, and Northern Ireland's is Belfast. But for England, there isn't a separate capital city distinct from the UK's capital. London serves both roles. So the answer should be London. But maybe I should double-check if there's any historical context or if there's been any change. For example, was there ever a different capital for England in the past? I recall that Winchester was the capital of the ancient kingdom of Wessex and later England before London became prominent
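
The response above opens with a `<think>` block holding the model's reasoning. A minimal sketch for separating that reasoning from the visible answer follows, assuming the block is closed with a matching `</think>` tag (the truncated output here does not show one). `split_reasoning` is a hypothetical helper, not part of `huggingface_hub`.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a response that may contain a <think> block."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        # No closed <think> block (e.g. truncated output): treat everything as the answer
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is whatever remains once the <think> block is cut out
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

# Hand-written example string, not real model output
reasoning, answer = split_reasoning("<think>\nLondon serves both roles.\n</think>\n\nLondon")
print(answer)
```

Applying this to the `generated` strings before storing them would keep the reasoning and the final answer in separate fields.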

In [None]:
from tqdm.notebook import tqdm

# Validate DeepSeek's R1 model against the dataset
results = []
total_output_tokens = 0
max_single_answer_tokens = 0

r1_token_cost = 7.0 / 1e6  # USD per output token ($7 per 1M tokens), based on Together.ai's pricing

for prompt, expected in zip(data["text"], data["right_answer"]):

    print(f"prompt: {prompt}\ncorrect answer: {expected}")

    # Query the DeepSeek R1 model
    generated, usage = await query_deepseek(prompt)
    print(f"generated answer: {generated}\n")

    # Record the results
    results.append({
        "prompt": prompt,
        "expected": expected,
        "generated": generated
    })

    # Keep track of output tokens and the running cost
    if usage:  # empty if the request failed after all retries
        total_output_tokens += usage.completion_tokens
        max_single_answer_tokens = max(max_single_answer_tokens, usage.completion_tokens)
        print(f"total output tokens cost: ${total_output_tokens * r1_token_cost:.4f}\n")

print(f"maximum tokens for a single answer: {max_single_answer_tokens}")


prompt: Alice has 4 brothers and she also has 1 sister. How many sisters does Alice's brother have? To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:
correct answer: 2
generated answer: <think>
Okay, let's see. Alice has 4 brothers and 1 sister. The question is asking how many sisters each of Alice's brothers has. Hmm.

First, let's figure out the family structure. Alice is a girl. She has 4 brothers, so those are four male siblings. She also has 1 sister. So that means there's Alice, her sister, and the four brothers. Wait, but wait—are there more sisters? Wait, Alice has 1 sister, so total sisters would be Alice plus that one sister, making two girls. But maybe I'm misunderstanding the setup.

Wait, the problem states: Alice has 4 brothers and 1 sister. So in total, the siblings are Alice, the 4 brothers, and 1 sister. So that's 6 siblings in total: 4 brothers, 2 sisters (Alice and her sister). Now, each brother's sisters w
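
The loop above awaits one request at a time, even though the client is asynchronous. A rough sketch of running several prompts concurrently with a bounded number of in-flight requests is shown below. `run_all` and `fake_query` are hypothetical names; `fake_query` stands in for `query_deepseek` so the example runs without network access, and the concurrency limit is an arbitrary assumption, not a documented rate limit.

```python
import asyncio

async def run_all(prompts, query_fn, max_concurrency=4):
    """Run query_fn over prompts with at most max_concurrency requests in flight.

    Results are returned in the same order as the prompts.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(prompt):
        async with semaphore:
            return await query_fn(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts))

# Stand-in for query_deepseek: returns (text, usage) like the real function
async def fake_query(prompt):
    await asyncio.sleep(0.01)
    return f"echo: {prompt}", {}

demo = asyncio.run(run_all(["a", "b", "c"], fake_query, max_concurrency=2))
print(demo)
```

Inside a notebook, where an event loop is already running, the last call would be `await run_all(...)` rather than `asyncio.run(...)`.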

In [16]:
import json

filename = "r1_results.json"

# Serialize results to a JSON file
with open(filename, "w") as f:
    json.dump({
        "results": results,
        "total_output_tokens": total_output_tokens,
        "max_single_answer_tokens": max_single_answer_tokens,
        "total_cost": total_output_tokens * r1_token_cost
    }, f, indent=4)

print(f"Results saved to {filename}")


Results saved to r1_results.json


In [18]:
len(results)

38
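
The notebook collects the generated answers but does not score them. One possible way to do so is sketched below, assuming the final answer follows a `### Answer:` marker as most prompts request, and that any reasoning is wrapped in a closed `<think>...</think>` block. `extract_final_answer` and `accuracy` are hypothetical helpers, and the sample records are made up for illustration.

```python
import re

def extract_final_answer(generated: str):
    """Return the text after the last '### Answer:' marker, or None if absent."""
    # Drop any closed <think>...</think> block so the reasoning is not parsed
    visible = re.sub(r"<think>.*?</think>", "", generated, flags=re.DOTALL)
    matches = re.findall(r"###\s*Answer:\s*([^\n]*)", visible)
    return matches[-1].strip() if matches else None

def accuracy(results):
    """Fraction of results whose extracted answer exactly equals the expected string."""
    if not results:
        return 0.0
    correct = sum(
        extract_final_answer(r["generated"]) == str(r["expected"]).strip()
        for r in results
    )
    return correct / len(results)

# Made-up records in the same shape as the entries in `results`
sample = [
    {"expected": "2", "generated": "<think>two girls</think>\n### Answer: 2"},
    {"expected": "5", "generated": "### Answer: 4"},
]
print(accuracy(sample))
```

Exact string matching is strict: an answer such as `2.` or `two` counts as wrong, and prompts that do not ask for the `### Answer:` format yield `None`. Failed requests, whose `generated` field holds an error message, are also counted as incorrect.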