## Reasoning experiments on Zebra Puzzle

Goal: compare how accurately models solve Zebra Puzzle with reasoning turned on and off.

Based on https://huggingface.co/blog/yuchenlin/zebra-logic

### From Future Directions

- More reasoning methods: We are interested in evaluating LLM agents (e.g., ReAct, Reflexion, SwiftSage). Also, we'll explore advanced prompting and fine-tuning methods like Tree of Thoughts, Flow of Reasoning, etc.
- More evaluation methods: We are considering trying a multiple-choice format for faster evaluation. Also, the language of the clues can be further paraphrased to be more natural and diverse.
- Fine-tuning with Logic Puzzles: Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?
- **Analyze the internal reasoning mechanism of LLMs: how do LLMs reason correctly and incorrectly?**
- More tasks: We'll add more types of logic puzzles that require a more diverse set of reasoning abilities in the evaluation.

### How reasoning works in APIs

- https://console.groq.com/docs/reasoning
- https://ai.google.dev/gemini-api/docs/thinking
- https://api-docs.deepseek.com/guides/reasoning_model
- https://platform.openai.com/docs/guides/reasoning/use-case-examples?api-mode=chat
- https://docs.api.nvidia.com/nim/reference/nvidia-llama-3_1-nemotron-ultra-253b-v1#quick-start-and-usage-recommendations
- https://docs.litellm.ai/docs/reasoning_content
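
The model strings in the cells below appear to follow a `name[-effort]@provider` convention: for example, `qwen/qwen3-32b-none@groq` is logged as a request to `qwen/qwen3-32b`. How `zebra_utils` actually decodes these strings is not shown in this notebook, so the following is only a sketch of one possible approach. `parse_model_spec` and `EFFORTS` are hypothetical names introduced for illustration.

```python
# Hypothetical helper: not part of zebra_utils, whose parsing logic is not shown here.
EFFORTS = ("none", "low", "medium", "high", "default")

def parse_model_spec(spec):
    """Split 'name[-effort]@provider' into (name, effort, provider)."""
    name, _, provider = spec.partition("@")
    effort = None
    for candidate in EFFORTS:
        suffix = "-" + candidate
        if name.endswith(suffix):
            # Strip the effort suffix so only the real API model name remains.
            name, effort = name[: -len(suffix)], candidate
            break
    return name, effort, provider or None

print(parse_model_spec("qwen/qwen3-32b-none@groq"))
print(parse_model_spec("llama3-8b-8192@groq"))
```

Models without a recognized suffix come back with `effort=None`, which a caller could treat as "do not send `reasoning_effort` at all".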

In [None]:
import zebra_utils
import json
import config
config.update_environ()

#model = 'qwen/qwen3-32b-default@groq'
#model = 'qwen-qwq-32b@groq'
#model = 'deepseek-r1-distill-llama-70b@groq'
#model = 'llama3-8b-8192@groq' # reasoning_effort not supported
#model = 'gemma2-9b-it@groq' # reasoning_effort not supported
#model = 'allam-2-7b@groq' # reasoning_effort not supported
#model = "gemini-2.5-flash-preview-05-20-default@google" # reasoning_effort: none, low, medium, high
#model = "gemini-2.5-flash-preview-05-20-high@google" # reasoning_effort: none, low, medium, high
#model = "gemini-2.5-pro-exp-03-25-default@google"
model = "nvidia/llama-3.1-nemotron-ultra-253b-v1@nvidia" # with <think>
#model = "nvidia/nemotron-4-340b-instruct@nvidia" # no think, max 8k tokens

size = '3*3'
max_tokens = 1024*8

dataset = zebra_utils.get_zebra_dataset(size, random=True)

item = dataset[0]
prompt = zebra_utils.get_zebra_grid_prompt(item)
solution_expected = zebra_utils.get_zebra_grid_solution(item)

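# json_mode=False: the model answers in free-form text and the JSON is extracted afterwards.
answers, thoughts, usage = zebra_utils.openai_chat_request(model, prompt=prompt, max_tokens=max_tokens, json_mode=False)

# Separate the JSON answer from any surrounding text, then parse it.
jsonstr, text = zebra_utils.extract_json(answers[0])
answer = json.loads(jsonstr)
thought = thoughts[0] if thoughts else None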

print(item)
print(answer)
print(thought)
print(usage)
print(answer['solution'] == solution_expected)


Received response from OpenAI API with model nvidia/llama-3.1-nemotron-ultra-253b-v1
ChatCompletion(id='chatcmpl-ae512adfecbb4c05994222e6a59df46b', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='<think>\nOkay, let\'s try to solve this puzzle step by step. So, there are three houses with different people, each having a unique name, hobby, and mother\'s name. The names are Arnold, Eric, Peter; hobbies are photography, cooking, gardening; mothers are Aniya, Janelle, Holly. Let\'s go through the clues one by one.\n\nClue 1 says Arnold is in the first house. So House 1\'s Name is Arnold. That\'s straightforward.\n\nClue 2: There\'s one house between the person whose mother is Holly and Eric. So Holly\'s house and Eric\'s house are separated by one house. That could mean Holly is in House 1 and Eric in House 3, or Holly in House 3 and Eric in House 1. But since Arnold is already in House 1, Eric can\'t be in House 1. So if Holly is in Hou
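
The reply logged above carries its reasoning inline, as a `<think>` block at the start of `content`, rather than in a separate field. A minimal sketch of separating the two parts is shown below. `split_think` is a hypothetical helper, and `zebra_utils` may handle this differently.

```python
import re

# Hypothetical helper; zebra_utils may separate thoughts differently.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_think(content):
    """Return (thought, answer) from a message that may embed <think>...</think>."""
    match = THINK_RE.search(content)
    if match is None:
        # No complete think block: treat the whole message as the answer.
        return None, content.strip()
    thought = match.group(1).strip()
    answer = (content[: match.start()] + content[match.end():]).strip()
    return thought, answer

thought, answer = split_think('<think>\nArnold is in House 1.\n</think>\n{"solution": {}}')
print(thought)
print(answer)
```

If a response is cut off by `max_tokens` before `</think>` appears, this sketch returns the whole text as the answer, so the JSON parsing step would then fail and the item would count as a failure.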

In [None]:
import zebra_utils
import json
import datetime
import config
config.update_environ()

# OpenAI models: https://platform.openai.com/docs/pricing
#model = "gpt-4.1-nano-high" # no free

# Anthropic models: https://docs.anthropic.com/en/docs/about-claude/models/overview#model-aliases
#model = "claude-sonnet-4-0@anthropic" # no free

# X.ai models: https://docs.x.ai/docs/models?cluster=us-east-1
#model = "grok-3-mini-high@xai" # no free

# Deepseek models: https://api-docs.deepseek.com/
#model = "deepseek-reasoner" # no free

# Nvidia models:
#model = 'nvidia/llama-3.1-nemotron-70b-instruct@nvidia'
#model = 'nvidia/llama-3.1-nemotron-ultra-253b-v1@nvidia'

# Google models: https://ai.google.dev/gemini-api/docs/models
#model = "gemini-2.5-pro-exp-03-25-none@google" # pro not free
#model = "gemini-2.5-flash-preview-05-20-high@google" # reasoning_effort: none, low, medium, high, default (auto)
#model = "gemini-2.0-flash@google" # reasoning_effort not supported

# Groq models: https://console.groq.com/docs/models
#model = "qwen/qwen3-32b-default@groq" # reasoning_effort: none and default
model = "qwen/qwen3-32b-none@groq" # reasoning_effort: none and default
#model = "qwen-qwq-32b@groq" # reasoning_effort not supported
#model = 'deepseek-r1-distill-llama-70b-default@groq' # reasoning_effort not supported, with <think> always!
#model = 'llama-3.3-70b-versatile@groq'
#model = 'llama3-8b-8192@groq' # reasoning_effort not supported
#model = 'gemma2-9b-it@groq' # reasoning_effort not supported
#model = 'allam-2-7b@groq' # reasoning_effort not supported

size = '2*2'
count = 3
successed = 0
max_tokens = 1024*4

print(f'Testing {count} items {size} on {model}')

dataset = zebra_utils.get_zebra_dataset(size, random=True)
dt = datetime.datetime.now()
res_items = []
total_tokens = 0

for i in range(count):
    item = dataset[i]
    prompt = zebra_utils.get_zebra_grid_prompt(item)
    solution_expected = zebra_utils.get_zebra_grid_solution(item)
    thought = None
    usage = None

    print(item['id'])
    #display(prompt)
    #display(Markdown('# Model: ' + model))
    #display(Markdown(prompt))

    try:
        answers, thoughts, usage = zebra_utils.openai_chat_request(model, prompt=prompt, max_tokens=max_tokens, json_mode=False)
        if usage is not None: total_tokens += usage['total_tokens']
        jsonstr, text = zebra_utils.extract_json(answers[0])
        answer = json.loads(jsonstr)
        thought = thoughts[0] if len(thoughts) > 0 else None

        display(answer)
        #display(Markdown(answer[0]))

        solution_answered = answer['solution']
        print('Expected:', solution_expected)
        print('Answered:', solution_answered)
        success = solution_expected == solution_answered
        if success: successed += 1
    except Exception as ex:
        print(ex)
        success = False
        if hasattr(ex, 'body'):
            answer = ex.body
        else:
            answer = str(ex)
    print('SUCCESS' if success else 'FAILURE')

    res_items.append(zebra_utils.generate_result_item(item, prompt, answer, success, thought, usage))

results = {
    "model": model,
    "size": size,
    "max_tokens": max_tokens,
    "total_tokens": total_tokens,
    "started": dt.isoformat(),
    "duration": str(datetime.datetime.now()-dt),
    "count": count,
    "successed": successed,
    "rate": int(100 * successed / count),
    "items": res_items,
}

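# Write collected info as JSON; build the path portably and make sure the folder exists.
import os
os.makedirs("results", exist_ok=True)
filename = os.path.join("results", f"{dt.strftime('%Y%m%d_%H%M%S')}_{zebra_utils.normalize_filename(model)}_{size.replace('*','x')}.json")
with open(filename, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=4)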

print(f'Passed {int(100 * successed / count)}% ({successed} of {count}) in {datetime.datetime.now()-dt}')
print(f'Results saved to {filename}')

Testing 3 items 2*2 on qwen/qwen3-32b-none@groq
lgp-test-2x2-4
Received response from OpenAI API with model qwen/qwen3-32b
ChatCompletion(id='chatcmpl-6babb268-f022-48d1-881d-c4e57329d752', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n    "reasoning": "From Clue 1, we know the person who owns a Ford F-150 is in House 2. Clue 2 states that Arnold is somewhere to the right of Eric, which means Arnold must be in House 2 and Eric in House 1. Therefore, Arnold owns the Ford F-150 and Eric owns the Tesla Model 3.",\n    "solution": {\n        "House 1": {\n            "Name": "Eric",\n            "CarModel": "tesla model 3"\n        },\n        "House 2": {\n            "Name": "Arnold",\n            "CarModel": "ford f150"\n        }\n    }\n}', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None))], created=1749934616, model='qwen/qwen3-32b', object='chat.completion', service_tier=None,

{'reasoning': 'From Clue 1, we know the person who owns a Ford F-150 is in House 2. Clue 2 states that Arnold is somewhere to the right of Eric, which means Arnold must be in House 2 and Eric in House 1. Therefore, Arnold owns the Ford F-150 and Eric owns the Tesla Model 3.',
 'solution': {'House 1': {'Name': 'Eric', 'CarModel': 'tesla model 3'},
  'House 2': {'Name': 'Arnold', 'CarModel': 'ford f150'}}}

Expected: {'House 1': {'Name': 'Eric', 'CarModel': 'tesla model 3'}, 'House 2': {'Name': 'Arnold', 'CarModel': 'ford f150'}}
Answered: {'House 1': {'Name': 'Eric', 'CarModel': 'tesla model 3'}, 'House 2': {'Name': 'Arnold', 'CarModel': 'ford f150'}}
SUCCESS
lgp-test-2x2-27
Received response from OpenAI API with model qwen/qwen3-32b
ChatCompletion(id='chatcmpl-59a944dc-49c0-4350-9e01-bee3b22d691b', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n    "reasoning": "From Clue 2, Arnold is not in the second house, so Arnold must be in House 1. From Clue 1, Arnold loves pop music. Therefore, Eric must be in House 2 and loves rock music, since there are only two unique music genres and Arnold already has pop.",\n    "solution": {\n        "House 1": {\n            "Name": "Arnold",\n            "MusicGenre": "pop"\n        },\n        "House 2": {\n            "Name": "Eric",\n            "MusicGenre": "rock"\n        }\n    }\n}', refus

{'reasoning': 'From Clue 2, Arnold is not in the second house, so Arnold must be in House 1. From Clue 1, Arnold loves pop music. Therefore, Eric must be in House 2 and loves rock music, since there are only two unique music genres and Arnold already has pop.',
 'solution': {'House 1': {'Name': 'Arnold', 'MusicGenre': 'pop'},
  'House 2': {'Name': 'Eric', 'MusicGenre': 'rock'}}}

Expected: {'House 1': {'Name': 'Arnold', 'MusicGenre': 'pop'}, 'House 2': {'Name': 'Eric', 'MusicGenre': 'rock'}}
Answered: {'House 1': {'Name': 'Arnold', 'MusicGenre': 'pop'}, 'House 2': {'Name': 'Eric', 'MusicGenre': 'rock'}}
SUCCESS
lgp-test-2x2-29
Received response from OpenAI API with model qwen/qwen3-32b
ChatCompletion(id='chatcmpl-8f48f3a1-d6a3-4b2e-bc11-d5b89e1e0049', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n    "reasoning": "From Clue 1, we know Eric is the tea drinker. From Clue 2, the tea drinker cannot be in House 2, so Eric must be in House 1. Since there are only two houses and two people, Arnold must be in House 2. The only remaining drink for Arnold is water.",\n    "solution": {\n        "House 1": {\n            "Name": "Eric",\n            "Drink": "tea"\n        },\n        "House 2": {\n            "Name": "Arnold",\n            "Drink": "water"\n        }\n    }\n}', refusal=None, role='assistant', ann

{'reasoning': 'From Clue 1, we know Eric is the tea drinker. From Clue 2, the tea drinker cannot be in House 2, so Eric must be in House 1. Since there are only two houses and two people, Arnold must be in House 2. The only remaining drink for Arnold is water.',
 'solution': {'House 1': {'Name': 'Eric', 'Drink': 'tea'},
  'House 2': {'Name': 'Arnold', 'Drink': 'water'}}}

Expected: {'House 1': {'Name': 'Eric', 'Drink': 'tea'}, 'House 2': {'Name': 'Arnold', 'Drink': 'water'}}
Answered: {'House 1': {'Name': 'Eric', 'Drink': 'tea'}, 'House 2': {'Name': 'Arnold', 'Drink': 'water'}}
SUCCESS
Passed 100% (3 of 3) in 0:00:01.482519
Results saved to results\20250614_235655_qwen-qwen3-32b-none@groq_2x2.json
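
The logged responses show the answer arriving in two shapes: bare JSON (the 2*2 run above) and JSON wrapped in a json-tagged code fence (the 3*3 run further down). The implementation of `zebra_utils.extract_json` is not shown, so the sketch below is only one plausible way to handle both cases. `extract_json_sketch` is a hypothetical stand-in, not the real function.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the pattern contains no literal fence
FENCED_JSON_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def extract_json_sketch(text):
    """Return (json_string, remaining_text); hypothetical stand-in for zebra_utils.extract_json."""
    match = FENCED_JSON_RE.search(text)
    if match:
        jsonstr = match.group(1).strip()
        rest = (text[: match.start()] + text[match.end():]).strip()
        return jsonstr, rest
    # No fence: fall back to the outermost pair of braces.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    return text[start : end + 1], (text[:start] + text[end + 1:]).strip()

fenced = FENCE + 'json\n{"solution": {"House 1": {"Name": "Eric"}}}\n' + FENCE
jsonstr, rest = extract_json_sketch(fenced)
print(json.loads(jsonstr)["solution"])
```

The brace fallback is deliberately naive: it assumes the response contains a single JSON object and would misfire if the surrounding prose also contained braces.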


In [None]:
from openai import OpenAI
from config import GOOGLE_API_KEY

client = OpenAI(
    api_key=GOOGLE_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

# List the model IDs available through Google's OpenAI-compatible endpoint.
models = client.models.list()
for m in models:
    print(m.id)

models/embedding-gecko-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-flash-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-2.5-pro-exp-03-25
models/gemini-2.5-pro-preview-03-25
models/gemini-2.5-flash-preview-04-17
models/gemini-2.5-flash-preview-05-20
models/gemini-2.5-flash-preview-04-17-thinking
models/gemini-2.5-pro-preview-05-06
models/gemini-2.5-pro-preview-06-05
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-preview-image-generation
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-2.0-pro-exp
models/gemini-2.0-pro-exp-02-05
models/gemini-exp-1206


In [None]:
import zebra_utils

import config
config.update_environ()

models = [
    #'qwen/qwen3-32b-none@groq',
    #'qwen/qwen3-32b-default@groq',
    #'qwen-qwq-32b@groq', # reasoning_effort not supported, always have <think>
    #'deepseek-r1-distill-llama-70b@groq', # reasoning_effort not supported, always have <think>
    #'llama-3.3-70b-versatile@groq', # reasoning_effort not supported
    #'llama3-8b-8192@groq', # reasoning_effort not supported
    #'gemma2-9b-it@groq', # reasoning_effort not supported
    #'allam-2-7b@groq', # reasoning_effort not supported
    'nvidia/llama-3.1-nemotron-70b-instruct@nvidia',
    'nvidia/llama-3.1-nemotron-ultra-253b-v1@nvidia',
    #'gemini-2.5-flash-preview-05-20-none@google', # reasoning_effort: none, low, medium, high
    #'gemini-2.5-flash-preview-05-20-low@google', # reasoning_effort: none, low, medium, high
    #'gemini-2.5-flash-preview-05-20-high@google', # reasoning_effort: none, low, medium, high
    #'gemini-2.5-flash-preview-05-20-default@google', # reasoning_effort: none, low, medium, high, default (auto)
]

zebra_utils.zebra_run(models, '3*3', count=10, max_tokens=1024*8)

Testing 10 items 3*3 on qwen/qwen3-32b-none@groq
lgp-test-3x3-4
Received response from OpenAI API with model qwen/qwen3-32b
ChatCompletion(id='chatcmpl-1faaf63e-317d-47e7-9f3f-ae4f08b14ad9', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='```json\n{\n    "reasoning": "From Clue 5, Arnold enjoys gardening. Clue 3 says Peter is the cooking enthusiast, and Clue 2 states the photography enthusiast is the tea drinker. Clue 4 says Eric is not in the third house, so he must be in either House 1 or House 2. Clue 1 indicates Arnold is to the left of the person who likes milk. Clue 6 says the photography enthusiast and the person who likes milk are next to each other. Testing combinations, if Arnold is in House 1 (gardening), Peter in House 2 (cooking), and Eric in House 3 (photography), it satisfies all clues. Eric must drink tea (Clue 2), Peter drinks water (since Arnold is left of the milk drinker), and Arnold drinks milk.",\n    "solution"

{'qwen/qwen3-32b-none@groq': 50,
 'qwen/qwen3-32b-default@groq': 100,
 'qwen-qwq-32b@groq': 60,
 'deepseek-r1-distill-llama-70b@groq': 100,
 'llama-3.3-70b-versatile@groq': 30,
 'llama3-8b-8192@groq': 0,
 'gemma2-9b-it@groq': 20,
 'allam-2-7b@groq': 0,
 'gemini-2.5-flash-preview-05-20-none@google': 80,
 'gemini-2.5-flash-preview-05-20-low@google': 10,
 'gemini-2.5-flash-preview-05-20-high@google': 100,
 'gemini-2.5-flash-preview-05-20-default@google': 100}
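
To read the reasoning on/off comparison directly from these pass rates, the entries can be grouped by base model and reasoning setting. The sketch below copies the relevant values from the result dict above and assumes the effort is the last `-`-separated part of the name before `@`. `reasoning_gain` is a hypothetical helper, not part of `zebra_utils`.

```python
# Pass rates (%) copied from the result dict above; only models run with both settings.
rates = {
    'qwen/qwen3-32b-none@groq': 50,
    'qwen/qwen3-32b-default@groq': 100,
    'gemini-2.5-flash-preview-05-20-none@google': 80,
    'gemini-2.5-flash-preview-05-20-low@google': 10,
    'gemini-2.5-flash-preview-05-20-high@google': 100,
    'gemini-2.5-flash-preview-05-20-default@google': 100,
}

def reasoning_gain(rates, off="none", on="default"):
    """Map base model -> (rate_off, rate_on, gain) for models run with both settings."""
    by_base = {}
    for spec, rate in rates.items():
        name = spec.partition("@")[0]
        # Assumption: the reasoning effort is the final '-'-separated token.
        base, _, effort = name.rpartition("-")
        by_base.setdefault(base, {})[effort] = rate
    return {
        base: (r[off], r[on], r[on] - r[off])
        for base, r in by_base.items()
        if off in r and on in r
    }

for base, (r_off, r_on, gain) in reasoning_gain(rates).items():
    print(f"{base}: {r_off}% -> {r_on}% ({gain:+d} points)")
```

With only 10 puzzles per model, each item is worth 10 percentage points, so small differences between settings should be read with caution.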