In [None]:
import PIL.Image

In [None]:
!pip install python-dotenv
from dotenv import load_dotenv
import os


Collecting python-dotenv
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.1


In [None]:
load_dotenv(".env")
api_key = os.environ['GOOGLE_API_KEY']

In [None]:
import google.generativeai as genai

In [None]:
import time
import base64

In [None]:
genai.configure(api_key=api_key)

In [None]:
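def get_gem_res(model_str, input_prompt, image_fp):
  """Send the prompt and puzzle image to a Gemini model; return (response text, seconds elapsed)."""
  model = genai.GenerativeModel(model_name=model_str)
  image = PIL.Image.open(image_fp)

  contents = [input_prompt, image]

  # Time only the API call, not model setup or image loading.
  start = time.time()
  res = model.generate_content(contents=contents)
  end = time.time()

  return res.text, end - start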

In [None]:
generic_prompt = '''
    You will be provided an empty KenKen puzzle board, which is a puzzle similar to Sudoku but with mathematical operations. Like Sudoku,
    every row and column must contain the numbers 1 through n, where n is the size of the grid. The thick border lines represent cages,
    which contain a target number and arithmetic operator (+-/*) in the top left cell of each cage. For a given cage, all of the numbers
    that will make up that cage must arrive at the target number through the arithmetic operator. For example in a cage with two cells
    and the symbol 5+, it could be filled in with a 2 and a 3 because 2 + 3 = 5. If there is only one cell in the cage, then it can be
    automatically filled in with the target number.

    Your task is to provide a correct solution to the puzzle provided. The puzzle could have size 3, 4, 5, 6, or 7. All puzzles have at least
    one solution. Explain your reasoning step by step. Format your solution as a 2 dimensional list. An example output for a 3x3 grid is: [[1, 2, 3],[3, 1, 2],[2, 3, 1]]

  '''

In [None]:
models = [
    "gemini-2.0-flash-lite",
    "gemini-2.0-flash",
    "gemini-1.5-flash",
    "gemini-2.5-pro",
    "gemini-2.5-flash"
]

In [None]:
get_gem_res(models[0], generic_prompt, "/content/example3.png")

("Here's the solution to the KenKen puzzle, along with the reasoning:\n\n**Understanding the Puzzle**\n\n*   **Grid Size:** The puzzle is a 3x3 grid.\n*   **Constraints:**\n    *   Each row and column must contain the numbers 1, 2, and 3.\n    *   Numbers within a cage must combine (add, subtract, multiply, or divide) to the target number indicated in the cage.\n    *   Cages cannot share the same number within any row or column.\n\n**Solving the Puzzle**\n\n1.  **6+ cage:** This cage has two cells and needs to sum to 6. The only combination is 3 + 3, however this is not possible as each number can only appear once in each row and column.\n2.  **2÷ cage:** This cage has two cells and needs to result in 2 when divided. The only combination is 2 / 1.\n3.  **6x cage:** This cage has two cells and needs to multiply to get 6. The only combination is 2 * 3.\n4.  **3 cage:** This cage has only one cell, therefore the answer is 3.\n\n**Filling in the Grid**\n\nHere is the solution to the KenKe

Provides a correct solution with incorrect/illogical reasoning

In [None]:
get_gem_res(models[1], generic_prompt, "/content/example3.png")

('Okay, let\'s solve this KenKen puzzle step-by-step.\n\n**Understanding the Puzzle**\n\n*   We have a 3x3 grid, so each row and column must contain the numbers 1, 2, and 3.\n*   Cages indicate groups of cells, a target number, and an operation. The numbers in the cage must result in the target when combined with the operation.\n\n**Solving the Puzzle**\n\n1.  **Bottom Right Cell:** The bottom right cell is a single-cell cage with the value "3". So, we know that grid[2][2] = 3.\n\n2.  **6x Cage:** The "6x" cage spans the center two cells, grid[1][1] and grid[1][0]. The only combination of two numbers from 1, 2, and 3 that multiply to 6 is 2 and 3. Since the row and column will need to each have a 2 and 3, it makes sense that those two cells would be 2 and 3. \n\n3. **2/ cage:** The 2/ cage spans grid[0][1] and grid[0][2]. The two numbers that divide to 2 are 2 and 1. Thus, grid[0][1] and grid[0][2] will be 2 and 1 respectively. \n\n4. **6+ cage:** The 6+ cage spans grid[0][0] and grid[

Incorrectly reads the puzzle and provides a wrong solution

In [None]:
get_gem_res(models[2], generic_prompt, "/content/example3.png")

("Here's how to solve this KenKen puzzle step-by-step:\n\n**1. Single-Cell Cages:**\n\n* The bottom-right cage contains only one cell with the number 3.  This cell must therefore contain the number 3.\n\n**2.  Deduction and Constraints:**\n\n* **Top-left 6+ cage:** This cage has two cells. The only combination of two distinct integers from 1 to 3 that adds up to 6 is 2 + 4, but we only have the numbers 1, 2, and 3 available in this 3x3 grid. This means that one cell must be a 1 and the other a 5. As 5 is not available, we must review our assumptions. Let's look at the other cages.\n\n* **Center-right 2÷ cage:** This also has two cells.  The only pair of integers from 1 to 3 that divides to give 2 is 2 ÷ 1 = 2 or 4 ÷ 2 = 2 (again not feasible). The result must be a number between 1 and 3, and a number must be able to divide to give a number between 1 and 3. Therefore one cell must contain a 2, and another must be a 1.\n\n* **Bottom-middle 6x cage:** This cage has two cells.  The only pa

Provides a correct solution with incorrect/illogical reasoning

In [None]:
get_gem_res(models[3], generic_prompt, "/content/example3.png")

("This is a 3x3 KenKen puzzle. The numbers to be used in each row and column are 1, 2, and 3, with no repeats.\n\nLet's analyze the cages and constraints:\n1.  **Grid size:** 3x3, so numbers are 1, 2, 3.\n2.  **Sudoku Rule:** Each row and column must contain one of each number (1, 2, 3).\n3.  **Cages:** Based on the visual clues and standard KenKen puzzle construction, we can deduce the cages. Note that the provided image contains a known impossibility if interpreted literally (a '6+' cage with two cells in a 3x3 grid), so we must infer the intended, solvable puzzle structure.\n    *   **Cage at (2,2):** A single cell with the number '3'. This is a single-cell cage, so the value is 3. `grid[2][2] = 3`.\n    *   **Cage at (0,0):** '6+'. In a 3x3 grid, a two-cell sum of 6 is impossible (3+3 would violate the column rule). The only logical interpretation is that this is a three-cell cage covering the entire first column. The sum `1+2+3 = 6` satisfies this. So, cells `(0,0)`, `(1,0)`, and 

Offers correct solution and reasoning

In [None]:
get_gem_res(models[4], generic_prompt, "/content/example3.png")

("The KenKen puzzle is a 3x3 grid, meaning each row and column must contain the numbers 1, 2, and 3 exactly once.\n\nHere's a breakdown of the cages and the logic to solve the puzzle:\n\n**1. Interpret Cages from the Image:**\n\n*   **Cage (2,2):** This is a single-cell cage with the number 3.\n    *   `board[2][2] = 3`\n\n*   **Cage (0,1) and (0,2): `2÷`**\n    *   These two cells are in Row 0, Columns 1 and 2.\n    *   They must contain numbers from {1, 2, 3} that divide to 2. The only pair is (1, 2), where one is 1 and the other is 2 (e.g., 2/1 = 2 or 1/2 is not an integer so it must be 2/1). So `board[0][1]` and `board[0][2]` are {1, 2}.\n\n*   **Cage (1,1) and (2,1): `6×`**\n    *   These two cells are in Column 1, Rows 1 and 2.\n    *   They must contain numbers from {1, 2, 3} that multiply to 6. The only pair is (2, 3). So `board[1][1]` and `board[2][1]` are {2, 3}.\n\n*   **Cage (0,0) and (1,0): `6+`**\n    *   These two cells are in Column 0, Rows 0 and 1.\n    *   They must c

Provides a correct solution with illogical and confused reasoning

###Offering puzzle size


Tell the model the puzzle size up front in the prompt.

In [None]:
helpful_prompt = '''
    You will be provided an empty 3x3 KenKen puzzle board, which is a puzzle similar to Sudoku but with mathematical operations. Like Sudoku,
    every row and column must contain the numbers 1 through 3. The thick border lines represent cages,
    which contain a target number and arithmetic operator (+-/*) in the top left cell of each cage. For a given cage, all of the numbers
    that will make up that cage must arrive at the target number through the arithmetic operator. For example in a cage with two cells
    and the symbol 5+, it could be filled in with a 2 and a 3 because 2 + 3 = 5. If there is only one cell in the cage, then it can be
    automatically filled in with the target number.

    Your task is to provide a correct solution to the puzzle provided. All puzzles have at least
    one solution. An example solution for a 3x3 grid is: [[1, 2, 3],[3, 1, 2],[2, 3, 1]]

    Explain your reasoning step by step. Present your solution as a 2 dimensional list as shown above, and explain why your solution is correct.
  '''
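Hard-coding the size means editing the prompt for every grid. A minimal sketch of a template that fills the size in, assuming a hypothetical helper named `build_sized_prompt` (not part of the original notebook) and a cyclic Latin square used purely as a format example:

```python
def build_sized_prompt(size):
    """Return a KenKen prompt that states the grid size (hypothetical helper)."""
    if not 3 <= size <= 7:
        raise ValueError("expected a grid size between 3 and 7")
    # Cyclic Latin square used only to show the output format, not a real solution.
    example = [[(row + col) % size + 1 for col in range(size)] for row in range(size)]
    return (
        f"You will be provided an empty {size}x{size} KenKen puzzle board, which is a "
        "puzzle similar to Sudoku but with mathematical operations. "
        f"Every row and column must contain the numbers 1 through {size}. "
        "The thick border lines represent cages, which contain a target number and "
        "arithmetic operator (+-/*) in the top left cell of each cage. "
        "Explain your reasoning step by step. "
        f"Present your solution as a 2 dimensional list, for example: {example}"
    )
```

Calling `build_sized_prompt(4)` would give a 4x4 variant that could be passed to `get_gem_res` in place of `helpful_prompt`.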

In [None]:
get_gem_res(models[0], helpful_prompt, "/content/example3.png")

("Here's the solution to the KenKen puzzle and the reasoning behind it:\n\n**Solution:**\n\n```\n[[3, 1, 2],\n [1, 2, 3],\n [2, 3, 1]]\n```\n\n**Reasoning:**\n\n1.  **Analyze Cages:**\n    *   **6+ Cage (2 cells):**  The sum of the two cells in this cage must equal 6. The possible combinations with the numbers 1, 2, and 3 are 3+3 (which is impossible since the numbers can only be used once per row/column) and 1+2+3, so it must be a 3 and a 3 (again, can't use the same number twice.) Therefore, this cage must contain a 3 and some other value.\n    *   **2÷ Cage (2 cells):**  The division of the two cells in this cage must equal 2. The only possible combination using 1, 2, and 3 is 2 / 1 = 2.\n    *   **6x Cage (2 cells):**  The product of the two cells in this cage must equal 6. The only possible combination using 1, 2, and 3 is 2 * 3 = 6.\n    *   **Single-cell cage (value 3):** A single-cell cage must contain the target number.\n\n2.  **Start Filling:**\n    *   Place the 3 from the 3

Incorrect solution & reasoning

In [None]:
get_gem_res(models[1], helpful_prompt, "/content/example3.png")

("Okay, let's solve this KenKen puzzle step-by-step.\n\n**1. Analyze the known values and cages:**\n\n*   We have a '3' in the bottom right corner cell.\n*   We have a '6+' cage in the top-left corner, consisting of two cells. The numbers must add up to 6.\n*   We have a '2/' cage in the top-middle column, consisting of two cells. The numbers must divide to 2.\n*   We have a '6x' cage in the center, consisting of two cells. The numbers must multiply to 6.\n\n**2. Deductions:**\n\n*   **6+ cage:** The possible combinations for 6+ using numbers 1, 2, and 3 are (3, 3), (1,5), (2,4). However since numbers cant repeat in a row or column, the only combination that works for a 3x3 grid is (2+4),(1+5). From our pool of available numbers (1,2,3), this cage could only use (3,3) and since we cant repeat numbers, the only valid combination must be (3+3), (2+4), (1+5). \n*   **2/ cage:** The combinations for 2/ are: 2/1 or 4/2,6/3... However, since we can only use numbers 1-3, the combination must 

Incorrect solution & reasoning

In [None]:
get_gem_res(models[2], helpful_prompt, "/content/example3.png")

('Here\'s how to solve this KenKen puzzle step-by-step:\n\n**1. Single-Cell Cages:**\n\n* The bottom-right cage contains only one cell with the number \'3\'.  This cell must contain the number 3.\n\n**2. Simple Arithmetic:**\n\n* The top-right cage has the constraint "2 ÷". The only way to achieve this with numbers 1-3 is 2 ÷ 1 = 2. Therefore, the top-middle cell is a 2 and the top-right cell is a 1.\n\n**3. Deduction and Constraint Satisfaction:**\n\n*  The top-left cage shows "6 +". We have to find two numbers between 1 and 3 which add up to 6. This can only be 3 + 3. Since each row and column only has one of each number, and the top-right is already a 1, and the top-middle is a 2, this cell is a 3, leaving the remaining cell in the top-left cage a 3.\n\n* The middle-left cage shows "6 x". The only way to get a product of 6 using numbers 1-3 is 2 x 3.  Since the top-middle cell is already 2, the middle-left cell must be 3 and the middle-middle cell must be 2.\n\n* Finally, let\'s fil

Correct solution and reasoning

##Increasing puzzle size + difficulty

In [None]:
simple_prompt = '''
    You will be provided an empty KenKen puzzle board, which is a puzzle similar to Sudoku but with mathematical operations. Like Sudoku,
    every row and column must contain the numbers 1 through n, where n is the size of the grid. The thick border lines represent cages,
    which contain a target number and arithmetic operator (+-/*) in the top left cell of each cage. For a given cage, all of the numbers
    that will make up that cage must arrive at the target number through the arithmetic operator. For example in a cage with two cells
    and the symbol 5+, it could be filled in with a 2 and a 3 because 2 + 3 = 5. If there is only one cell in the cage, then it can be
    automatically filled in with the target number.

    Your task is to provide a correct solution to the puzzle provided. The puzzle could have size 3, 4, 5, 6, or 7. All puzzles have at least
    one solution. Format your response as a 2 dimensional list representing the solution for the puzzle. An example response for a 3x3 KenKen puzzle is:
    [[1, 2, 3],[3, 1, 2],[2, 3, 1]]

  '''

In [None]:
res, res_time = get_gem_res(models[3], simple_prompt, "/content/example4.png")

In [None]:
res

"This is a 3x3 KenKen puzzle. The numbers to be used in each row and column are 1, 2, and 3.\n\nHere's a step-by-step deduction to solve the puzzle:\n\n1.  **Analyze the Cages:**\n    *   The grid is 3x3.\n    *   **Cage 1 (Top-Right):** A single cell with the number `3`. We can immediately fill this in: `grid[2][2] = 3`.\n    *   **Cage 2 (Left Column):** A three-cell cage with the rule `6+`. This cage covers the entire first column. In a 3x3 grid, each column must contain the numbers 1, 2, and 3. The sum 1 + 2 + 3 = 6, which matches the cage's rule. So, the first column must contain the numbers 1, 2, and 3 in some order.\n    *   **Cage 3 (Top Row):** A two-cell cage with the rule `2÷`. This cage covers cells `(0,1)` and `(0,2)`. The numbers in these two cells must result in 2 when one is divided by the other. With the numbers 1, 2, and 3, the only pair that satisfies this is {1, 2} (since 2 / 1 = 2).\n    *   **Cage 4 (Remaining Cells):** The remaining three cells form an L-shaped c

In [None]:
res, res_time = get_gem_res(models[4], simple_prompt, "/content/example4.png")

In [None]:
res

'The provided KenKen puzzle is a 4x4 grid, implying that each row and column must contain the numbers 1, 2, 3, and 4 exactly once. The thick border lines define "cages," where the numbers within the cage must satisfy the given arithmetic operation and target number.\n\nUpon analyzing the puzzle image, several inconsistencies and logical impossibilities arise under standard KenKen rules:\n\n1.  **Overlapping Cages:** The cell at `(1,0)` (second row, first column) appears to be part of two cages: the `7+` cage (with `(0,0)`) and the `12x` cage (with `(2,0)`). A cell in KenKen can only belong to one cage. This indicates a drawing error in the puzzle\'s borders.\n2.  **Impossible "8+" Cages (for N=4):**\n    *   The `8+` cage at `(1,2)` (second row, third column) spans two cells vertically: `(1,2)` and `(2,2)`.\n    *   The `8+` cage at `(3,0)` (fourth row, first column) spans two cells horizontally: `(3,0)` and `(3,1)`.\n    *   For a 4x4 grid using numbers 1, 2, 3, 4, the numbers within 

Both top models conclude that no solution exists for the 5x5 puzzle, and neither produces a correct solution for the 4x4 puzzle.

##Conclusion


*    gemini-2.0-flash-lite: could solve 3x3 but provided incorrect reasoning
*    gemini-2.0-flash: could solve 3x3 but provided incorrect reasoning
*    gemini-1.5-flash: could solve the 3x3 only when given the size
*    gemini-2.5-pro: could solve the 3x3 but not 4x4 or 5x5
*    gemini-2.5-flash: could solve the 3x3 but not 4x4 or 5x5, 2x faster than 2.5 pro



###Extracting solution from Gemini Response

In [None]:
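def extract_solution(response, model_str):
  # model_str is unused; it is kept so existing call sites keep working.
  # Treat the last "[[ ... ]]" span in the response as the final answer.
  strt = response.rfind("[[")
  end = response.find("]]", strt)
  if strt == -1 or end == -1:
    return None

  solution = [[]]
  for ch in response[strt:end]:
    if ch.isdigit():
      # Cell values are single digits because puzzles are at most 7x7.
      solution[-1].append(int(ch))
    elif ch == ']':
      # A closing bracket inside the span ends the current row.
      solution.append([])
  return solution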

In [None]:
extract_solution(res, models[3])

[[4, 3, 2, 1], [3, 4, 1, 2], [1, 2, 4, 3], [2, 1, 3, 4]]
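The character-by-character parser relies on every cell being a single digit. A stricter alternative, sketched here under the assumption that the model prints a well-formed Python list literal, matches the last nested list with a regular expression and parses it with `ast.literal_eval`. The name `extract_solution_strict` is hypothetical and not part of the original notebook:

```python
import ast
import re

# Outer bracket, then one or more comma-separated inner lists with no nested brackets.
_GRID_PATTERN = re.compile(r"\[\s*\[[^\[\]]*\](?:\s*,\s*\[[^\[\]]*\])*\s*\]")

def extract_solution_strict(response):
    """Return the last [[...]] grid in the response as a list of int rows, or None."""
    matches = _GRID_PATTERN.findall(response)
    if not matches:
        return None
    try:
        grid = ast.literal_eval(matches[-1])
    except (ValueError, SyntaxError):
        # The span looked like a grid but was not a valid literal.
        return None
    # Reject anything that is not a list of lists of integers.
    if not all(isinstance(row, list) and all(isinstance(v, int) for v in row) for row in grid):
        return None
    return grid
```

Unlike the original function, this version returns `None` for a grid such as `[[a, b]]` instead of silently producing empty rows, and it keeps multi-digit values intact.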

##Z3 as solution validation

In [None]:
!pip install z3-solver

Collecting z3-solver
  Downloading z3_solver-4.15.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (602 bytes)
Downloading z3_solver-4.15.1.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.5 MB)
Installing collected packages: z3-solver
Successfully installed z3-solver-4.15.1.0


In [None]:
from z3 import *

In [None]:
def parse_block_constraints(puzzle, cells):
    constraints = []
    for block in puzzle:
        op = block["op"]
        target = block["target"]
        vars_in_block = [cells[i][j] for i, j in block["cells"]]
        if op == "":
            constraints.append(vars_in_block[0] == target)
        elif op == "add":
            constraints.append(Sum(vars_in_block) == target)
        elif op == "mul":
            product = vars_in_block[0]
            for v in vars_in_block[1:]:
                product *= v
            constraints.append(product == target)
        elif op == "sub" and len(vars_in_block) == 2:
            a, b = vars_in_block
            constraints.append(Or(a - b == target, b - a == target))
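        elif op == "div" and len(vars_in_block) == 2:
            a, b = vars_in_block
            # Z3's "/" on Int terms is integer division, so 5 / 2 == 2 would pass;
            # compare with a multiplication to require an exact quotient.
            constraints.append(Or(a == target * b, b == target * a))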
        else:
            raise ValueError(f"Unsupported operation or malformed block: {block}")
    return constraints



In [None]:
def validate_solution(puzzle, size, solution):
  X = [ [ Int("x_%s_%s" % (i+1, j+1)) for j in range(size) ]
      for i in range(size) ]
  cells_c  = [ And(1 <= X[i][j], X[i][j] <= size)
              for i in range(size) for j in range(size) ]
  rows_c   = [ Distinct(X[i]) for i in range(size) ]
  cols_c   = [ Distinct([ X[i][j] for i in range(size) ])
              for j in range(size) ]
  constraints = cells_c + rows_c + cols_c + parse_block_constraints(puzzle, X)
  instance = [
        X[i][j] == solution[i][j]
        for i in range(size)
        for j in range(size)
    ]
  s = Solver()
  problem = constraints + instance
  s.add(problem)
  return s.check() == sat

In [None]:
import json
with open("/puzzles/puzzles_dict.json", "r") as f:
    puzzles_ds = json.load(f)

In [None]:
solution = extract_solution(res, models[3])
validate_solution(puzzles_ds["4"][5], 4, solution)


False
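As a cross-check on the Z3 validator, the same rules can be tested in plain Python. This is a minimal sketch, assuming the block format read by `parse_block_constraints` (`"op"`, `"target"`, `"cells"`); the function name `check_solution` and the puzzle `demo_puzzle` are hypothetical and do not come from the dataset:

```python
from math import prod

def check_solution(puzzle, size, solution):
    """Check the row, column, and cage rules for a filled grid without a solver."""
    expected = set(range(1, size + 1))
    if len(solution) != size or any(len(row) != size for row in solution):
        return False
    # Every row and every column must contain 1..size exactly once.
    if any(set(row) != expected for row in solution):
        return False
    if any({solution[i][j] for i in range(size)} != expected for j in range(size)):
        return False
    for block in puzzle:
        values = [solution[i][j] for i, j in block["cells"]]
        op, target = block["op"], block["target"]
        if op == "":
            ok = values[0] == target
        elif op == "add":
            ok = sum(values) == target
        elif op == "mul":
            ok = prod(values) == target
        elif op == "sub" and len(values) == 2:
            ok = abs(values[0] - values[1]) == target
        elif op == "div" and len(values) == 2:
            # Exact division only: larger value must equal target times smaller value.
            ok = max(values) == target * min(values)
        else:
            raise ValueError(f"Unsupported operation or malformed block: {block}")
        if not ok:
            return False
    return True

# Hypothetical 3x3 puzzle for illustration only.
demo_puzzle = [
    {"op": "add", "target": 4, "cells": [(0, 0), (1, 0)]},
    {"op": "mul", "target": 6, "cells": [(0, 1), (0, 2)]},
    {"op": "div", "target": 2, "cells": [(1, 1), (1, 2)]},
    {"op": "", "target": 2, "cells": [(2, 0)]},
    {"op": "sub", "target": 2, "cells": [(2, 1), (2, 2)]},
]
```

If the two validators ever disagree on the same grid, that points to a modelling bug in one of them rather than a model error.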

###Evaluating Gemini 2.5 Pro + Flash on Puzzle Dataset

In [None]:
pro_accuracy = {3:0, 4:0, 5:0, 6:0, 7:0}
pro_avg_time = {3:0, 4:0, 5:0, 6:0, 7:0}
pro_responses = {3:[], 4:[], 5:[], 6:[], 7:[]}

In [None]:
flash_accuracy = {3:0, 4:0, 5:0, 6:0, 7:0}
flash_avg_time = {3:0, 4:0, 5:0, 6:0, 7:0}
flash_responses = {3:[], 4:[], 5:[], 6:[], 7:[]}

In [None]:
model_str = models[3]
num_puzzles = 30
input_prompt = simple_prompt
total = 0
size = 7

In [None]:
for i in range(0, min(num_puzzles, len(puzzles_ds[str(size)]))):
  filepath= "/board_images/board"+str(size)+"_"+str(i)+".png"
  res, res_time = get_gem_res(model_str, input_prompt, filepath)
  pro_responses[size].append(res)
  pro_avg_time[size] += res_time

  solution = extract_solution(res, model_str)
  if solution and len(solution)==size and all(len(row) == size for row in solution) and validate_solution(puzzles_ds[str(size)][i], size, solution):

    pro_accuracy[size] += 1

  total+=1
  print(str(pro_accuracy[size])+"/"+str(total))
  time.sleep(5)


0/1
0/2
0/3
0/4
0/5
0/6
0/7
0/8
0/9
0/10
0/11
0/12
0/13
0/14
0/15
0/16
0/17
0/18
0/19
0/20
0/21
0/22
0/23


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 2990.65ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 2427.34ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 16755.16ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 785.23ms


0/24


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 15762.82ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 2453.71ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 17173.72ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 10924.33ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 23016.30ms


0/25


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 5284.12ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 5341.97ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 42789.23ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 33053.03ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 4905.08ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 2478.86ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 16179.44ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateConte

0/26
0/27
0/28
0/29


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 60014.17ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-pro:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 14040.42ms


0/30
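The 503 lines in the output above suggest transient server-side failures during the run. Whether they surface as Python exceptions depends on the client library, so the following is only a sketch of a generic retry wrapper with exponential backoff. The helper name `call_with_retries` and its parameters are hypothetical, and it catches every `Exception` because the exact error class is an unknown here:

```python
import time

def call_with_retries(fn, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fn() and retry with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

In the evaluation loop this could wrap the request as `call_with_retries(lambda: get_gem_res(model_str, input_prompt, filepath))`. The elapsed time returned by `get_gem_res` would then cover only the successful attempt, not the waits.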


In [None]:
# Average over the puzzles actually run instead of a hard-coded size and count.
pro_avg_time[size] = pro_avg_time[size] / total
#pro_accuracy[size] = pro_accuracy[size] / total
total = 0

In [None]:
model_str = models[4]
num_puzzles = 10
input_prompt = simple_prompt
total = 0
for size in range(3, 4):
  for i in range(0, min(num_puzzles, len(puzzles_ds[str(size)]))):
    filepath= "/board_images/board"+str(size)+"_"+str(i)+".png"
    res, res_time = get_gem_res(model_str, input_prompt, filepath)
    flash_responses[size].append(res)
    flash_avg_time[size] += res_time

    solution = extract_solution(res, model_str)
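    # Check the grid shape first so a malformed answer cannot raise an IndexError in the validator.
    if solution and len(solution) == size and all(len(row) == size for row in solution) and validate_solution(puzzles_ds[str(size)][i], size, solution):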
      flash_accuracy[size] += 1

    total+=1
    print(str(flash_accuracy[size])+"/"+str(total))
    time.sleep(5)
  flash_avg_time[size] = flash_avg_time[size] / total
  flash_accuracy[size] = flash_accuracy[size] / total
  total = 0

0/1
0/2
1/3
2/4
2/5
2/6
2/7
2/8
3/9
4/10
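The Pro and Flash loops repeat the same steps: query, extract, check the grid shape, validate, and accumulate. A sketch of a shared harness follows, assuming a hypothetical function `evaluate` that receives the model call, extractor, and validator as arguments so it can be exercised without network access:

```python
def evaluate(puzzles, ask, extract, validate, size):
    """Return (accuracy, average seconds, responses) for one puzzle size."""
    correct, elapsed, responses = 0, 0.0, []
    for index, puzzle in enumerate(puzzles):
        # ask(index) is expected to return (response text, seconds elapsed).
        text, seconds = ask(index)
        responses.append(text)
        elapsed += seconds
        grid = extract(text)
        # Reject missing or wrongly shaped grids before running the validator.
        well_formed = (
            grid is not None
            and len(grid) == size
            and all(len(row) == size for row in grid)
        )
        if well_formed and validate(puzzle, size, grid):
            correct += 1
    total = len(puzzles)
    if total == 0:
        return 0.0, 0.0, responses
    return correct / total, elapsed / total, responses
```

With the notebook's functions, `ask` could be a lambda that builds the board image path from the index and calls `get_gem_res`, `extract` a lambda around `extract_solution`, and `validate` the `validate_solution` function.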


In [None]:
print("Gemini 2.5 Flash 3x3 Results: \nAccuracy: ", flash_accuracy[3], "\nAverage Time: ", flash_avg_time[3])

Gemini 2.5 Flash 3x3 Results: 
Accuracy:  0.4 
Average Time:  95.44324486255645


In [None]:
print("Gemini 2.5 Pro 3x3 Results: \nAccuracy: ", pro_accuracy[3], "\nAverage Time: ", pro_avg_time[3])

Gemini 2.5 Pro 3x3 Results: 
Accuracy:  74 
Average Time:  145.44219735622406


In [None]:
print("Gemini 2.5 Pro 4x4 Results: \nAccuracy: ", pro_accuracy[4], "\nAverage Time: ", pro_avg_time[4])

Gemini 2.5 Pro 4x4 Results: 
Accuracy:  30 
Average Time:  240.86327896595


In [None]:
print("Gemini 2.5 Pro 5x5 Results: \nAccuracy: ", pro_accuracy[5], "\nAverage Time: ", pro_avg_time[5])

Gemini 2.5 Pro 5x5 Results: 
Accuracy:  0 
Average Time:  249.93609276771545


In [None]:
print("Gemini 2.5 Pro 6x6 Results: \nAccuracy: ", pro_accuracy[6], "\nAverage Time: ", pro_avg_time[6])

Gemini 2.5 Pro 6x6 Results: 
Accuracy:  0 
Average Time:  276.3617357635498


In [None]:
print("Gemini 2.5 Pro 7x7 Results: \nAccuracy: ", pro_accuracy[7], "\nAverage Time: ", pro_avg_time[7])

Gemini 2.5 Pro 7x7 Results: 
Accuracy:  0 
Average Time:  279.10149833361305


In [None]:
import pandas as pd

In [None]:
results = pd.DataFrame({
    'accuracy (%)': pro_accuracy,
    'avg_time (s)': pro_avg_time
})

In [None]:
results.to_csv('/results/gemini_evaluation.csv', index=True)

###Example Response

In [None]:
print(flash_responses[3][8])

The puzzle is a 3x3 KenKen grid, which means the numbers 1, 2, and 3 must appear exactly once in each row and each column.

Let's identify the cages and their constraints based on the provided image and standard KenKen rules (clue in the top-left cell of the cage, thick lines define cage borders).

1.  **Cage with "2"**: Cell (1,1) has "2". This is a single-cell cage.
    *   Therefore, `Grid[1][1] = 2`.

2.  **Cage with "3÷"**: Cell (2,0) has "3÷". The thick border indicates this cage includes cells (2,0) and (2,1).
    *   The numbers in these two cells must divide to 3. Possible pairs from {1,2,3} are (3,1), since 3 ÷ 1 = 3. (1 ÷ 3 = 1/3, not 3).
    *   Since (2,0) is the top-left (leftmost) cell of the cage, `Grid[2][0] = 3` and `Grid[2][1] = 1`.

Now, let's fill in these definite values into the 3x3 grid:

```
[?, ?, ?]
[?, 2, ?]
[3, 1, ?]
```

Next, we apply the Sudoku-like rules (each row and column must contain 1, 2, and 3 exactly once):

*   **Row 2**: We have [3, 1, ?]. The 