# Optimization for 9x9 Sudoku
As discussed in Chapter 02, we now extend our Sudoku puzzle from 4x4 to 9x9.
We check whether the puzzle is solved using:
- `entropy_mean`, which should converge to 0
- `check_grid`, which checks whether the Sudoku rules are fulfilled

All required functions are defined in `/source`.

## 1.0 Initialization of 9x9 Grid
- The puzzle grid is used to build our optimizer tensor `Z` (the logits).
- `Z` is then handed over to the optimizer (`optimize_sudoku`) in Section 2.0.

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import os, sys

# In Colab: make sure you're in the repo folder
if "google.colab" in sys.modules:
    !git clone https://github.com/Silverlode76/sudoku-ai-tutorial.git
    %cd sudoku-ai-tutorial

# Ensure repo root is on Python path
sys.path.insert(0, os.getcwd())

print("cwd =", os.getcwd())
print("source exists =", os.path.isdir("source"))


torch.set_printoptions(precision=4, sci_mode=False)
device = torch.device("cpu")

from source import probs_from_logits, pretty_grid_from_probs, optimize_sudoku, loss_dict, print_losses, check_grid
from source import entropy_mean

dim = 9
grid = torch.tensor([
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    [0, 9, 8, 0, 0, 0, 0, 6, 0],

    [8, 0, 0, 0, 6, 0, 0, 0, 3],
    [4, 0, 0, 8, 0, 3, 0, 0, 1],
    [7, 0, 0, 0, 2, 0, 0, 0, 6],

    [0, 6, 0, 0, 0, 0, 2, 8, 0],
    [0, 0, 0, 4, 1, 9, 0, 0, 5],
    [0, 0, 0, 0, 8, 0, 0, 7, 9],
], dtype=torch.long)

# Build givens mask + targets (targets are digit-1 in [0..8])
givens_mask = grid != 0
givens_target = (grid-1).clamp(min=0)

Z = torch.zeros((dim,dim,dim), dtype=torch.float32, device=device)

# Strongly bias given cells (large positive logit for the correct digit, negative for others)
high = 6.0
low = -6.0
for r in range(dim):
    for c in range(dim):
        if givens_mask[r,c]:
            k = int(givens_target[r,c].item())  # 0..8
            Z[r,c,:] = low
            Z[r,c,k] = high


grid, givens_mask, givens_target, Z

**Interpretation**  
All input variables required by the Adam optimizer (`Z`, `givens_mask`, and `givens_target`) are now defined.
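To see what the `high`/`low` bias does to a given cell, the sketch below applies a temperature softmax to one biased cell and one free cell. It assumes that `probs_from_logits` is a softmax over the digit axis of `Z / T`; the helper name `softmax_with_temperature` is hypothetical and stands in for the repository function, whose implementation may differ.

```python
import torch
import torch.nn.functional as F

def softmax_with_temperature(Z: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Hypothetical stand-in for probs_from_logits: softmax over the last (digit) axis."""
    return F.softmax(Z / T, dim=-1)

high, low = 6.0, -6.0

# A given cell holding digit 5 (index 4): low everywhere, high at the target index
given = torch.full((9,), low)
given[4] = high

# A free cell: all logits are zero
free = torch.zeros(9)

p_given = softmax_with_temperature(given, T=1.2)
p_free = softmax_with_temperature(free, T=1.2)

print(p_given.argmax().item(), p_given.max().item())  # index 4, probability close to 1
print(p_free)                                         # uniform: every entry is 1/9
```

Under this assumption, given cells start out almost one-hot, while free cells start with maximal uncertainty and are shaped only by the loss terms.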

## 2.0 Start Optimization 
Now we start the Adam optimizer with the following parameters:
- number of iteration steps
- `T_start` (initial temperature)
- `T_end` (final temperature)
- `lr` (learning rate)
- `block_size` (3 for a 9x9 grid)

At the end we plot some graphs which show that the loss function converges to 0.
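The temperature moves from `T_start` to `T_end` over the course of the run. The source does not state which schedule `optimize_sudoku` uses, so the following is only a sketch that assumes linear interpolation; the function name `linear_temperature` is hypothetical.

```python
def linear_temperature(step: int, steps: int, T_start: float, T_end: float) -> float:
    """Linearly interpolate from T_start (first step) to T_end (last step).

    Assumption: a linear schedule. The actual schedule in optimize_sudoku may differ.
    """
    if steps <= 1:
        return T_end
    frac = step / (steps - 1)
    return T_start + frac * (T_end - T_start)

# Same values as in the cell below: 100 steps from 1.2 down to 0.7
schedule = [linear_temperature(s, 100, 1.2, 0.7) for s in range(100)]
print(schedule[0], schedule[49], schedule[-1])  # starts at T_start, ends at T_end
```

A lower temperature sharpens the softmax, so cooling from 1.2 to 0.7 gradually pushes each cell toward a single digit. The recorded values in `hist["T"]` show the schedule that was actually applied.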

In [None]:
T_start = 1.2
T_end = 0.7
lr = 0.3
block_size = 3

Z_final, hist = optimize_sudoku(
    Z, givens_mask, givens_target,
    steps=100,
    lr=lr,
    T_start=T_start,
    T_end=T_end,
    block_size=block_size,
    w_row=1.0,
    w_col=1.0,
    w_blk=1.2,
    w_giv=2.0,
    w_ent=0.01,
    save_P_every=10,
)

P_final = probs_from_logits(Z_final, T=T_end)

plt.figure()
plt.plot(hist["L_total"])
plt.title("Total loss over iterations")
plt.xlabel("iteration")
plt.ylabel("L_total")
plt.show()

plt.figure()
plt.plot(hist["L_ent"])
plt.title("Entropy loss over iterations")
plt.xlabel("iteration")
plt.ylabel("L_ent")
plt.show()

plt.figure()
plt.plot(hist["T"])
plt.title("Temperature schedule")
plt.xlabel("iteration")
plt.ylabel("T")
plt.show()


**Interpretation**  
We observe that the total loss converges to 0 by step 20.
However, we do not know whether the solution at step 20 is correct.

Let's take a closer look by checking every 10th step.

In [None]:
for step, P in hist["P_snapshots"].items():
    mean_entropy = entropy_mean(P)
    #ent = -(P * (P + 1e-9).log()).sum(dim=2)   # (9,9)
    print(step, "mean_entropy=", mean_entropy)
    check_grid(P)
    conf = P.max(dim=2).values                 # (9,9)
    print(step, "mean_maxP=", conf.mean().item(), "min_maxP=", conf.min().item())
    print("---------------------------------------\n")



## Inspecting the optimization trajectory (`P_snapshots`)

During optimization we store intermediate probability tensors `P` in `hist["P_snapshots"]`.
Each snapshot represents the model’s current belief distribution over digits for every cell.

The loop above prints a compact set of diagnostics for each stored step:

- **Mean entropy (`mean_entropy`)**  
  Entropy measures uncertainty.  
  - high entropy → probabilities are spread out (the model is unsure)  
  - low entropy → probabilities are peaked (the model is confident)

  Tracking the *mean* entropy over all 81 cells tells us whether the solution is becoming more “decisive” over time.

- **Constraint check (`check_grid(P)`)**  
  This is a sanity check: we convert the current probabilities to a discrete grid (typically via `argmax`) and verify whether row/column/block rules are satisfied (or how badly they are violated).  
  Even when the final solution is not perfect yet, the violations should generally decrease across steps.

- **Confidence statistics (`mean_maxP` and `min_maxP`)**  
  `P.max(dim=2).values` yields, for each cell, the probability of the currently most likely digit.  
  - `mean_maxP` → average confidence across the grid  
  - `min_maxP` → the “weakest” cell (least confident argmax)

  This is useful because a Sudoku often fails due to a small number of ambiguous cells:
  even if most cells become confident, one or two low-confidence cells can still break a row/col/block.
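The entropy and confidence statistics can be reproduced in a few lines. The sketch below follows the commented-out formula in the loop above; `mean_entropy_sketch` is a hypothetical name, and the repository's `entropy_mean` may differ in details such as the epsilon or the return type.

```python
import torch

def mean_entropy_sketch(P: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean per-cell entropy of a (9, 9, 9) probability tensor."""
    ent = -(P * (P + eps).log()).sum(dim=2)  # (9, 9): one entropy value per cell
    return ent.mean()

# Two extreme cases: completely unsure vs. completely decided
uniform = torch.full((9, 9, 9), 1.0 / 9.0)
one_hot = torch.zeros(9, 9, 9)
one_hot[:, :, 0] = 1.0

print(mean_entropy_sketch(uniform).item())  # close to ln(9), about 2.197
print(mean_entropy_sketch(one_hot).item())  # close to 0

# Confidence per cell: probability of the most likely digit
conf_uniform = uniform.max(dim=2).values    # every cell is 1/9
conf_one_hot = one_hot.max(dim=2).values    # every cell is 1.0
print(conf_uniform.mean().item(), conf_one_hot.min().item())
```

These two cases bracket the values printed by the loop: a run that is converging should move from the uniform end toward the one-hot end.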

Overall, these diagnostics help us separate two questions:
1) Are we becoming **more confident**? (entropy ↓, maxP ↑)  
2) Are we becoming **more correct** w.r.t. Sudoku constraints? (check_grid improves)
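The second question can be answered with a discrete check. The sketch below is an assumption about how such a check might look, not the repository's `check_grid`: it takes the `argmax` per cell and tests that every row, column, and block contains each digit exactly once. The name `is_valid_solution` and the test grid are hypothetical.

```python
import torch
import torch.nn.functional as F

def is_valid_solution(P: torch.Tensor, block_size: int = 3) -> bool:
    """Return True if the argmax grid of P satisfies all Sudoku rules."""
    dim = block_size * block_size
    digits = P.argmax(dim=2) + 1              # (dim, dim), values 1..dim
    expected = torch.arange(1, dim + 1)
    for i in range(dim):
        if not torch.equal(digits[i, :].sort().values, expected):
            return False                      # row i has a duplicate
        if not torch.equal(digits[:, i].sort().values, expected):
            return False                      # column i has a duplicate
    for br in range(0, dim, block_size):
        for bc in range(0, dim, block_size):
            block = digits[br:br + block_size, bc:bc + block_size].reshape(-1)
            if not torch.equal(block.sort().values, expected):
                return False                  # block has a duplicate
    return True

# A valid grid built from a standard shifting pattern (illustrative data only)
idx = torch.tensor([[(3 * (r % 3) + r // 3 + c) % 9 for c in range(9)]
                    for r in range(9)])
P_good = F.one_hot(idx, num_classes=9).float()

# Break it: copy the distribution of cell (0, 1) into cell (0, 0)
P_bad = P_good.clone()
P_bad[0, 0] = P_good[0, 1]

print(is_valid_solution(P_good))  # True
print(is_valid_solution(P_bad))   # False
```

A check like this ignores confidence entirely, which is why it complements the entropy and `maxP` statistics rather than replacing them.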


## Summary

In this notebook we demonstrated how a Sudoku can be formulated as a **differentiable constraint satisfaction problem**.

Key takeaways:

- A Sudoku grid `(r × c)` can be extended into a probability tensor `P (r × c × k)`, where each cell represents a distribution over possible digits.
- Sudoku rules (row, column, and block uniqueness) can be expressed as **soft constraints** and combined into a loss function.
- Given digits are enforced via a mask, anchoring the optimization to the known clues.
- A single optimization step already moves the probability mass toward a valid solution.
- No search, backtracking, or discrete solver is required — the structure emerges from the constraints.
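As a reminder of what "soft constraints" means here, the sketch below penalizes every row, column, and block in which the total probability of a digit deviates from 1. This is one common formulation and is only an assumption about the loss terms in `/source`; the function name `soft_constraint_losses` and the squared-error form are hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_constraint_losses(P: torch.Tensor, block_size: int = 3) -> dict:
    """Squared deviation from 'each digit exactly once' per row, column, and block."""
    dim = P.shape[0]
    row = ((P.sum(dim=1) - 1.0) ** 2).mean()   # sum over columns -> (row, digit)
    col = ((P.sum(dim=0) - 1.0) ** 2).mean()   # sum over rows    -> (col, digit)
    # Split both grid axes into (block index, index inside block)
    blocks = P.reshape(block_size, block_size, block_size, block_size, dim)
    blk_sum = blocks.sum(dim=(1, 3))           # (block_row, block_col, digit)
    blk = ((blk_sum - 1.0) ** 2).mean()
    return {"row": row, "col": col, "blk": blk}

# A valid grid (illustrative shifting pattern) gives zero loss
idx = torch.tensor([[(3 * (r % 3) + r // 3 + c) % 9 for c in range(9)]
                    for r in range(9)])
P_good = F.one_hot(idx, num_classes=9).float()
good = soft_constraint_losses(P_good)

# Duplicating a digit inside row 0 makes the losses positive
P_bad = P_good.clone()
P_bad[0, 0] = P_good[0, 1]
bad = soft_constraint_losses(P_bad)

print({k: v.item() for k, v in good.items()})
print({k: v.item() for k, v in bad.items()})
```

Note that a fully uniform `P` also yields zero loss under this formulation, since every digit's probabilities still sum to 1 in each row, column, and block. Terms like these therefore need an additional push toward decisive distributions, which is the role an entropy term can play.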

This approach highlights an important idea:
**Sudoku solving can be seen as continuous optimization guided by structure, rather than discrete trial-and-error.**

In the next steps, we will:
- iterate the optimization until convergence,
- analyze failure cases,
- and discuss practical stabilization techniques.
