# 4×4 Sudoku with Only Row/Column/Block Losses

This notebook demonstrates why row, column, and block losses alone do not solve the 4×4 Sudoku.
With uniform logits the constraint losses are already exactly zero, yet the puzzle is unsolved because:

- the givens are not enforced, and
- the probabilities keep high entropy instead of collapsing to single digits.

The example mirrors the puzzle used in the README.

## Setup
We start with zero logits so each cell has a uniform probability over the four digits.
The given clues are only recorded for evaluation; they are **not** part of the row/column/block losses.

In [None]:
import torch
import torch.nn.functional as F

# Fixed 4×4 Sudoku clues (1-based values; zero means empty)
puzzle = torch.tensor([
    [0, 0, 0, 4],
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 3, 0],
])
given_mask = puzzle > 0

# Start with zero logits -> uniform probabilities after softmax
digits = 4
Z = torch.zeros(4, 4, digits)
P = F.softmax(Z, dim=2)
print(P[0, 0])  # every cell is uniform: 0.25 for each of the four digits


## Row/Column/Block losses
With uniform probabilities, the Sudoku constraints already yield zero loss:

In [None]:
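def block_sums(P):
    # Split rows and columns into (block index, offset inside block), giving axes
    # (block_row, row_in_block, block_col, col_in_block, digit), then sum over both offsets
    return P.view(2, 2, 2, 2, digits).sum(dim=(1, 3))

row_sum = P.sum(dim=1)   # per-row total probability of each digit, shape (4, digits)
col_sum = P.sum(dim=0)   # per-column total probability of each digit, shape (4, digits)
blocks = block_sums(P)   # per-block total probability of each digit, shape (2, 2, digits)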

L_row = ((row_sum - 1.0) ** 2).sum()
L_col = ((col_sum - 1.0) ** 2).sum()
L_block = ((blocks - 1.0) ** 2).sum()

print(f"L_row={L_row.item():.4f}, L_col={L_col.item():.4f}, L_block={L_block.item():.4f}")
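The uniform grid is not the only zero of these losses. The sketch below is an illustration rather than part of the original notebook: it evaluates the same three terms on the uniform grid, on a one-hot encoding of a valid 4×4 Sudoku that contradicts the clues, and on a 50/50 blend of the two. The helper `constraint_loss` and the grid `other_grid` are names assumed here for the example.

```python
import torch
import torch.nn.functional as F

def constraint_loss(P):
    """Sum of squared row, column, and 2x2-block violations for a (4, 4, 4) tensor."""
    row = P.sum(dim=1)                               # (row, digit)
    col = P.sum(dim=0)                               # (col, digit)
    blk = P.reshape(2, 2, 2, 2, 4).sum(dim=(1, 3))   # (block_row, block_col, digit)
    return ((row - 1) ** 2).sum() + ((col - 1) ** 2).sum() + ((blk - 1) ** 2).sum()

uniform = torch.full((4, 4, 4), 0.25)

# Hypothetical grid: a valid 4x4 Sudoku that does NOT match the notebook's clues
other_grid = torch.tensor([
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
])
one_hot = F.one_hot(other_grid - 1, num_classes=4).float()
blend = 0.5 * uniform + 0.5 * one_hot

for name, P_test in [("uniform", uniform), ("one-hot", one_hot), ("blend", blend)]:
    print(f"{name}: constraint loss = {constraint_loss(P_test).item():.4f}")
```

All three grids satisfy the sum constraints, so the losses by themselves cannot tell a wrong grid, a blurry grid, and the intended solution apart.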


## What the optimizer would see
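Because all three losses are already zero, and zero is their minimum, the gradient with respect to the logits vanishes as well, so gradient descent would never leave the uniform start. Yet the grid neither respects the givens nor commits to single digits.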
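The stall can be checked directly with autograd. This is a minimal sketch, assuming the same squared-sum losses as above but recomputed on logits that track gradients; the variable names are chosen here for illustration.

```python
import torch
import torch.nn.functional as F

# Uniform starting logits, this time tracked by autograd
Z = torch.zeros(4, 4, 4, requires_grad=True)
P = F.softmax(Z, dim=2)

row = P.sum(dim=1)
col = P.sum(dim=0)
blk = P.reshape(2, 2, 2, 2, 4).sum(dim=(1, 3))
loss = ((row - 1) ** 2).sum() + ((col - 1) ** 2).sum() + ((blk - 1) ** 2).sum()
loss.backward()

# Largest gradient magnitude over all 64 logits
grad_max = Z.grad.abs().max().item()
print(f"loss={loss.item():.4f}, largest |gradient|={grad_max:.4f}")
```

A zero gradient on every logit means any gradient-based optimizer leaves `Z` where it started.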

## Adding a givens loss
Row/column/block losses ignore the clues. A givens loss keeps each preset cell close to its target digit.
Here we use a simple summed squared error against the one-hot digit for every given.

In [None]:
# Build one-hot targets for every cell; empty cells are clamped to digit 1 as a
# placeholder and are excluded from the loss by given_mask below
targets = F.one_hot(puzzle.clamp(min=1) - 1, num_classes=digits).float()

# Penalize deviation from the given digits only where clues exist
L_givens = ((P[given_mask] - targets[given_mask]) ** 2).sum()
total_loss = L_row + L_col + L_block + L_givens

print(f'Givens loss: {L_givens.item():.4f}')
print(f'Total loss with givens: {total_loss.item():.4f}')


With uniform probabilities, the givens loss is the only non-zero term.
It forces optimization to move the preset cells toward their clues,
so the combined loss no longer stalls at zero. An entropy or sharpening
term is still needed to collapse each cell to a single digit.
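To see that movement, the following sketch takes a single plain gradient-descent step on the combined loss from the uniform start. The step size `lr = 1.0` is an assumption picked only for illustration, and the setup is repeated so the snippet runs on its own.

```python
import torch
import torch.nn.functional as F

puzzle = torch.tensor([
    [0, 0, 0, 4],
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 3, 0],
])
given_mask = puzzle > 0
targets = F.one_hot(puzzle.clamp(min=1) - 1, num_classes=4).float()

Z = torch.zeros(4, 4, 4, requires_grad=True)
P = F.softmax(Z, dim=2)

row = P.sum(dim=1)
col = P.sum(dim=0)
blk = P.reshape(2, 2, 2, 2, 4).sum(dim=(1, 3))
constraints = ((row - 1) ** 2).sum() + ((col - 1) ** 2).sum() + ((blk - 1) ** 2).sum()
givens = ((P[given_mask] - targets[given_mask]) ** 2).sum()
(constraints + givens).backward()

lr = 1.0  # assumed step size, for illustration only
with torch.no_grad():
    Z -= lr * Z.grad

guess = Z.argmax(dim=2) + 1
givens_ok = (guess[given_mask] == puzzle[given_mask]).sum().item()
moved_empty = Z[~given_mask].abs().max().item()
print(f"Givens satisfied after one step: {givens_ok}/{given_mask.sum().item()}")
print(f"Largest logit change in empty cells: {moved_empty:.4f}")
```

After this first step only the clue cells have moved; the empty cells receive a gradient only on later steps, once the shifted clue cells start to violate the row, column, and block sums.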

In [None]:
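# P is still the uniform start (no optimization step has been run), so every argmax is a tie.
# torch.argmax returns the first maximal index on ties, hence digit 1 in every cell.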
solution = P.argmax(dim=2) + 1

# Check how many givens are satisfied
correct_givens = (solution[given_mask] == puzzle[given_mask]).sum().item()
total_givens = given_mask.sum().item()

# Measure mean entropy per cell to show probabilities are not one-hot
entropy = -(P * P.log()).sum(dim=2).mean()

print(solution)
print(f"Givens satisfied: {correct_givens}/{total_givens}")
print(f"Mean entropy per cell: {entropy.item():.3f} nats")


## Takeaway
Row, column, and block losses alone cannot solve the puzzle.
They are already minimized by a uniform distribution that ignores the givens.

To actually solve the Sudoku we still need:
- a **givens loss** to force the clues, and
- an **entropy or sharpening term** (or low temperature) so each cell collapses to one digit.
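
One way these pieces could fit together is sketched below. It is not the notebook's or the README's implementation: the function `total_loss`, the weight `entropy_weight=0.1`, the Adam optimizer, the learning rate, and the step count are all assumptions chosen for illustration, and the run is not guaranteed to reach a valid grid.

```python
import math
import torch
import torch.nn.functional as F

puzzle = torch.tensor([
    [0, 0, 0, 4],
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 3, 0],
])
given_mask = puzzle > 0
targets = F.one_hot(puzzle.clamp(min=1) - 1, num_classes=4).float()

def total_loss(Z, entropy_weight):
    """Constraint + givens + weighted entropy; also returns the mean cell entropy."""
    logP = F.log_softmax(Z, dim=2)   # log-probabilities stay finite for sharp cells
    P = logP.exp()
    row = P.sum(dim=1)
    col = P.sum(dim=0)
    blk = P.reshape(2, 2, 2, 2, 4).sum(dim=(1, 3))
    constraints = ((row - 1) ** 2).sum() + ((col - 1) ** 2).sum() + ((blk - 1) ** 2).sum()
    givens = ((P[given_mask] - targets[given_mask]) ** 2).sum()
    entropy = -(P * logP).sum(dim=2).mean()
    return constraints + givens + entropy_weight * entropy, entropy

# Entropy of the uniform start, for reference (ln 4 nats per cell)
_, start_entropy = total_loss(torch.zeros(4, 4, 4), entropy_weight=0.1)

Z = torch.zeros(4, 4, 4, requires_grad=True)
optimizer = torch.optim.Adam([Z], lr=0.1)   # assumed optimizer and learning rate
for step in range(500):                      # assumed step count
    optimizer.zero_grad()
    loss, entropy = total_loss(Z, entropy_weight=0.1)
    loss.backward()
    optimizer.step()

print(Z.argmax(dim=2) + 1)
print(f"final loss={loss.item():.4f}, mean entropy={entropy.item():.3f} nats")
```

The entropy weight trades off sharpness against the constraints: too small and cells may stay blurry, too large and cells may commit before the constraints have propagated from the clues.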