<a href="https://colab.research.google.com/github/GomathyDhanya/SudokuRL/blob/main/DQNSUDOKU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **REINFORCEMENT LEARNING BASED SOLUTION FOR SOLVING A SUDOKU**

In this notebook, we aim to solve a Sudoku puzzle with a deep Q-learning (DQN) agent that chooses both a cell on the grid and the number to place in it.



**Reward Shaping:**

  

*   Correct placement → +100 reward
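*   Completed rows/columns/boxes → bonus added on each correct placement (+4 per complete row or column, +8 per complete box)
*   Filling the whole grid → +200 reward
*   Penalties → -3 for choosing an already filled cell, -2 for a wrong number, -5 per row/column/box conflict, and a per-step cost of -0.1 (-0.5 after 30% of the step budget), which encourages efficiency and accuracy.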
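
As a rough illustration of how these terms combine, the sketch below recomputes the reward for a single move from a few flags. The function `move_reward` and its parameters are hypothetical names introduced only for this example; the constants are the ones used by the environment later in the notebook, and the conflict penalties are left out for brevity.

```python
def move_reward(cell_filled, correct, rows_done=0, cols_done=0, boxes_done=0,
                step=0, max_steps=5000, solved=False):
    """Hypothetical helper that mirrors the reward terms of the environment."""
    reward = 0.0
    if cell_filled:
        reward -= 3            # chose a cell that already holds a number
    elif correct:
        reward += 100          # correct placement
        reward += 4 * rows_done + 4 * cols_done + 8 * boxes_done
    else:
        reward -= 2            # wrong number for an empty cell
    # The step cost grows once 30% of the step budget has been used
    reward += -0.1 if step < max_steps * 0.3 else -0.5
    if solved:
        reward += 200          # bonus for filling the whole grid
    return reward


# A correct early move that leaves one row and one box complete
print(round(move_reward(False, True, rows_done=1, boxes_done=1), 1))  # 111.9
# A late move onto an already filled cell
print(round(move_reward(True, False, step=2000), 1))                  # -3.5
```

The correct-placement reward dominates every other term, so a single right guess outweighs dozens of wrong ones.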


**State Representation:**



*   The 9×9 grid is flattened to an 81-length vector (0 marks an empty cell) and reshaped to 1×9×9 inside the network, so the CNN layers can capture spatial relationships.
*   Original grid stored to visualize changes and highlight agent decisions.



**Action Space:**

*   81 positions × 9 possible numbers = 729 discrete actions.
*   Each action is split into a position and a number via `divmod`. The intention is to let the agent choose both the best cell and the right number for that cell, which should make future decisions easier.
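
The mapping between a flat action index and a (row, column, number) triple can be sketched as follows. `decode_action` follows the same two `divmod` calls as the environment's `step` method; `encode_action` is a hypothetical inverse added here only to show that the mapping is one-to-one.

```python
def decode_action(action):
    """Split a flat action index (0..728) into (row, col, number)."""
    pos, num = divmod(action, 9)   # pos in 0..80, num in 0..8
    row, col = divmod(pos, 9)      # row and col in 0..8
    return row, col, num + 1       # Sudoku digits are 1..9


def encode_action(row, col, number):
    """Inverse mapping (hypothetical helper, not part of the notebook)."""
    return (row * 9 + col) * 9 + (number - 1)


print(decode_action(0))        # (0, 0, 1)
print(decode_action(728))      # (8, 8, 9)
print(encode_action(4, 2, 7))  # 348
```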





**Training Strategy:**



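*   Multi-attempt approach: the agent retries the same puzzle over several attempts to improve learning.
*   Epsilon-greedy policy: exploration starts at 1.0 and decays multiplicatively (×0.99) down to a floor of 0.01.
*   Experience replay (buffer capacity 10,000, batch size 64) stabilizes learning.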
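
To get a feel for how quickly exploration fades, the sketch below counts how many multiplicative decays it takes for epsilon to fall from 1.0 to the 0.01 floor with a factor of 0.99. The function name `steps_until_floor` is hypothetical; the three default values are the ones set in the agent later in the notebook.

```python
def steps_until_floor(epsilon=1.0, epsilon_min=0.01, decay=0.99):
    """Count multiplicative decays until epsilon is no longer above the floor."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= decay
        steps += 1
    return steps


print(steps_until_floor())  # 459
```

Because the agent decays epsilon on every training update, the policy becomes almost fully greedy after a few hundred updates; a decay factor closer to 1 would keep exploration alive for longer.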




**Observed Results:**
  

*   The agent gradually learns correct placements and can complete easier puzzles after repeated attempts.
*   Green highlights show cells updated by the agent in real-time, illustrating decision-making.



**Learning Dynamics:**



*   Early steps often penalized due to conflicts.
*   Total reward increases with correct placements and puzzle completions over attempts.
*   Epsilon decay reduces random actions, improving convergence.




**Challenges / Limitations:**

*   High-dimensional action space (729 actions) increases training complexity.
*   Puzzle completion is stochastic; agent may need multiple attempts per puzzle.

*   Training convergence can be slow for harder Sudoku puzzles due to sparse rewards.


**Future Improvements:**

*   Use Double DQN or Dueling DQN for more stable learning.

*   Incorporate curriculum learning: start with simpler puzzles, gradually increase difficulty.

*   Optimize CNN architecture to reduce overfitting and improve generalization across unseen puzzles.
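
As a minimal sketch of the first item, the function below computes Double DQN targets for a small batch: the online network selects the next action and the target network evaluates it. It uses plain Python lists so that it runs without PyTorch; the function name, argument names, and sample numbers are all illustrative assumptions, not code from this notebook.

```python
def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """Double DQN targets for a batch (illustrative, list-based version).

    next_q_online / next_q_target hold, per sample, the Q-values of the next
    state as estimated by the online and the target network respectively.
    """
    targets = []
    for r, d, q_on, q_tg in zip(rewards, dones, next_q_online, next_q_target):
        best = max(range(len(q_on)), key=lambda a: q_on[a])  # online net picks the action
        bootstrap = 0.0 if d else gamma * q_tg[best]          # target net evaluates it
        targets.append(r + bootstrap)
    return targets


targets = double_dqn_targets(
    rewards=[1.0, -2.0],
    dones=[False, True],
    next_q_online=[[0.2, 0.9, 0.1], [0.5, 0.4, 0.3]],
    next_q_target=[[1.0, 0.5, 2.0], [3.0, 1.0, 0.0]],
)
print([round(t, 3) for t in targets])  # [1.495, -2.0]
```

In the first sample a plain DQN target would bootstrap from the largest target-network value (2.0), whereas Double DQN uses the value of the action preferred by the online network (0.5), which tends to reduce overestimation.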

In [None]:
#import the necessary libraries

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from gym import spaces
from copy import deepcopy
import kagglehub
import os
from collections import deque
import random
import gc
from IPython.display import display, clear_output
import matplotlib.pyplot as plt

# Set CUDA or CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")



Using device: cuda


In [None]:
# Log in to Kaggle so that the dataset can be downloaded
kagglehub.login()

Kaggle credentials set.
Kaggle credentials successfully validated.


In [None]:
# Load puzzles and solutions
sudoku_puzzles, sudoku_solutions = [], []
with open('./3/sudoku.csv', 'r') as f:
    for line in f.readlines()[1:]:
        puzzle, solution = line.strip().split(",")
        sudoku_puzzles.append([int(i) for i in puzzle])
        sudoku_solutions.append([int(i) for i in solution])

# Convert to tensors and reshape to 9x9 grids
sudoku_puzzles = torch.tensor(sudoku_puzzles, dtype=torch.float32).view(-1, 9, 9).to(device)
sudoku_solutions = torch.tensor(sudoku_solutions, dtype=torch.float32).view(-1, 9, 9).to(device)


In [None]:
#function that returns a random puzzle from the dataset

def get_random_puzzle():
    idx = np.random.randint(len(sudoku_puzzles))
    return sudoku_puzzles[idx], sudoku_solutions[idx]

In [None]:

# Gym environment for Sudoku
class SudokuEnv(gym.Env):
    def __init__(self):
        super(SudokuEnv, self).__init__()
        self.action_space = spaces.Discrete(81 * 9)  # 81 positions × 9 possible numbers
        self.grid = None
        self.solution = None
        self.current_step = 0
        self.max_steps = 5000

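    def reset(self):
        puzzle, solution = get_random_puzzle()
        # Clone so that in-place edits during an episode never modify the dataset tensors
        self.grid, self.solution = puzzle.clone(), solution.clone()
        self.current_step = 0
        return self.grid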

    def check_subgoal_completion(self):
        # Only count rows, columns, or boxes as complete if all entries match the solution
        row_complete = [torch.all(self.grid[row, :] == self.solution[row, :]) for row in range(9)]
        col_complete = [torch.all(self.grid[:, col] == self.solution[:, col]) for col in range(9)]
        # row and col step through 0, 3, 6, the top-left corner of each 3x3 box
        box_complete = [
            torch.all(self.grid[row:row + 3, col:col + 3] == self.solution[row:row + 3, col:col + 3])
            for row in range(0, 9, 3) for col in range(0, 9, 3)
        ]
        return row_complete, col_complete, box_complete

    def step(self, action):
        pos, num = divmod(action, 9)  # Convert action to position and number
        num += 1  # Sudoku numbers are 1-9
        row, col = divmod(pos, 9)  # Convert position to row and column

        # Reward structure
        reward = 0.0  # Kept as a plain Python float throughout

        if self.grid[row, col] != 0:  # Cell already filled
            reward -= 3
        elif num == self.solution[row, col]:  # Correct placement
            reward += 100  # High reward for correct placement
            self.grid[row, col] = num
            row_complete, col_complete, box_complete = self.check_subgoal_completion()
            # Bonus for every currently complete row, column, and box;
            # int() converts the summed tensors to plain Python numbers
            reward += int(sum(row_complete)) * 4 + int(sum(col_complete)) * 4 + int(sum(box_complete)) * 8
        else:
            reward -= 2  # Wrong number; the grid is left unchanged

        # Step penalty to discourage long episodes (larger after 30% of max_steps)
        reward += -0.1 if self.current_step < self.max_steps * 0.3 else -0.5

        # Check for conflicts.
        # Note: only correct numbers are ever written to the grid, so for a valid
        # puzzle these counts should not exceed 1 and the penalties below rarely apply.
        if torch.sum(self.grid[row, :] == num) > 1: reward -= 5  # Row conflict
        if torch.sum(self.grid[:, col] == num) > 1: reward -= 5  # Column conflict
        if torch.sum(self.grid[row//3*3:(row//3+1)*3, col//3*3:(col//3+1)*3] == num) > 1: reward -= 5  # Box conflict

        self.current_step += 1
        done = torch.all(self.grid != 0).item()  # Done once no cell is empty

        if done:
            reward += 200  # High reward for completion

        return self.grid, reward, done, {}

    def render(self, original_grid):
        # Highlight changes by comparing with the original grid
        for i in range(9):
            row = ""
            for j in range(9):
                if self.grid[i, j] != original_grid[i, j]:
                    row += f"\033[92m{int(self.grid[i, j])} \033[0m"  # Highlight changed cells in green
                else:
                    row += f"{int(self.grid[i, j])} "  # Original cells in default color
            print(row)
        print("\n")


In [None]:

# Deep Q-Network with Convolutional layers
class QNetwork(nn.Module):
    def __init__(self, input_size, output_size):
        super(QNetwork, self).__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Flatten()
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(128 * 9 * 9, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, output_size)
        )

    def forward(self, x):
        x = x.view(-1, 1, 9, 9)  # (batch_size, channels, height, width)
        x = self.conv_layers(x)
        return self.fc_layers(x)


In [None]:
# Replay buffer for experience replay
class ReplayBuffer:
    def __init__(self, max_size=10000):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return zip(*batch)

    def size(self):
        return len(self.buffer)
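
The two ideas behind the buffer, eviction of old transitions through `deque(maxlen=...)` and transposing a sampled batch with `zip(*batch)`, can be seen in isolation in the short sketch below. The capacity of 3 and the string placeholders for states are illustrative choices only.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # tiny capacity so that eviction is visible
for t in range(5):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((f"s{t}", t, float(t), f"s{t + 1}", t == 4))

# Only the three most recent transitions remain
print([exp[0] for exp in buffer])  # ['s2', 's3', 's4']

# zip(*batch) turns a list of 5-tuples into 5 tuples, one per field
batch = random.sample(list(buffer), 2)
states, actions, rewards, next_states, dones = zip(*batch)
print(len(states), len(actions), len(dones))  # 2 2 2
```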

In [None]:

# DQN Agent with replay buffer
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.q_network = QNetwork(state_size, action_size).to(device)
        self.target_network = QNetwork(state_size, action_size).to(device)
        # Start the target network from the same weights as the online network
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.002)
        self.loss_fn = nn.MSELoss()
        self.gamma = 0.99
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.99
        self.train_steps = 0
        self.target_update_every = 500  # Assumed sync interval (not from the original notebook); tune as needed

    def act(self, state):
        # Epsilon-greedy: random action with probability epsilon, otherwise greedy
        if np.random.rand() <= self.epsilon:
            return np.random.randint(self.action_size)
        with torch.no_grad():
            state = state.to(device)
            q_values = self.q_network(state)
            return torch.argmax(q_values).item()

    def train(self, experience_batch):
        states, actions, rewards, next_states, dones = experience_batch
        states = torch.stack(states).view(-1, self.state_size).to(device)
        next_states = torch.stack(next_states).view(-1, self.state_size).to(device)
        actions = torch.tensor(actions, dtype=torch.long, device=device)
        rewards = torch.tensor(rewards, dtype=torch.float32, device=device)
        dones = torch.tensor(dones, dtype=torch.float32, device=device)

        # Q(s, a) for the actions that were actually taken
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():  # Targets are treated as constants
            next_q_values = self.target_network(next_states).max(1)[0]
            target_q_values = rewards + self.gamma * next_q_values * (1 - dones)

        loss = self.loss_fn(q_values, target_q_values)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Periodically copy the online weights into the target network
        self.train_steps += 1
        if self.train_steps % self.target_update_every == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())

        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
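
The target computed in `train` is the usual one-step Q-learning target, r + γ · max Q(s', a') · (1 − done). A small worked example with made-up numbers may help; `td_target` is a hypothetical scalar helper, not a function of the agent.

```python
def td_target(reward, next_q_max, done, gamma=0.99):
    """One-step Q-learning target for a single transition."""
    return reward + gamma * next_q_max * (1.0 - float(done))


# Non-terminal step: the discounted best next value is added to the reward
print(round(td_target(100.0, 50.0, done=False), 2))  # 149.5
# Terminal step: no bootstrapping, the target is just the reward
print(round(td_target(300.0, 50.0, done=True), 2))   # 300.0
```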





In [None]:

# Training the agent
def train_and_resolve_multiple_times(agent, replay_buffer, num_attempts=10, max_steps=None, render_frequency=100):
    env = SudokuEnv()

    for attempt in range(num_attempts):
        print(f"\n--- Attempt {attempt + 1} at solving the puzzle ---")

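        # Reset environment for a new attempt
        if attempt == 0:
            # clone(): flatten() can return a view of the live grid, which step() edits in place
            state = env.reset().flatten().clone().to(device)
            original_grid = env.grid.clone()  # Save the original puzzle
        else:
            env.grid = original_grid.clone()  # Reset to original puzzle for re-solving
            env.current_step = 0  # Restart the step counter used by the step penalty
            state = env.grid.flatten().clone().to(device)

        done = False
        total_reward = 0
        step_count = 0

        while not done:
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            # Copy the grid so that each stored transition keeps its own snapshot
            next_state = next_state.flatten().clone().to(device)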

            # Visualize the state every 'render_frequency' steps
            if step_count % render_frequency == 0:
                clear_output(wait=True)  # Replace the previous rendering
                print(f"Attempt {attempt + 1}, Step {step_count}:")
                env.render(original_grid)  # Pass the original grid to highlight changes

            if replay_buffer.size() > 64:
                agent.train(replay_buffer.sample(64))

            replay_buffer.add((state, action, reward, next_state, done))

            state = next_state
            total_reward += reward
            step_count += 1

            if max_steps and step_count >= max_steps:
                break  # Terminate if max steps are reached

        print(f"Attempt {attempt + 1} completed in {step_count} steps. Total Reward: {total_reward}")

        # Reduce exploration after every attempt (in addition to the per-update decay in agent.train)
        agent.epsilon = max(agent.epsilon * agent.epsilon_decay, agent.epsilon_min)



In [None]:
#Initialize the agent and environment for multiple attempts
replay_buffer = ReplayBuffer()
agent = DQNAgent(state_size=81, action_size=81 * 9)

# Run the training and repeated solving process for a single puzzle
train_and_resolve_multiple_times(agent, replay_buffer, num_attempts=10)