**<h2 align="center" style="color:brown;font-size:200%">Lab 2: Tic-Tac-Toe Game MDP in OpenAI Gym</h2>**


# Modeling Tic-Tac-Toe as a Markov Decision Process (MDP) for Reinforcement Learning

## Introduction
Tic-Tac-Toe, a simple yet strategic two-player game, is a compact case study for reinforcement learning and Markov Decision Processes (MDPs). In this project, we frame the game as an MDP so that an agent can learn strategies through interaction and feedback. By defining the states, actions, rewards, and transitions, and then the policy that maps states to actions, we aim to build a reinforcement learning model of the game dynamics. Implementing the environment with OpenAI Gym makes it reusable and easier to experiment with.


## Problem Statement
The challenge is to design and implement a reinforcement learning framework for Tic-Tac-Toe, using MDP principles. The goal is to enable an agent to learn the optimal strategy through iterative play, considering the constraints of the game, such as available moves, game outcomes, and rewards. This study also seeks to integrate the model with OpenAI Gym, a popular library for developing and testing reinforcement learning algorithms.


## Objectives
1. To define the MDP components for Tic-Tac-Toe, including states, actions, rewards, and transitions.
2. To implement a reinforcement learning model capable of learning the optimal policy for playing the game.
3. To integrate the model with OpenAI Gym, creating an environment for experimentation.
4. To analyze the agent's learning process and evaluate its performance using policy and value functions.
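
To make the first objective concrete, the MDP components can be sketched in plain Python before any Gym wrapper is involved. This is a minimal illustration, not the notebook's implementation: the names `State`, `WIN_LINES`, `valid_actions`, `winner`, `transition`, and `reward` are assumptions introduced here, and the board is stored as a flat, hashable tuple (instead of the 3x3 NumPy array used in the cells below) so that states could later serve as dictionary keys.

```python
from typing import List, Optional, Tuple

State = Tuple[int, ...]  # 9 cells, row-major: 1 = X, -1 = O, 0 = empty

# The eight winning lines, as index triples into the flat board.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def valid_actions(state: State) -> List[int]:
    """Action set A(s): indices of the empty cells."""
    return [i for i, cell in enumerate(state) if cell == 0]


def winner(state: State) -> Optional[int]:
    """Return 1 or -1 for a win, 0 for a draw, None if the game is still open."""
    for a, b, c in WIN_LINES:
        if state[a] != 0 and state[a] == state[b] == state[c]:
            return state[a]
    return 0 if 0 not in state else None


def transition(state: State, action: int, player: int) -> State:
    """Place the player's mark on an empty cell and return the new state."""
    if state[action] != 0:
        raise ValueError(f"cell {action} is already occupied")
    return state[:action] + (player,) + state[action + 1:]


def reward(state: State, player: int) -> int:
    """Terminal reward from `player`'s perspective: +1 win, -1 loss, 0 otherwise."""
    result = winner(state)
    if result is None or result == 0:
        return 0
    return 1 if result == player else -1


empty: State = (0,) * 9
s1 = transition(empty, 4, 1)      # X takes the centre
print(valid_actions(s1))          # [0, 1, 2, 3, 5, 6, 7, 8]
print(winner(s1), reward(s1, 1))  # None 0
```

The +1 / 0 / -1 reward scheme mirrors the one used by the environment class below; other reward designs are equally possible.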


In [7]:
import numpy as np
from gym import Env, spaces


class EnhancedTicTacToeEnv(Env):
    """
    Enhanced Tic-Tac-Toe environment with detailed state, action, and reward tracking.
    """
    def __init__(self):
        super(EnhancedTicTacToeEnv, self).__init__()

        # Action space: 9 possible positions (0 to 8)
        self.action_space = spaces.Discrete(9)

        # Observation space: Board with values (-1: O, 0: empty, 1: X)
        self.observation_space = spaces.Box(low=-1, high=1, shape=(3, 3), dtype=int)

        self.reset()

    def reset(self):
        """
        Reset the game board and other variables for a new game.
        """
        self.board = np.zeros((3, 3), dtype=int)
        self.current_player = 1  # Player X starts
        self.done = False
        self.state_history = []  # Track all states
        self.action_history = []  # Track all actions
        self.reward_history = []  # Track all rewards
        return self.board

    def step(self, action):
        """
        Execute a move and transition to the next state.
        """
        row, col = divmod(action, 3)

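        # An invalid move (occupied cell) ends the episode with a penalty.
        # Set self.done as well, so loops that test env.done also stop.
        if self.board[row, col] != 0:
            self.done = True
            self.reward_history.append(-10)
            return self.board, -10, True, {"invalid_move": True, "action": action}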

        # Update board and record the move and the resulting state
        self.board[row, col] = self.current_player
        self.action_history.append((self.current_player, action))
        self.state_history.append(self.board.copy())

        # Check for a winner or draw. Only the player who just moved can have
        # completed a line, so a non-zero winner is always the current player.
        winner = self._check_winner()
        if winner is not None:
            self.done = True
            reward = 0 if winner == 0 else 1  # 0: draw, 1: current player wins
            self.reward_history.append(reward)
            return self.board, reward, True, {"result": "draw" if winner == 0 else "win"}

        # Switch player
        self.current_player *= -1

        # Game continues
        reward = 0  # No reward for non-terminal states
        self.reward_history.append(reward)
        return self.board, reward, False, {"action": action}

    def _check_winner(self):
        """
        Check the board for a winner or if the game is a draw.
        """
        for player in [1, -1]:
            # Check rows, columns, and diagonals
            if any(np.all(self.board == player, axis=0)) or \
               any(np.all(self.board == player, axis=1)) or \
               np.all(np.diag(self.board) == player) or \
               np.all(np.diag(np.fliplr(self.board)) == player):
                return player

        # Check for draw
        if np.all(self.board != 0):
            return 0  # Draw

        return None  # Game is not finished

    def render(self):
        """
        Render the current board state.
        """
        symbols = {0: " ", 1: "X", -1: "O"}
        for row in self.board:
            print("|".join([symbols[cell].center(3) for cell in row]))
            print("-" * 11)
        print()

    def get_valid_actions(self):
        """
        Get all valid actions (empty spaces).
        """
        return np.where(self.board.flatten() == 0)[0]


# Play a more detailed game
def play_detailed_game(env, max_steps=9):
    state = env.reset()
    env.render()

    step_count = 0
    while not env.done and step_count < max_steps:
        valid_actions = env.get_valid_actions()
        if not valid_actions.size:
            print("No valid moves left. Game ends in a draw.")
            break

        # Remember who is moving: step() switches players only on
        # non-terminal moves, so env.current_player is ambiguous afterwards
        mover = "X" if env.current_player == 1 else "O"

        # Randomly choose a valid action
        action = np.random.choice(valid_actions)
        state, reward, done, info = env.step(action)

        step_count += 1
        print(f"Step {step_count}:")
        print(f"Player {mover} places at position {action}")
        env.render()
        print(f"Reward: {reward}, Info: {info}")

        if done:
            if reward == 1:
                print(f"Player {mover} wins!")
            elif reward == 0:
                print("It's a draw!")
            break

    print("\nGame Summary:")
    print("Action History:")
    for i, (player, action) in enumerate(env.action_history):
        print(f"Turn {i+1}: Player {'X' if player == 1 else 'O'} at position {action}")
    print("\nRewards History:")
    print(env.reward_history)


# Play the enhanced game
env = EnhancedTicTacToeEnv()
play_detailed_game(env)


   |   |   
-----------
   |   |   
-----------
   |   |   
-----------

Step 1:
Player X places at position 8
   |   |   
-----------
   |   |   
-----------
   |   | X 
-----------

Reward: 0, Info: {'action': 8}
Step 2:
Player O places at position 6
   |   |   
-----------
   |   |   
-----------
 O |   | X 
-----------

Reward: 0, Info: {'action': 6}
Step 3:
Player X places at position 4
   |   |   
-----------
   | X |   
-----------
 O |   | X 
-----------

Reward: 0, Info: {'action': 4}
Step 4:
Player O places at position 7
   |   |   
-----------
   | X |   
-----------
 O | O | X 
-----------

Reward: 0, Info: {'action': 7}
Step 5:
Player X places at position 1
   | X |   
-----------
   | X |   
-----------
 O | O | X 
-----------

Reward: 0, Info: {'action': 1}
Step 6:
Player O places at position 0
 O | X |   
-----------
   | X |   
-----------
 O | O | X 
-----------

Reward: 0, Info: {'action': 0}
Step 7:
Player X places at position 2
 O | X | X 
-----------
   | X |   
-

In [8]:
import numpy as np

class TicTacToeRewardSystem:
    def __init__(self):
        self.reset()

    def reset(self):
        self.state = np.zeros((3, 3), dtype=int)  # 0 for empty, 1 for X, -1 for O
        self.current_player = 1  # 1: X, -1: O
        self.done = False
        self.x_reward = 0
        self.o_reward = 0

    def render(self):
        symbols = {0: " ", 1: "X", -1: "O"}
        print("\nCurrent Board:")
        for row in self.state:
            print("|".join([symbols[cell].center(3) for cell in row]))
            print("-" * 11)
        print()

    def check_winner(self):
        # Check rows, columns, and diagonals for winner
        for player in [1, -1]:
            if any(np.all(self.state == player, axis=0)) or \
               any(np.all(self.state == player, axis=1)) or \
               np.all(np.diag(self.state) == player) or \
               np.all(np.diag(np.fliplr(self.state)) == player):
                return player
        if np.all(self.state != 0):  # Draw
            return 0
        return None  # No winner yet

    def step(self, action):
        row, col = divmod(action, 3)

        if self.state[row, col] != 0:
            penalty = -5  # Penalty for invalid moves
            if self.current_player == 1:
                self.x_reward += penalty
            else:
                self.o_reward += penalty
            return self.state, f"Invalid move by Player {'X' if self.current_player == 1 else 'O'}. Penalty: {penalty}", False

        # Valid move
        self.state[row, col] = self.current_player
        reward = 1  # Reward for making a move
        reason = "Move made."

        # Heuristic bonus: the new mark touches one of the player's own marks
        # (adjacency only; this does not verify an actual two-in-a-row)
        neighbors = [
            (row-1, col), (row+1, col), (row, col-1), (row, col+1),
            (row-1, col-1), (row-1, col+1), (row+1, col-1), (row+1, col+1)
        ]
        adjacency_bonus = any(
            0 <= r < 3 and 0 <= c < 3 and self.state[r, c] == self.current_player
            for r, c in neighbors
        )
        if adjacency_bonus:
            reward += 3
            reason = "Strategic placement towards victory."

        # Heuristic bonus: the new mark touches an opponent mark
        # (a rough proxy for blocking; no real threat is checked)
        opponent = -self.current_player
        block_bonus = any(
            0 <= r < 3 and 0 <= c < 3 and self.state[r, c] == opponent
            for r, c in neighbors
        )
        if block_bonus:
            reward += 2
            reason = "Blocked opponent's potential win."

        if self.current_player == 1:
            self.x_reward += reward
        else:
            self.o_reward += reward

        # Check for game end
        winner = self.check_winner()
        if winner is not None:
            self.done = True
            if winner == 1:
                return self.state, "Player X wins!", True
            elif winner == -1:
                return self.state, "Player O wins!", True
            else:
                return self.state, "Game ends in a draw.", True

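        # Switch player. After the switch, current_player == -1 means X just
        # moved, so the message below names the player who made this move.
        self.current_player *= -1
        return self.state, f"Player {'X' if self.current_player == -1 else 'O'}'s turn. Reward: {reward}. Reason: {reason}", False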

    def cumulative_rewards(self):
        return f"Cumulative Rewards - Player X: {self.x_reward}, Player O: {self.o_reward}"

# Play the game
def play_game():
    game = TicTacToeRewardSystem()
    game.render()

    for step in range(9):
        valid_moves = [i for i in range(9) if game.state[i // 3, i % 3] == 0]
        if not valid_moves:
            print("No valid moves left. Game ends in a draw.")
            break

        action = np.random.choice(valid_moves)
        state, message, done = game.step(action)
        print(message)
        game.render()
        print(game.cumulative_rewards())

        if done:
            print("Game Over!")
            break

play_game()



Current Board:
   |   |   
-----------
   |   |   
-----------
   |   |   
-----------

Player X's turn. Reward: 1. Reason: Move made.

Current Board:
   |   |   
-----------
   |   |   
-----------
   |   | X 
-----------

Cumulative Rewards - Player X: 1, Player O: 0
Player O's turn. Reward: 1. Reason: Move made.

Current Board:
   | O |   
-----------
   |   |   
-----------
   |   | X 
-----------

Cumulative Rewards - Player X: 1, Player O: 1
Player X's turn. Reward: 1. Reason: Move made.

Current Board:
   | O |   
-----------
   |   |   
-----------
 X |   | X 
-----------

Cumulative Rewards - Player X: 2, Player O: 1
Player O's turn. Reward: 4. Reason: Strategic placement towards victory.

Current Board:
 O | O |   
-----------
   |   |   
-----------
 X |   | X 
-----------

Cumulative Rewards - Player X: 2, Player O: 5
Player X's turn. Reward: 6. Reason: Blocked opponent's potential win.

Current Board:
 O | O |   
-----------
 X |   |   
-----------
 X |   | X 
-----------

## Conclusion
In conclusion, modeling the Tic-Tac-Toe game as a Markov Decision Process (MDP) for reinforcement learning provides a structured approach to solving the game. By breaking the game down into states, actions, rewards, and transitions, we create a dynamic environment where an agent can interact, learn from feedback, and improve its performance. Through iterative learning, the agent can explore various strategies and gradually converge toward an optimal policy that maximizes its chances of winning.
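
The cells above only play random moves, so the learning step described here is not shown in the notebook. As a rough sketch of how an agent might converge toward a good policy, the following uses tabular Q-learning for player X against a uniformly random opponent. Everything in it is an assumption made for illustration: the helper names (`winner`, `legal`, `place`, `env_step`, `train`), the flat-tuple board encoding, the opponent model, and the hyperparameters `alpha`, `gamma`, and `epsilon`.

```python
import random
from collections import defaultdict

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """1 or -1 for a win, 0 for a draw, None if the game is still open."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0 if 0 not in board else None


def legal(board):
    """Indices of the empty cells."""
    return [i for i, v in enumerate(board) if v == 0]


def place(board, action, player):
    """Return a new board tuple with the player's mark at `action`."""
    return board[:action] + (player,) + board[action + 1:]


def env_step(board, action, rng):
    """Agent X moves, then a uniformly random opponent O replies.

    Returns (next_board, reward, done) from X's perspective.
    """
    board = place(board, action, 1)
    result = winner(board)
    if result is not None:
        return board, float(result), True   # +1 for a win, 0 for a draw
    board = place(board, rng.choice(legal(board)), -1)
    result = winner(board)
    if result is not None:
        return board, float(result), True   # -1 when O completes a line
    return board, 0.0, False


def train(episodes=5000, alpha=0.3, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q[(state, action)], unseen pairs default to 0
    for _ in range(episodes):
        state, done = (0,) * 9, False
        while not done:
            actions = legal(state)
            if rng.random() < epsilon:
                action = rng.choice(actions)                        # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            nxt, r, done = env_step(state, action, rng)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in legal(nxt))
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
            state = nxt
    return q


q_table = train()
start = (0,) * 9
print("Greedy opening move:", max(legal(start), key=lambda a: q_table[(start, a)]))
```

Because the opponent here plays randomly, the learned values reflect play against that opponent only; training against a stronger or self-play opponent would be a separate design choice.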

MDPs suit this problem because the outcome of each move depends both on the agent's action and on factors outside its control. The rules of Tic-Tac-Toe are deterministic, so from the agent's point of view the randomness comes from the opponent's replies. Reinforcement learning, in turn, lets the agent adjust its strategy based on the rewards it receives for each action.
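
One way to see where the randomness enters is to write out the transition distribution P(s' | s, a) from the agent's side. The sketch below assumes the agent plays X and that the opponent replies uniformly at random, which is only one possible opponent model; `is_terminal` and `successor_distribution` are hypothetical helper names, and the board is a flat tuple of 9 cells (1 = X, -1 = O, 0 = empty).

```python
from fractions import Fraction

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def is_terminal(board):
    """True if either player has three in a row or the board is full."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return True
    return 0 not in board


def successor_distribution(board, action):
    """Map each possible next state to its probability after X plays `action`.

    Assumes the opponent O answers uniformly at random over the empty cells.
    """
    if board[action] != 0:
        raise ValueError("cell already occupied")
    after_x = board[:action] + (1,) + board[action + 1:]
    if is_terminal(after_x):
        return {after_x: Fraction(1)}  # game over: no opponent reply
    empties = [i for i, v in enumerate(after_x) if v == 0]
    p = Fraction(1, len(empties))
    return {after_x[:i] + (-1,) + after_x[i + 1:]: p for i in empties}


dist = successor_distribution((0,) * 9, 4)   # X opens in the centre
print(len(dist), sum(dist.values()))         # 8 1
```

After X's opening move there are eight equally likely successor states, one per possible reply, which is the stochastic transition the agent has to plan against.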

Furthermore, implementing the environment on top of OpenAI Gym exposes the game through the standard reset and step interface, so an agent can be trained and evaluated with a wide range of RL algorithms without changing the environment code. This lays the groundwork for more complex and dynamic learning environments in the future.

Ultimately, this project demonstrates the power of MDPs and reinforcement learning in solving structured decision-making problems, such as Tic-Tac-Toe, and highlights their applicability in broader real-world applications, such as robotics, game playing, and automated decision-making systems.
