**Building Tic Tac Toe Game using Reinforcement learning**


First we will start by importing libraries we need:

   Numpy: used for numerical operations, array manipulation, and mathematical functions in Python.

   pickle: allows us to serialize and deserialize Python objects, which is useful for saving and loading data structures.

Define two constants that will be used to create and manipulate the game board

In [20]:
import numpy as np
import pickle
# Define BOARD_Rows and BOARD_Cols
BOARD_Rows = 3
BOARD_Cols = 3

Now let's start with the **state class** to represents the current configuration of the game board. It encapsulates all relevant information needed to describe the game state at any given moment


**Essentially,** the state class holds data about where Xs and Os are placed on the board, which squares are empty, and whose turn it is (whether it’s Player X or Player O)

In [21]:
class state:
    def __init__(self,p1,p2,board):
        self.board=np.zeros((BOARD_Cols,BOARD_Rows))
        self.p1=p1
        self.p2=p2
        self.isEnd=False
        self.boardHash=None
        self.board=board
        #init p1 player first
        self.playerSymbol=1
       #get unique hash of current board state
    def gethash(self):
        self.boardHash=str(self.board.reshape(BOARD_Cols,BOARD_Rows))
        return self.boardHash
    def winner(self):
        #ROW
        for i in range (BOARD_Rows):
            if sum(self.board[i,:])==3:
                self.isEnd=True
                return 1
            if sum(self.board[i,:])==-3:
                self.isEnd=True
                return -1
         #column
        for j in range(BOARD_Cols): #loop for column check
            if sum(self.board[:,j])==3:
              self.isEnd=True
              return 1
            if sum(self.board[:, j]) == -3:
               self.isEnd = True
               return -1
         #Diagonal
        diag_sum1=sum([self.board[i,i]for i in range (BOARD_Cols)])
        diag_sum2=sum([self.board[i,BOARD_Cols-i-1]for i in range(BOARD_Cols)])
        diag_sum=max(diag_sum1,diag_sum2)
        if diag_sum==3 or diag_sum2 == 3:  # Check for 'x' win (1)
            self.isEnd=True
            return 1
        if diag_sum==-3 or diag_sum2 == -3:  # Check for 'o' win (-1)
            self.isEnd=True
            return -1
         #No available positions
        if len(self.availablepositions())==0:
            self.isEnd=True
            return 0
        #Not End
        self.isEnd=False
        return None
    def availablepositions(self):
        positions=[]
        for i in range(BOARD_Rows):
            for j in range(BOARD_Cols):
                if self.board[i,j]==0:
                    positions.append((i,j))#Need to be tuple
        return positions
    def updatestate(self,positions):
        self.board[positions]=self.playerSymbol
        #Switch to another player
        self.playerSymbol=-1 if self.playerSymbol==1 else 1
    #Only when game Ends!
    def giveReward(self):
        result=self.winner()
        #Back propogate reward
        if result==1:
            self.p1.feedReward(1)
            self.p2.feedReward(2)
        elif result==-1:
            self.p1.feedReward(0)
            self.p2.feedReward(1)
        else:
            self.p1.selfReward(0.1)
            self.p1.selfReward(0.5)
      #Board reset
    def reset(self):
        self.board=np.zeros((BOARD_Rows,BOARD_Cols))
        self.boardHash=None
        self.isEnd=False
        self.playerSymbol=1
    def play(self,rounds=100):
        for i in range (rounds):
            if i%1000==0:
                print("Rounds{}".format(i))
            while not self.isEnd:
                #player1
                positions=self.availablepositions()
                p1_action=self.p1.chooseAction(positions,self.board,self.playerSymbol)
                #Take action and update board state
                self.updatestate(p1_action)
                board_hash=self.gethash()
                self.p1.addstate(board_hash)
                #Check board state if it's end
                win=self.winner()
                if win is not None:
                    #self.showBoard()
                    #ended with p1 either win or draw
                    self.giveReward()
                    self.p1.reset()
                    self.p2.reset()
                    self.reset()
                    break
                else: #player2
                    positions=self.availablepositions()
                    p2_action=self.p2.chooseAction(positions,self.board,self.playerSymbol)
                    self.updatestate(p2_action)
                    board_hash=self.gethash()
                    self.p2.addstate(board_hash)
                    win=self.winner()
                    if win is not None:
                        self.giveReward()
                        self.p1.reset()
                        self.p2.reset()
                        self.reset()
                        break
    #play with human
    def play2(self):
        while not self.isEnd: #player1
            positions=self.availablepositions()
            p1_action=self.p1.chooseAction(positions,self.board,self.playerSymbol)
            #Take Action and update state
            self.updatestate(p1_action)
            self.ShowBoard()
                    # Check if p1_action is None
            if p1_action is None:
               print("No valid moves for player 1. It's a tie!")
               self.reset()
               break
            #check board status if it's end
            win=self.winner()
            if win is not None: #check if the game has Ended
                if win==1:
                    print(self.p1.name,"Wins!")
                elif win==-1:
                  print(self.p2.name,"Wins!") #print the correct winner
                else:
                    print("Tie")
                self.reset()
                break
            else: #player2
                positions=self.availablepositions()
                p2_action=self.p2.chooseAction(positions)
                self.updatestate(p2_action)
                self.ShowBoard()
                        # Check if p1_action is None
                if p2_action is None:
                   print("No valid moves for player 2. It's a tie!")
                   self.reset()
                   break
                win=self.winner()
                if win is not None:
                    if win==-1:
                        print(self.p2.name,"Wins!")
                    elif win==1:
                      print(self.p1.name,"Wins!")
                    else:
                        print("Tie!")
                    self.reset()
                    break
    def ShowBoard(self):
        #p1:X p2:O
        for i in range(0,BOARD_Rows):
            print("----------")
            out="|"
            for j in range(0,BOARD_Cols):
                token=" "
                if self.board[i,j]==1:
                    token="x"
                if self.board[i,j]==-1:
                    token="o"
                out+=token+"|"
            print(out)
        print("----------")

our next step is to start with **player class** it represents an individual player—either ‘X’ or ‘O.’ It encapsulates information about the player’s identity, behavior, and moves during the game

**the player class** is essential for managing interactions between players and the game board

In [22]:
class player:
    def __init__(self,name,exp_rate=0.3):
        self.name=name
        self.states=[] #record all positions taken
        self.lr=0.2
        self.exp_rate=exp_rate
        self.decay_gamma=0.9
        self.states_value={} #states-->value
    def getHash(self,board):
        boardHash=str(board.reshape(BOARD_Cols*BOARD_Rows))
        return boardHash
    def chooseAction(self,positions,current_board,symbol):
        if not positions:  # Check if positions is empty
           return None  # Return None to indicate no move possible

        if np.random.uniform(0,1)<=self.exp_rate:
            #Take random Action
            idx=np.random.choice(len(positions))
            action=positions[idx]
        else:
            value_max=-999
            for p in positions:
                next_board=current_board.copy()
                next_board[p]=symbol
                next_boardHash=self.getHash(next_board)
                value=0 if self.states_value.get(next_boardHash)is None else self.states_value.get(next_boardHash)
                #print("value",value)
                if value>=value_max:
                    value_max=value
                    action=p
         # If no action found (positions empty), return a default or raise an exception.
        # In this case, we've already handled empty positions, so we can safely return action
            return action
    def addstate(self,state):
        self.states.append(state)
        #At the end of game ,backpropagate and update states value
    def feedReward(self,reward):
        for st in reversed(self.states):
            if self.states_value.get(st)is None:
                self.states_value[st]=0
            self.states_value[st]+=self.lr*(self.decay_gamma*reward-self.states_value[st])
            reward=self.states_value[st]
    def reset(self):
        self.states=[]
    def savepolicy(self):
        fw=open("policy"+str(self.name),"wb")
        pickle.dump(self.states_value,fw)
        fw.close()
    def loadpolicy(self,file):
        fr=open(file,"rb")
        self.states_value=pickle.load(fr)
        fr.close()

Now let's dive into **Human class**

In [23]:
class Humanplayer:
    def __init__(self,name):
        self.name=name
    def chooseAction(self,positions):
        while True:
            row=int(input("Input your action row: "))
            col=int(input("Input your action col: "))
            action=(row,col)
            if action in positions:
                return action
    #Append a hash state
    def addstate(self,state):
        pass
    def feedReward(self,reward):
        pass
    def reset(self):
        pass


**Last but not least** we created a 2D Numpy array This array represents the Tic-Tac-Toe board, where each cell can be empty (0), contain an ‘X’, or an ‘O

we created two player instances **p1,p2**

we created an instance of the state class (st).

Assuming that the state class handles game logic, it will manage the game state transitions, player moves, and checking for a winner or a draw

we initiated a training loop with st.play(50).
This loop runs 50 rounds of the game (assuming each round is a complete game of Tic-Tac-Toe).

In [24]:
board = np.zeros((BOARD_Rows, BOARD_Cols))
p1 = player("p1")
p2 = player("p2")
st = state( p1,p2,board) # Assuming 'board' is intended as the initial board state
print("Training....")
st.play(50)

Training....
Rounds0


**Finally**:

we created two players:

p1: This player is named “computer” and has an exploration rate of 0 (meaning it won’t explore random moves during training).

p2: This player is a human player named “Human.

**Saving and Loading Policies:**
Before initializing p1, we saved its policy using p1.savepolicy().
Later, we loaded the policy from a file called “policyp1” using p1.loadpolicy("policyp1").

In [25]:
p1.savepolicy()
p1=player("computer",exp_rate=0)
p1.loadpolicy("policyp1")
p2=Humanplayer("Human")
board = np.zeros((BOARD_Rows, BOARD_Cols))
st=state(p1,p2,board)
st.play2()

----------
| | | |
----------
| | | |
----------
| | |x|
----------
Input your action row: 0
Input your action col: 2
----------
| | |o|
----------
| | | |
----------
| | |x|
----------
----------
| | |o|
----------
| | | |
----------
| |x|x|
----------
Input your action row: 2
Input your action col: 0
----------
| | |o|
----------
| | | |
----------
|o|x|x|
----------
----------
| | |o|
----------
| | |x|
----------
|o|x|x|
----------
Input your action row: 1
Input your action col: 1
----------
| | |o|
----------
| |o|x|
----------
|o|x|x|
----------
Human Wins!
