<a href="https://colab.research.google.com/github/Culmenus/Teymi5-TV2/blob/main/Connect_3_Teymisverkefni_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Team Project 2**

In team project 2 the goal is to learn a policy by playing against another policy that is also learning (i.e. both sides are learning). See the discussion of self-play and tic-tac-toe in Sutton and Barto (first chapter).

Each team submits a **teymiX.npy** file containing a policy $\pi(s,a)$, where $s$ is defined using Zobrist hashing (see below), together with a readable and easy-to-follow implementation of the algorithm.

You are free to choose any reinforcement learning algorithm that we have covered in this course module. For example, it may be TD($0$), MC, TD($\lambda$), $Q$-learning, SARSA($\lambda$), or expected SARSA($\lambda$). You also need to think about how *exploration* is implemented.
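
As one illustration of the exploration question, here is a minimal sketch of epsilon-greedy selection combined with a tabular TD($0$) update. The names `epsilon_greedy` and `td0_update`, the parameters `epsilon`, `alpha` and `gamma`, and the toy value table are assumptions made for this example; they are not part of the assignment code, and any of the algorithms listed above could be used instead.

```python
import numpy as np

def epsilon_greedy(values, epsilon, rng):
    """Pick an index into `values`: random with probability epsilon, else greedy."""
    values = np.asarray(values, dtype=float)
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))      # exploratory move
    best = np.flatnonzero(values == values.max())  # all greedy candidates
    return int(rng.choice(best))                   # break ties at random

def td0_update(V, s, s_next, reward, alpha=0.1, gamma=1.0, terminal=False):
    """One tabular TD(0) step: V[s] += alpha * (target - V[s])."""
    target = reward if terminal else reward + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V[s]

rng = np.random.default_rng(0)
V = np.zeros(8)  # toy value table indexed by a hypothetical state key
move = epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.0, rng=rng)  # epsilon = 0: purely greedy
td0_update(V, s=3, s_next=4, reward=1.0, terminal=True)       # V[3] moves toward 1.0
```

Whether updates should also be applied after exploratory moves is left open here; that is the topic of Exercise 1.4 below.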

Before you start, let us discuss the following exercises from Sutton and Barto:

  1. (Exercise 1.1): **Self-Play** Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

  2. (Exercise 1.2): **Symmetries** Some Connect-3 positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

  3. (Exercise 1.3): **Greedy Play** Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

  4. (Exercise 1.4): **Learning from Exploration** Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

  5. (Exercise 1.5): **Other Improvements** Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the Connect-3 problem as posed?
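
To make the symmetry question in Exercise 1.2 concrete, the sketch below shows one possible way to give a position and its left-right mirror image the same table key. It assumes the board layout used later in this notebook (the first array axis indexes the game column, and the last entry of each row counts the discs in that column) and repeats the notebook's `zobTable` recipe and `computeHash`, with explicit `int` casts added. The helper `canonical_hash` is hypothetical and is not part of the assignment code.

```python
import numpy as np

np.random.seed(42)  # same recipe as the notebook's zobTable
zobTable = np.random.randint(1, 2**(5*5)-1, size=(5, 5, 3), dtype=np.uint32)

def computeHash(board, n=5, m=5):
    """Zobrist hash of a board, as in the notebook (with explicit int casts)."""
    h = 0
    for i in range(n):
        for j in range(int(board[i, -1])):
            h ^= int(zobTable[i, j, board[i, j]])
    return h

def canonical_hash(board):
    """Hypothetical helper: one key shared by a board and its mirror image."""
    mirrored = board[::-1]  # reverse the column order, disc counters included
    return min(computeHash(board), computeHash(mirrored))

board = np.zeros((5, 6), dtype=np.uint16)
board[0, 0] = 1; board[0, -1] = 1    # player 1 has one disc in column 1
mirror = np.zeros((5, 6), dtype=np.uint16)
mirror[4, 0] = 1; mirror[4, -1] = 1  # the same disc, mirrored into column 5
same_key = canonical_hash(board) == canonical_hash(mirror)
```

Whether mirrored positions should share a value at all is exactly what the exercise asks. Also note that if the submitted table is looked up with the unmodified hash (an assumption about how the competition is run), the entries for both the position and its mirror image would have to be filled in before saving.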

In [5]:
import numpy as np

**Connect-3: a mini version of Connect-4 on a 5x5 board**

The following code implements the game Connect-3 on a 5-by-5 board. We will walk through this code in class. I am open to any improvements you may suggest for making the code faster. The Zobrist hashing table is randomly generated using the seed 42. Do not change how the hashed states are generated, because I aim to let your policy compete against policies generated by other teams. This final exercise will be used as input to our discussion on Sutton and Barto's exercise 1.1.

In [6]:
#Sindri Testing
# two player game (1) versus (2)
getotherplayer = lambda p : 3-p # returns the other player
# the initial empty board; the extra last entry board[:,-1] stores, for each
# column of the game, the index of the next free slot (the next legal drop)
def iState(n = 5, m = 5):
  return np.zeros((n,m+1), dtype=np.uint16)
# perform move for player on board
def Action(board, move, player):
  if board[move,-2] > 0: # top slot is occupied, so this column is full
    raise ValueError(f"illegal move {move} for board\n{np.flipud(board.T)}")
  board[move,board[move,-1]] = player # place the disc on board
  board[move,-1] += 1 # next legal drop
  return board
# determine if terminal board state, assuming last move was made by player
def Terminal(board, player, n = 5, m = 5):
  # now we need to see if we can find any 3 in a row
  # here are all the possible ways of doing so on a 3x3 matrix
  swin = np.array(
          [[True,  True,  True,  False, False, False, False, False, False], # these are all possible winning states
           [False, False, False, True,  True,  True,  False, False, False], # for three in a row
           [False, False, False, False, False, False, True,  True,  True],
           [True,  False, False, False, True,  False, False, False, True],
           [False, False, True,  False, True,  False, True,  False, False],
           [True,  False, False, True,  False, False, True,  False, False],
           [False, True,  False, False, True,  False, False, True,  False],
           [False, False, True,  False, False, True,  False, False, True]]
          )
  for i in range(n-2):
    for j in range(m-2): # scan all 3x3 over the board
      b3x3 = np.ones((8,1)) @ board[i:(i+3),j:(j+3)].reshape(1,9) # extract 3x3 segment 
      if np.any(np.sum((b3x3 == player) & swin, axis = 1) == 3): # check if 3 in a row
        return True
  return False
# Some pretty way of displaying the board in the terminal
def pretty_print(board, n = 5, m = 5, symbols = " XO"):
  for num in range(1, n+1):
    print(" " + str(num) + " ", end = " ")
  print()
  for j in range(m):
    for i in range(n):
      print(" " + symbols[board[i,m-1-j]] + " ", end = " ")
    print("")
# let's simulate a single game using pure random play, i.e. demonstrate an episode!
def connect3():
  S = iState() # initial board state
  p = 1 # first player to move (other player is 2)
  while True:
    legal = np.where(S[:,-2]==0)[0] # columns that are not yet full
    if 0 == len(legal): # check if a legal move is possible, else bail out
      return 0, S # it's a draw, return 0 and board
    a = int(np.random.choice(legal)) # pure random policy (also for the first move)
    S = Action(S,a,p) # take action a and update the board state
    if Terminal(S,p):
      return p, S # return the winning player and board
    p = getotherplayer(p) # other player's turn

# run demo for random play policy:
winner, board = connect3()
symbols = " XO"
print(" winner is '", symbols[winner],"' final board is:\n")
pretty_print(board)

 winner is ' O ' final board is:

 1   2   3   4   5  
     X       X      
     O   O   X   O  
     X   X   O   O  
 X   O   O   X   X  
 O   X   X   O   O  
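
Since suggestions for faster code are welcome, here is one possible alternative to the 3x3 window scan in `Terminal`. It is a sketch that compares shifted slices of the whole board at once instead of looping over windows. The name `has_three` is hypothetical, and whether it is actually faster on a 5x5 board should be checked by timing both versions (for example with `%timeit`).

```python
import numpy as np

def has_three(board, player, n=5, m=5):
    """Hypothetical alternative to Terminal: True if `player` has 3 in a row."""
    g = board[:n, :m] == player                   # boolean grid, counter column dropped
    lines = (
        g[:-2, :] & g[1:-1, :] & g[2:, :],        # three adjacent along the first axis
        g[:, :-2] & g[:, 1:-1] & g[:, 2:],        # three adjacent along the second axis
        g[:-2, :-2] & g[1:-1, 1:-1] & g[2:, 2:],  # one diagonal direction
        g[:-2, 2:] & g[1:-1, 1:-1] & g[2:, :-2],  # the other diagonal direction
    )
    return any(bool(line.any()) for line in lines)

board = np.zeros((5, 6), dtype=np.uint16)
for col in (1, 2, 3):   # player 1 drops one disc in each of three adjacent columns
    board[col, 0] = 1
    board[col, -1] = 1
won = has_three(board, 1)
```

Like `Terminal`, this checks the whole board. Because only the last move can complete a line, restricting the check to the neighbourhood of that move would be another option.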


**Constructing value and action value tables using [Zobrist hashing](https://en.wikipedia.org/wiki/Zobrist_hashing):**

Please do not modify how the zobTable is generated.


In [7]:
#Sindri Testing 2
# let's all use the same zobTable, so we set the random seed
np.random.seed(42)
zobTable = np.random.randint(1,2**(5*5)-1, size=(5,5,3), dtype = np.uint32)
# compute index from current board state
def computeHash(board, n = 5, m = 5):
  h = 0
  for i in range(n):
    for j in range(board[i,-1]):
      h ^= zobTable[i,j,board[i,j]]
  return h

In [8]:
#Sindri Testing 3
# playing around with these hashed values
maxhashValue = 2**(5*5)
# Initialize Q(s,a) function
Q = np.zeros((maxhashValue,5))
# Access a particular (Q(s,a))
a = 0 # any legal move in {0,1,2,3,4}
player = 1 # now you need to think which player owns these Q values...
Q[computeHash(board),a]
# An afterstate value function (more efficient implementation)
V = np.zeros(maxhashValue)
# Access a particular (V(s_a)), here you need to do a one-step lookahead (afterstate)
hash_key = computeHash(board)
lookaheadBoard = board.copy()
lookaheadBoard = Action(lookaheadBoard,a,player)
lookahead_hash_key = computeHash(lookaheadBoard)
print("(from lookaheadBoard lookahead_hash_key = ", lookahead_hash_key, "hash_key = ", hash_key)
V[lookahead_hash_key]
# Now check this out, the novelty of Zobrist hashing
(i,j) = (a,board[a,-1]) # where we would like to add our player
lookahead_hash_key = hash_key
lookahead_hash_key ^= zobTable[i,j,player] # create move without updating the board!
print("lookahead_hash_key = ", lookahead_hash_key, "hash_key = ", hash_key)
lookahead_hash_key ^= zobTable[i,j,player] # undo the move without updating the board!
print("(undo)lookahead_hash_key = ", lookahead_hash_key, "hash_key = ", hash_key)
# Save your policy, PI to file, see folder icon on left hand side to download!
np.save("teymi5", V)
!ls

(from lookaheadBoard lookahead_hash_key =  18267665 hash_key =  31298052
lookahead_hash_key =  18267665 hash_key =  31298052
(undo)lookahead_hash_key =  31298052 hash_key =  31298052
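
Building on the XOR trick shown above, the following sketch picks the legal move with the highest afterstate value without copying or modifying the board. The helper `greedy_afterstate_move` is hypothetical, and a `defaultdict` stands in for the large `V` array so that the example stays light; with the notebook's `V = np.zeros(maxhashValue)` the indexing would look the same. The `zobTable` recipe is repeated so that the block runs on its own.

```python
from collections import defaultdict
import numpy as np

np.random.seed(42)  # same recipe as the notebook's zobTable
zobTable = np.random.randint(1, 2**(5*5)-1, size=(5, 5, 3), dtype=np.uint32)

def greedy_afterstate_move(board, hash_key, player, V):
    """Hypothetical helper: best legal column by afterstate value, plus its hash key."""
    best_move, best_key, best_value = None, None, -np.inf
    for col in np.flatnonzero(board[:, -2] == 0):  # columns that are not full
        row = int(board[col, -1])                  # next free slot in this column
        key = int(hash_key) ^ int(zobTable[col, row, player])  # hash after the drop
        if V[key] > best_value:
            best_move, best_key, best_value = int(col), key, V[key]
    return best_move, best_key                     # (None, None) if the board is full

board = np.zeros((5, 6), dtype=np.uint16)  # empty board, whose hash key is 0
V = defaultdict(float)                     # stand-in for the afterstate value table
V[int(zobTable[2, 0, 1])] = 1.0            # pretend that dropping in column 3 looks best
move, key = greedy_afterstate_move(board, 0, 1, V)
```

This is the purely greedy choice; an exploration scheme would have to be layered on top of it, and the question of which player owns the values still applies.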
