# Assignment 4 Voluntary part: Deep Reinforcement learning

(MMS131, 2023)

Contributors: Jonatan Wårdh, Mats Granath, and Oleksandr Balabanov

### Questions or topics that should be addressed as part of the assignment are marked Q.

## Introduction 

_This assignment is a continuation of the mandatory Assignment 4, but we will now use a neural network to store the $Q$-matrix. This assignment requires that you learn how to build a basic convolutional neural network using TensorFlow. There are many sources for this on the web, so it should not be very difficult. Apart from that, it may take some experimenting with the parameters and your network to get the training to work._

In the previous assignment we found a $Q$-matrix that described the value of taking the action up, down, left or right given the state we were in. The state was simply given by the position of the player, who lived on a $20\times 30$ grid corresponding to $600$ possible states: the state space. The $Q$-matrix was quite small ($600\times 4$), and we had no problem storing all the values and going through them many times, updating until the values converged. However, such a small state space is not expected in general.

In this assignment we will add a little twist to the game, which makes the size of the state space explode. What we will do is let the fire spread with a certain probability after every turn of the player. For the sake of relative simplicity we will consider a $10\times 10$ grid. The player has $100$ possible positions, but each grid cell can either be on fire or not, meaning that we have $100 \cdot 2^{100} \approx 10^{32}$ possible states. This is a huge number, and there is no way of even storing this number of $Q$-matrix entries, let alone going through and updating all these values repeatedly. Of course, in principle the problem sounds quite simple. There should be no need to tune $10^{32}$ degrees of freedom to learn the simple task of just avoiding the fire, but exactly how to capture that intuition in a self-learning mathematical framework is less clear. (An alternative could be to use a rule-based solution, but here we want the agent to learn the rules itself.) This is where the neural network comes in. As we have seen before in the course, a neural network can learn general features of data. For the present problem it amounts to taking the state as an input and giving the values of the different actions as an output, i.e. we use the network to represent the $Q$-matrix. As you will see, you will actually only need a very small number of parameters (compared to $10^{32}$) in your network to solve this problem.
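
The state count quoted above can be checked with a few lines of arithmetic. This is only an illustration of the numbers in the text; the variable names are made up for this sketch.

```python
# Count the states of the 10x10 grid world with spreading fire.
grid_cells = 10 * 10                    # possible player positions
fire_configurations = 2 ** grid_cells   # every cell is either burning or not
n_states = grid_cells * fire_configurations

print(f"{n_states:.3e}")                # 1.268e+32

# A tabular Q-function would need one entry per state-action pair.
n_actions = 4
print(f"{n_states * n_actions:.3e}")    # 5.071e+32
```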

The groundbreaking paper that popularized deep Q-learning: [Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236)


### Training

What we need to alter from the previous assignment is to implement the network as the $Q$-function. The update of the $Q$-function, $Q(s,a)\leftarrow(1-\alpha)Q(s,a)+\alpha(r+\gamma\max_{a'}Q(s',a'))$, must then be replaced with the learning step of the network. This update meant that $r+\gamma\max_{a'}Q(s',a')$ was the new estimate of $Q(s,a)$; using the neural network, we therefore train with $t=r+\gamma\max_{a'}Q(s',a')$ as the target for $Q(s,a)$. In practice, the network will have the state $s$ as input and the values of the 4 movement actions as output, with $t$ the training target for the action that was taken.
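
As a minimal sketch of how such a training target could be assembled, the hypothetical helper `training_target` below replaces only the entry of the action that was taken and leaves the other three outputs at their predicted values. The function name and the example numbers are assumptions made for illustration; for a finished game the target is taken to be just the reward, as in the `replay` method of the provided code.

```python
import numpy as np

def training_target(q_values, action, reward, next_q_values, gamma, done):
    """Return a copy of q_values where the taken action holds the target t."""
    target = np.array(q_values, dtype=float)        # copy, input is left untouched
    t = reward                                      # terminal move: no future reward
    if not done:
        t = reward + gamma * np.max(next_q_values)  # t = r + gamma * max_a' Q(s', a')
    target[action] = t
    return target

q_s = [0.5, -1.0, 2.0, 0.0]      # made-up network output for state s
q_next = [1.0, 3.0, -2.0, 0.0]   # made-up network output for state s'
target = training_target(q_s, action=1, reward=-1, next_q_values=q_next,
                         gamma=0.9, done=False)
print(target)  # only the entry for action 1 changes, to -1 + 0.9 * 3.0
```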

So far everything seems like a fairly straightforward generalization of the previous assignment. However, training the neural network for this purpose is quite a tricky business. We will review a few problems and cures below.


###  Catastrophic forgetting and experience replay

The first problem one might encounter when using the network instead of simply storing all the values of the $Q$-matrix is that of catastrophic forgetting. This is most easily explained with an example. Let's assume that the player has the goal on its left-hand side and the cliff on its right-hand side. It now takes a step to the left and receives a reward for this; the network is then trained on this situation and learns to correlate some feature of this state with the action of taking a step to the left. In the next game it might happen that the player ends up with the goal to the right and the cliff to the left. Having been trained on the previous move, the network might see some common features of these two states and decide to make a move to the left, because that was what it learned last time. However, this time the move results in a negative reward, and if we now train on this event it is likely that the network erases what it previously learned. In this scenario we might go back and forth between these two events and not learn anything.

Note that this could not happen if we had the complete $Q$-matrix entries for every single state, simply because the experiences are disjoint; updating one element of the $Q$-matrix will not affect any other value. However, for the network, training on one state will affect the output for another state. So the very property that makes networks good for treating a large state space makes them sensitive in this regard.
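
The difference between the two cases can be seen in a toy setting. The sketch below is not the assignment's network: it assumes a linear model with made-up feature vectors, just to show that one gradient step on one state also moves the prediction for another state that shares a feature, whereas a table entry changes in isolation.

```python
import numpy as np

# Tabular case: one independent entry per state.
q_table = np.zeros(2)
q_table[0] += 0.5 * (1.0 - q_table[0])   # move state 0 towards the target 1.0
# q_table[1] is untouched by this update.

# Function approximation: both states are evaluated with the same weights w.
features = np.array([[1.0, 1.0, 0.0],    # state 0
                     [0.0, 1.0, 1.0]])   # state 1 shares the middle feature
w = np.zeros(3)
learning_rate = 0.5
error = 1.0 - features[0] @ w            # target 1.0 for state 0
w += learning_rate * error * features[0] # one gradient step on 0.5 * error**2

predictions = features @ w
print(q_table)       # only entry 0 changed
print(predictions)   # the prediction for state 1 changed as well
```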

In order to solve this we need to make the network learn to tell the qualitative difference between the two states and actions that it apparently thought looked quite similar. To do this we need to train on both experiences simultaneously, or at least repeatedly. In practice this is done by setting up a memory in which we store a certain number of recent experiences, possibly going back quite far. When we come to the training we draw a random sample from this memory and train on it. This is called _experience replay_.
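
A bounded memory with random sampling can be sketched with the standard library alone. The experience tuples and sizes below are placeholders chosen for illustration; the provided `GridWorld` class stores richer tuples in its own `memory`.

```python
import random
from collections import deque

memory = deque(maxlen=5)   # keeps at most the 5 most recent experiences
for step in range(8):
    # placeholder experience: (state, action, reward, next_state, done)
    experience = (f"s{step}", step % 4, -1, f"s{step + 1}", False)
    memory.append(experience)   # the oldest entry is dropped when full

batch_size = 3
if len(memory) < batch_size:
    minibatch = list(memory)                             # not enough data yet: use everything
else:
    minibatch = random.sample(list(memory), batch_size)  # random draw without replacement

print(len(memory), len(minibatch))
```

Sampling at random breaks the ordering of consecutive moves, so a single training batch mixes situations from different games.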

### Policy and target networks

Another issue is that the network may become unstable and the training might start to diverge. This problem comes partly from the fact that we use the network in order to predict future rewards, rewards that in the beginning are completely random. This means that the network will learn from its own predictions, which can lead to a runaway situation. A cure for this is to use two networks: a _policy_ network and a _target_ network. These networks should have exactly the same architecture; they are just updated at a different pace. The policy network determines the action of the player and is the network which is trained in the training step. The target network determines the target of the training, that is, it predicts $\max_{a'}Q(s',a')$, and is updated less frequently. The target network is updated by copying the weights of the policy network and assigning them to the target network, so the target network is never trained; it is just a delayed copy of the policy network. By delaying the feedback to the target values in the training step, the instability might be avoided.
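
The update schedule of the two networks can be illustrated without any deep learning library. In the sketch below, plain NumPy arrays stand in for the network weights, the "training step" is a dummy increment, and `sync_period` is a made-up value; in the notebook the copy is done with `get_weights` and `set_weights`.

```python
import numpy as np

rng = np.random.default_rng(0)
policy_weights = [rng.normal(size=(3, 4)), np.zeros(4)]   # stand-in for a network
target_weights = [w.copy() for w in policy_weights]       # start as an exact copy

sync_period = 5   # hypothetical: copy policy -> target every 5 games
n_syncs = 0
for game in range(1, 13):
    # Dummy "training": only the policy weights change.
    policy_weights = [w + 0.01 for w in policy_weights]
    if game % sync_period == 0:
        target_weights = [w.copy() for w in policy_weights]
        n_syncs += 1

# After 12 games the target was last synchronized at game 10,
# so it lags two training steps behind the policy.
lag = policy_weights[1] - target_weights[1]
print(n_syncs, lag)
```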




# Assignment

You will be provided with code that implements the grid world game. Your task is to make a suitable network and find suitable parameters. There is not a lot of coding required. Below follows a brief description of the code.


## Code:

The code defines a class GridWorld. Classes are a standard tool of object-oriented programming languages such as Python. If you want to read up on classes, check [w3schools](https://www.w3schools.com/python/python_classes.asp) for a short version and [The Python tutorial](https://docs.python.org/2/tutorial/classes.html) for a more extensive one, but it should not be necessary in order to solve the assignment. The main point is that you work with objects (or instances) of that class. You do this by calling the constructor to get an object of the class, e.g. below we write <code>world = GridWorld()</code> to get the object <code>world</code> of the class. Now you can access any function or variable in the class by <code>world.variable</code> or <code>world.function(parameters)</code>.
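
For readers new to classes, the pattern looks as follows on a toy example. The class `Walker` is hypothetical and not part of the assignment code; it only mirrors how `GridWorld` is constructed and used.

```python
class Walker:
    """Toy class used only to illustrate constructor, variables and methods."""

    def __init__(self):
        # Called automatically by Walker(); self refers to the new object.
        self.position = 0

    def step(self, distance):
        # self is filled in by Python, the caller only passes distance.
        self.position += distance
        return self.position

walker = Walker()        # create an object (instance) of the class
walker.step(3)           # call a function of the object
print(walker.position)   # read a variable of the object: 3
```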


### State representation

To represent the state we have used an array consisting of three $10\times 10$ grid layers. The first layer specifies the position of the player, with a 1 at the position of the player and 0 elsewhere; if the player has walked outside the grid, all elements are zero. The second layer specifies the position of the fire, with 1 where there is a fire and 0 otherwise. The third layer represents the goal, which stays fixed at the same position. That is, <code>state[:,:,0]</code> gives the grid describing the position of the player, <code>state[:,:,1]</code> the position of the fire and <code>state[:,:,2]</code> the position of the goal.
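
The layered representation can be reproduced by hand in a few lines. The sketch below fills in the default start, fire and goal positions used by `make_state` in the provided code and shows how the layers are read back; the variable names are chosen for this illustration only.

```python
import numpy as np

state = np.zeros((10, 10, 3))
state[1, 8, 0] = 1        # layer 0: player at the default start position (1, 8)
state[1:3, 1:3, 1] = 1    # layer 1: initial fire on the cells with 1 <= x, y <= 2
state[8, 1, 2] = 1        # layer 2: goal at (8, 1)

player_xy = np.argwhere(state[:, :, 0] == 1)   # coordinates of the player
n_burning = int(state[:, :, 1].sum())          # number of burning cells

# The network expects a batch dimension in front: (batch, x, y, layers).
network_input = state.reshape(1, 10, 10, 3)
print(player_xy, n_burning, network_input.shape)
```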

## Hints


Below are some hints. _Read these before you start, and come back to them if your code runs but the training doesn't work._ Note that the parameter values stated are not necessarily the best values; they are just there to help you search in an appropriate parameter range.

$\bullet$ Your network should not need more than 100000 trainable parameters (we get a good agent using 15000 parameters). Remember, the output should be four real numbers corresponding to the four action values. (This is not a classification task, for which softmax output and cross-entropy loss are standard.) We have provided an appropriate activation function for the last layer and a corresponding loss function. Be careful if you change these.

$\bullet$ To test the minimal requirements on your network you might want to try to disable the fire spread and perhaps also set <code>gamma = 0</code>; then the network should learn the rewards of every square in the grid.

$\bullet$ The output diagnostic of q_max and q_min gives an indication of whether the network output is reasonable or not. The Q-values should lie within the range of possible returns.

$\bullet$ Beware of overfitting. If you train the network too much on a set of experiences it may be difficult to divert it to new data. Keeping the training batch a not too large fraction of the replay memory may help. Adding regularizers (<code>kernel_regularizer</code>) may also help. You can also change the number of epochs in the training module <code>replay</code>.

$\bullet$ Make sure that you use a big enough experience replay buffer. We need experience from quite many games in the past, on the order of 100. How many moves does the player make in each game? Considering this, how big should your memory be? The <code>batchSize</code> is the number of experiences that are averaged over in each training instance of the network. The larger it is, the more stable the training, but it also reduces the stochasticity, which is important for good training. (Trial and error may be needed to find a good size.)

$\bullet$ In general <code>gamma</code> needs to be quite large in order for the player to see the goal from far away in the grid. If your training becomes unstable for large <code>gamma</code> you should consider having a bigger memory. You can also try to make the synchronization between policy and target networks less frequent. 

$\bullet$ You will probably need to use a decay of <code>epsilon</code>, so that the player starts out walking randomly but listens more and more to the network. However, make sure that you do not quench <code>epsilon</code> too fast. A good idea might be to study how the player progresses as <code>epsilon</code> is lowered. If the player never finds the goal once you have reduced <code>epsilon</code> significantly, you have probably reduced it too fast.

$\bullet$ You can consider whether to start at a random position or at a fixed position, <code>random_start</code>, depending on how far your training has progressed.

$\bullet$ It is a good idea to interrupt the training and assess how the learning progresses by studying the state value function. You should start seeing that the state values near the goal become higher.

$\bullet$ You will need to train for a few thousand games. However, you should be able to see progress after 1000 games.
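
To get a feeling for the hint on the decay of <code>epsilon</code>, one can estimate how many multiplications by <code>epsilon_decay</code> it takes to reach <code>epsilon_min</code>. The helper below and the values 1.0, 0.1 and 0.99 are illustrative assumptions, not recommended settings. Note that in the main loop of this notebook the decay is applied once per training instance, i.e. every 10 games.

```python
import math

def decays_until_minimum(epsilon_start, epsilon_min, decay):
    """Number of multiplications by decay until epsilon <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(decay))

n_decays = decays_until_minimum(epsilon_start=1.0, epsilon_min=0.1, decay=0.99)
games_between_decays = 10   # as in the main loop: training every 10 games
print(n_decays, n_decays * games_between_decays)
```

Comparing the resulting number of games with the planned length of the training run indicates whether <code>epsilon</code> is being lowered too quickly or too slowly.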




# Importing libraries and defining GridWorld

In [1]:
# Standard libraries
import numpy as np
import random

# Libraries for plotting
import matplotlib  
import matplotlib.pyplot as plt 
from mpl_toolkits import mplot3d
from IPython.display import display, clear_output

# For safe copy of variables
import copy

# Nice way of building a memory, a list with maximum size
from collections import deque
import itertools

# Import TensorFlow
import tensorflow as tf
from tensorflow import keras

#Import the Keras layers etc
from tensorflow.keras.models import clone_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D, MaxPooling2D, Dropout,BatchNormalization
from tensorflow.keras.callbacks import Callback
from tensorflow.keras import regularizers
from tensorflow.keras.models import load_model


In [2]:
# construct Gridworld class
class GridWorld:
     
    ##============ CONSTRUCTOR =============
    # This creates the instance of the class and is called by GridWorld().
    # The first argument in defining any function refers 
    # to the object of the class currently handled, often called self, but it could be anything.
    # Note that when calling the functions this first argument is left out, i.e. you only write GridWorld(), 
    # the first argument is automatically fed by python.
    def __init__(self):
        # the size of the grid
        self.size = np.array([10,10])
        # number of layers in state 
        self.layers = 3
        
        # Default starting position and goal
        self.start = np.array([1,8])
        self.goalpos = np.array([8,1])      
        
        # rewards, gravel refers to an ordinary step 
        self.cliff = -100
        self.fire = -50
        self.goal = 100
        self.gravel = -1
                
        # Default values for network
        self.gamma = 0       
        # Probability of wind
        self.wind = 0
        #probability for fire to spread
        self.prob_spread = 0
        
        # Default values for epsilon greedy, not optimal, updated further down!
        self.epsilon = 1
        self.epsilon_decay = 0.99999999999999999999999999
        self.epsilon_min = 0.1
        
        # Memory, default values, not optimal, updated further down!
        self.memory_size = 1 
        self.memory = deque(maxlen=self.memory_size)
        self.batchSize = 1 
        

        

##============ CREATE STATES =============
    # Constructs the state, random_placement is either True or False. If True the player is placed 
    # randomly, if False the player is initialized in the starting position
    
    def make_state(self,random_placement):
        
        if random_placement:
            r_x = np.random.randint(self.size[0]) 
            r_y = np.random.randint(self.size[1])
            # if random = goal keep generating values 
            while r_x == self.goalpos[0] and r_y == self.goalpos[1]:
                r_x = np.random.randint(self.size[0]) 
                r_y = np.random.randint(self.size[1])
        else :
            r_x = self.start[0]
            r_y = self.start[1]
        
        # Initialize all values in all layers to zero
        state = np.zeros((self.size[0],self.size[1],self.layers))
        # we will use player_coordinate to keep track of the position of the player
        player_coordinate = [0,0]

        # Go through all layers and put 1 at the correct position
        for x in range(self.size[0]) : 
            for y in range(self.size[1]) :
            
                # Player, first layer
                if x == r_x and y == r_y :
                    state[x,y,0] = 1
                    player_coordinate[0] = x
                    player_coordinate[1] = y
                else : 
                    state[x,y,0] = 0

                       
                # Fire, second layer
                if (1<= x <=2) and (1<= y <= 2):
                    state[x,y,1] = 1    
                else :
                    state[x,y,1] = 0 
            
                #Goal, third layer
                if x == self.goalpos[0] and y == self.goalpos[1] :
                    state[x,y,2] = 1
                else : 
                    state[x,y,2] = 0                
    
        # return state and player_coordinate
        return state , player_coordinate 
    
    
##============ MAKING MOVES IN GRIDWORLD ================
 
    # This function returns the new state, player position, reward of the move and a variable done
    # which tells us if the game is done or not. The arguments are the current state, action, player_coordinate
    # and is_wind which takes values True or False and determines if wind should be implemented.
    
    def make_move(self,state,action,player_coordinate,is_wind):
        # Use deepcopy to make a copy of the state, otherwise this would just be a pointer to the same
        # object as state. This is an inconvenience with python... 
        next_state = copy.deepcopy(state)
                
        if is_wind :
            if np.random.rand() < self.wind:
                # overwrite action with random action internally
                action = np.random.randint(4)
                
        new_x = player_coordinate[0]
        new_y = player_coordinate[1]
        
        # Assume that the player goes out of the board ,set old position to zero
        # and new coordinate to none
        next_state[new_x,new_y,0] = 0
        next_player_coordinate = None
        done = True
        reward = self.cliff
        
        # make move 
        if action < 2 : # up or down
            if action == 0: # up
                new_y = new_y + 1
            else : # down
                new_y = new_y - 1 
        else : # left or right
            if action == 2: # left 
                new_x = new_x - 1
            else : # right
                new_x = new_x + 1   

        # If inside grid 
        if 0<= new_x < self.size[0] and 0<= new_y < self.size[1] :
            # if it hits the goal
            if state[new_x,new_y,2] == 1:
                done = True
                reward = self.goal
                next_state[new_x,new_y,0] = 1
                next_player_coordinate = [new_x,new_y]
            # fire # WHAT IF THE GOAL BURNS?
            elif state[new_x,new_y,1] == 1 :
                done = False
                reward = self.fire
                next_state[new_x,new_y,0] = 1
                next_player_coordinate = [new_x,new_y]
            # gravel    
            else : 
                done = False 
                reward = self.gravel
                next_state[new_x,new_y,0] = 1
                next_player_coordinate = [new_x,new_y]
            
        # else, do nothing, next_player coordinate remains None and next_state[:,:,0] remains all zeros
            
        return next_state, next_player_coordinate , reward , done  

    


    
#============== SPREADING OF FIRE =====================

    # This function takes state and returns, new_state in which the fire has spread. There is 
    # a possibility that the fire did not spread, therefore it gives did_fire_spread which is True if the 
    # fire actually did spread, otherwise it is False.
    
    def let_fire_spread(self,state):
        new_state = copy.deepcopy(state)
        
        # Assume fire did not spread
        did_fire_spread = False
    
        for x in range(self.size[0]):
            for y in range(self.size[1]):
                # Walk through the fire-grid, if encountering fire, see if it spreads
                if state[x,y,1] == 1 : 
                    # with the probability self.prob_spread the fire spreads 
                    if np.random.rand() < self.prob_spread :
                        # a random move , no diagonal moves are allowed
                        if np.random.rand() < 0.5 :
                            x_step = np.random.randint(2)*2-1
                            y_step = 0
                        else :
                            x_step = 0
                            y_step = np.random.randint(2)*2-1
                        # if within boundaries
                        if (0<= x+x_step < self.size[0]) and (0<= y+y_step < self.size[1]):
                            # Check that this square is not already on fire
                            if new_state[x+x_step,y+y_step,1] == 0: 
                                # LET IT BURN!
                                new_state[x+x_step,y+y_step,1] = 1
                                did_fire_spread = True
                                
        
        return new_state , did_fire_spread
    
    
    
##============ TRAINING AND EXPERIENCE REPLAY ============= 
  
    # Replay implements the training and experience replay. Here we want to show two different implementations,
    # one of which is currently commented away in the main loop. In the commented version, version 2, we use the
    # target network to predict new targets for old states. In version 1, the currently used version,
    # we recall the old predictions of the targets. We find that version 1 is more stable
    # than version 2 for this problem. But you can try out both versions.
    def replay(self, policy_model):

        # check if the memory is bigger than the batch size
        if len(self.memory) < self.batchSize :
            # if not recall whole memory
            minibatch = self.memory
        else :
            # otherwise take a random batch of size batchSize of the memory
            minibatch = random.sample(self.memory, self.batchSize)
   
        # initialize a state and a target batch for training
        state_batch = np.zeros((len(minibatch),self.size[0],self.size[1],self.layers))
        target_batch = np.zeros((len(minibatch),4))

        # Go through memory 
        i = 0
        for (state, q_state, action, reward, next_state, next_q_max, done) in minibatch :
            # Version 1:
            #target values for network are the same as the output of the network at the time the experience was made
            target = q_state.reshape(4).copy()  # copy, so the experience stored in memory is not modified in place
            #except for the action that was actually taken where we have a reward that we can use
            new_target = reward         
            if not done :
                # Version 1:
                new_target = reward + self.gamma * next_q_max
                
            target[action] = new_target
            # Put state and target in the training batch
            state_batch[i] = state
            target_batch[i] = target
            i = i + 1
                        
        # Train, the number of epochs for training can be changed 
        policy_model.fit(state_batch, target_batch, batch_size = len(minibatch), epochs=5, verbose=2)
            
 
            
            
##====================FOR DISPLAY==============================    

    # Used to display the grid
    def make_RGB_grid(self,state,path):
        grid_RGB = np.ones((self.size[0],self.size[1],3))*0.7 #
        
        if path is not None :
            for i,location in enumerate(path):
                grid_RGB[location[0],location[1],:] = np.array([0,0,0]) # Black: path walked by the player
    
        for x in range(self.size[0]) : 
            for y in range(self.size[1]) :
            
                if state[x,y,2]==1:
                    grid_RGB[x,y,:] = np.array([245/255,237/255,48/255]) # Yellow
                
                if state[x,y,1]==1:
                    grid_RGB[x,y,:] = np.array([203/255,32/255,40/255]) # Red: fire
   
                if state[x,y,0]==1:
                    grid_RGB[x,y,:] = np.array([0/255,254/255,0/255]) # Green: player
   
        return grid_RGB

# Define the network

Here you need to set up and compile the Q-network in TensorFlow and Keras. We have given a skeleton of the network. Fill in and change as you like. You may want to add regularization, batchnorm layers etc. The input should have the shape of the state array, and the output should be the 4 action values corresponding to the 4 allowed actions of the agent. The activation in the final layer and the loss function used are important to get right, and these are already given.
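
When designing the network it helps to keep track of the number of trainable parameters by hand and compare with `model.summary()`. The helper functions below are written for this illustration and use the standard counting rules for convolutional and dense layers. The example connects the given first layer directly to the output layer, which is only meant to show the arithmetic and is not a suggested architecture.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter of a 2D convolution layer."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit of a fully connected layer."""
    return n_inputs * n_units + n_units

# First layer of the skeleton: 32 filters, 3x3 kernel, 3 input layers.
first_conv = conv2d_params(3, 3, 3, 32)

# With padding='valid' a 3x3 kernel shrinks the 10x10 grid to 8x8,
# so flattening the output gives 8 * 8 * 32 values.
flattened = 8 * 8 * 32
output_layer = dense_params(flattened, 4)

print(first_conv, flattened, output_layer, first_conv + output_layer)
```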

In [None]:
# define the network

def setup_network(world) : 
    # setup network, world.size and world.layers give the input_shape
    opt=tf.losses.MeanSquaredError()  #loss function 
    model=Sequential()
    model.add(Conv2D(32, kernel_size=(3,3),padding='valid', activation='relu', input_shape=(world.size[0],world.size[1],world.layers)))
    
    
    ...
    
    # to transition from convolutional to dense layer, you need to Flatten, i.e. transform matrix to 1D array
    
    model.add(Dense(4, activation='linear'))
              
    # compile network          
    model.compile(loss=opt, optimizer='adam' ,metrics=['mse'])
    
    # return model 
    return model
    

### Q. 4.2.1 Describe briefly the layers in your network and their functionality (1p)
What are the input and output dimensions in each layer, etc.  

# Setup network and GridWorld

In [None]:
# Setup GridWorld and Q-networks

# Create world from GridWorld
world = GridWorld()
# Setup network
policy_model = setup_network(world)
#make a target network as well
target_model = setup_network(world)
# copy weights from policy to target
target_model.set_weights(policy_model.get_weights())

# Plot model summary
policy_model.summary()

#Make state 
state , player_coordinate = world.make_state(False)

# plot it 
grid_RGB =world.make_RGB_grid(state,None)
#
fig=plt.figure(figsize=(10, 10), dpi= 80, facecolor='w', edgecolor='k')
# We have to invert the x and y axis , go over to numpy array instead
plt.imshow(np.swapaxes(np.array(grid_RGB),0,1))
#plt.axis('on')
plt.gca().invert_yaxis()
plt.xticks(np.arange(0, world.size[0], dtype=int))
plt.yticks(np.arange(0, world.size[1], dtype=int))
plt.show()


# Train network on model 

HERE YOU NEED TO DEFINE PARAMETER VALUES

In [None]:
# Setup system parameters

world.gamma = ?
world.epsilon_decay = ? #multiplicative factor applied to epsilon each time the network is trained (every 10 games in the main loop below); for no reduction use 1
world.epsilon = ?  #initial value of epsilon 
world.wind = 0.1
# fire spreading
world.prob_spread = 0.5 # 

#update the target network every update_target_network_period games. Updating the target network less often should make
#the system more stable, but also make convergence slower
update_target_network_period = ?

#define size of experience replay buffer (how many moves are stored for training) 
#and batchsize (how many moves from memory buffer are used in each training instance)
world.memory_size = ? 
        
world.batchSize = ? 

world.memory = deque(maxlen=world.memory_size)   #The experience replay memory


Diagnostics. These are used to store the maximum and minimum Q-values output by the network for the states visited since the last time the target network was updated. If the training has converged, these should correspond quite well to the maximal and minimal rewards available in the game.
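
To see which Q-values are plausible, one can compute the discounted return of an idealized game in which the player walks straight to the goal without touching fire. The function below is a sketch written for this purpose, using the default rewards of the provided `GridWorld` class (-1 per ordinary step, 100 for the goal); the values of `gamma` are arbitrary examples.

```python
def return_to_goal(steps, gamma, step_reward=-1, goal_reward=100):
    """Discounted return if the goal is reached on move number `steps`."""
    total = 0.0
    for k in range(steps - 1):
        total += gamma ** k * step_reward           # ordinary moves before the goal
    total += gamma ** (steps - 1) * goal_reward     # the final move hits the goal
    return total

for gamma in (0.5, 0.95):
    print(gamma, [round(return_to_goal(n, gamma), 2) for n in (1, 5, 10)])
```

One move away from the goal the return equals the goal reward, which is why q_max should approach 100. For a small `gamma` the goal reward is discounted away after a few moves, so from far away the return is dominated by the step penalties.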

In [None]:
#For training diagnostics.  
q_max = 0
q_min = 0


The main loop. It should be all set to run if you have defined the network and parameters above. 

In [None]:
# MAIN LOOP with network

step_count = 0
random_start = True  #Easier to train from random start
is_wind = False  #Set to false to simplify training 
next_player_coordinate = None
nr_games = 5000  #This is large, you will probably not need as many. Initially make shorter runs to check progress
# loop over games
for games in range(nr_games):
    
    # Display
    print("Game #: %s" % (games,))
    print("Epsilon : %7.4f" % world.epsilon)     
    print("Step count : %s" % step_count) 
    print("End pos %s" % next_player_coordinate)
    
    # DIAGNOSTICS 
    print("Since updated target: Qmin  %s Qmax %s" % (q_min,q_max))
    
        
    # reinitialize the grid at the start of every game (random or fixed start position)
    state , player_coordinate = world.make_state(random_start)
    
    step_count = 0
    while True :
        step_count+=1
        
        # use policy network to get q
        q_state = policy_model.predict(state.reshape(1,world.size[0],world.size[1],world.layers))        
        # get best action
        action = np.argmax(q_state)   
        
        # epsilon greedy
        if np.random.rand() < world.epsilon :
            # take another action
            action=np.random.randint(4)
            
        # make the move
        next_state ,next_player_coordinate, reward , done = world.make_move(state,action,player_coordinate,is_wind)
                  
        # find max q of the next state using target network
        next_q_max = np.amax(target_model.predict(next_state.reshape(1,world.size[0],world.size[1],world.layers)))
        
        # Store in memory
        world.memory.extend([(state, q_state, action, reward, next_state, next_q_max, done)])  
        
        # DIAGNOSTICS UPDATE   =============
        if q_max < np.amax(q_state):
            q_max = np.amax(q_state)
        if q_min > np.amin(q_state):
            q_min = np.amin(q_state)
            
        
        #=============================================
        
          
        # break if done or too many steps taken 
        if done or (step_count > 400): # 10^2 =100 steps to diffuse through the lattice
            break
        
        # update state
        state = next_state 
        player_coordinate = next_player_coordinate 

        # let fire spread 
        new_state, fire_spread = world.let_fire_spread(state)
        if fire_spread :
            state = new_state 
        
    
    # end of game, train the network every 10 games. Can be changed. 

    if (games % 10 == 0 and games>0 ):
        
        world.replay(policy_model)
        
        # epsilon decay
        if world.epsilon > world.epsilon_min:
            world.epsilon *= world.epsilon_decay 
            
    # update target network 
    if (games % update_target_network_period == 0) :
        print("Update target network")
        # update the weights of the target model
        target_model.set_weights(policy_model.get_weights())
        #reset diagnostic
        q_max = 0
        q_min = 0
    
    clear_output(wait=True)    
    # end of loop
    


You can interrupt the kernel at any time after a number of games and check the progress using the two plot functions below. Restarting the loop will reset the game counter but not the network.

### Q. 4.2.2 The main task is to get the training to work and your AI agent to perform. (4p)
A good indication is that the q_max and q_min correspond roughly to the goal and cliff rewards. 

Design and train a Q-network that solves the game with a fire spread probability <code>prob_spread = 0.5</code> and <code>wind = 0</code> for a 10 by 10 grid. Study also the dynamical play of the player. Success implies that the agent (most of the time) manages to move from a random start position to the goal (position [8,1]) while (mostly) avoiding the fire.

In [None]:
# Save model, here you could save and load Q-networks to compare
policy_model.save('my_network.h5')  # HDF5 format; Keras selects the format from the file extension

In [None]:
# load model
policy_model = load_model('my_network.h5')

# Plotting : State value function

In [None]:
# State value function of network, i.e. max over actions of the Q-function

# Initialize state, this means that the fire is in the starting position
state , player_coordinate = world.make_state(False)

# set position to zero
state[player_coordinate[0],player_coordinate[1],0] = 0

# to plot
z= np.zeros((world.size[0],world.size[1]))

# Go through all possible position of the player and calculate the value of the best action
# according to the network
for x in range(world.size[0]) :
    for y in range(world.size[1]) :
            player_coordinate=[x,y]
            state[player_coordinate[0],player_coordinate[1],0] = 1
            q_state = policy_model.predict(state.reshape(1,world.size[0],world.size[1],world.layers)).reshape(4)
            z[x,y] =q_state.max()
            state[x,y,0] = 0
        

# Plot        
fig=plt.figure(figsize=(10, 10), dpi= 80, facecolor='w', edgecolor='k')
plt.imshow(np.swapaxes(z,0,1))
plt.colorbar()
plt.gca().invert_yaxis()
plt.xticks(np.arange(0, world.size[0], dtype=int))
plt.yticks(np.arange(0, world.size[1], dtype=int))
plt.show()

### Q. 4.2.3 Show that you have trained a network that gives good state values corresponding to a functional agent. (1p)
If the dynamic play below works, this should also work. Discuss briefly how one can see that the state value gives a working agent.  
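
The relation between the Q-values, the state value and the greedy action can be made explicit on an array of made-up numbers. In the sketch below `q_values` is random and only stands in for the network output at every player position; the action indices follow `make_move` in the provided code (0 up, 1 down, 2 left, 3 right).

```python
import numpy as np

rng = np.random.default_rng(1)
q_values = rng.normal(size=(10, 10, 4))    # made-up Q(s, a) for every position

state_value = q_values.max(axis=-1)        # V(s) = max over actions of Q(s, a)
greedy_action = q_values.argmax(axis=-1)   # the action the agent would take

action_names = np.array(["up", "down", "left", "right"])
print(state_value.shape, action_names[greedy_action[1, 8]])
```

For a trained network, following the greedy actions from any start position should lead through squares of increasing state value towards the goal.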

# Plotting : Dynamic play

To challenge the agent, increase the probability of fire spread, and turn on and increase the wind.

In [None]:
# dynamic game replay

for a in range(100):
    # fire spreading
    world.prob_spread = 0.5
    world.wind = 0.1
    is_wind = False


    #get original state  
    state, player_coordinate = world.make_state(True)
    path=np.array([player_coordinate])
    
    # setup figure
    fig=plt.figure(figsize=(10, 10), dpi= 80, facecolor='w', edgecolor='k')

    done = False
    count =0
    while (not done) and (count <20) :
        count = count + 1
        
        # plot it 
        plot_grid = world.make_RGB_grid(state,path)
        plt.imshow(np.swapaxes(np.array(plot_grid),0,1))
        plt.gca().invert_yaxis()
        plt.xticks(np.arange(0, world.size[0], dtype=int))
        plt.yticks(np.arange(0, world.size[1], dtype=int))
        # clear figure and wait
        clear_output(wait=True)
        # time.sleep(1)
        display(fig)

        # find action
        q_state = policy_model.predict(state.reshape(1,world.size[0],world.size[1],world.layers))
        # get best action, no epsilon greedy
        action = np.argmax(q_state)      
        # make the move
        next_state ,next_player_coordinate, reward , done = world.make_move(state,action,player_coordinate,is_wind)

        # update state 
        state = next_state
        player_coordinate = next_player_coordinate

        if not done :
            path=np.append(path,[player_coordinate],axis = 0)

        # let fire spread
        new_state, fire_spread = world.let_fire_spread(state)
        if fire_spread :
            print("Fire spread")
            state = new_state 



    plt.close()
