### **Code Description**
This notebook solves the N-puzzle problem with reinforcement learning methods.
- **Methods**
    - `async_value_iteration`: Asynchronous value iteration, in which a state's value is updated when the agent visits that state. The agent follows an epsilon-greedy policy while taking actions.
    - `async_value_iteration_with_stack`: Same as above, but the agent also keeps a stack of the states it visits during an episode and updates their values in reverse order at the end of each episode.
    - `n_step_TD`: n-step temporal-difference (TD) learning. It can optionally be combined with the one-step Bellman update used in value iteration (`plus_value_iteration`) and with the reverse-order stack update (`plus_value_iteration_with_stack`).
- **Implementation Details**
    - The agent receives a reward of -1 for every action, except for the action that takes it to the goal state, for which it receives a large positive reward.
    - Reward shaping is used to help the agent find the goal state faster.
    - `Manhattan distance + linear conflict` is used as the potential function.
    - Setting `update_on_increase` to `True` forces the agent to update a state value only when the new value is higher than the previous one. In my experiments, this increased the training speed significantly.
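
The reward scheme and shaping described above can be sketched as potential-based reward shaping. This is a minimal illustration rather than the notebook's implementation: `manhattan_distance`, `potential`, and `shaped_reward` are hypothetical names, the goal layout (tiles in row-major order with the blank last) and `gamma = 1.0` are assumptions, and the linear-conflict term is left out for brevity. The potential is taken as the negated heuristic, which is consistent with the negative values in the training log: for the board `1 6 3 / 4 0 5 / 7 2 8` it gives -6, the value logged for that board.

```python
from typing import Sequence


def manhattan_distance(state: Sequence[int], n: int) -> int:
    """Sum of Manhattan distances of all tiles from their goal cells.

    `state` lists the tiles in row-major order; 0 is the blank and is skipped.
    Assumed goal: 1, 2, ..., n*n - 1 in row-major order, blank in the last cell.
    """
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue
        goal_index = tile - 1
        total += abs(index // n - goal_index // n) + abs(index % n - goal_index % n)
    return total


def potential(state: Sequence[int], n: int) -> int:
    """Higher (closer to 0) means closer to the goal. Linear conflict omitted."""
    return -manhattan_distance(state, n)


def shaped_reward(reward: float, state: Sequence[int], next_state: Sequence[int],
                  n: int, gamma: float = 1.0) -> float:
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return reward + gamma * potential(next_state, n) - potential(state, n)


before = (6, 2, 3, 1, 0, 5, 7, 8, 4)
after = (6, 2, 3, 1, 5, 0, 7, 8, 4)   # tile 5 slides into its goal cell
print(potential(before, 3), potential(after, 3), shaped_reward(-1, before, after, 3))
```

A move that lowers the heuristic by one cancels the -1 step cost, while a move that raises it is penalized twice as hard, which is what nudges the agent toward the goal.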

In [7]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
import utils
from environment import Environment
from agent import Agent

In [3]:
N = 3
max_episodes = 300
max_steps = 5000
learning_rate = 1.0
epsilon_start = 0.6
epsilon_end = 0.6
default_state_value = 0.0
epsilon_decay_type = "linear"   # Should be "linear" or "exponential"
update_on_increase = True

env = Environment(N)

agent = Agent(env)

initial_state_string = utils.get_random_state_string(N)

agent.async_value_iteration_with_stack(initial_state_string, 
                                       max_episodes, 
                                       max_steps, 
                                       learning_rate, 
                                       epsilon_start, 
                                       epsilon_end, 
                                       default_state_value, 
                                       epsilon_decay_type, 
                                       update_on_increase)

Higher potential reached: -21
8  2  7  
0  6  5  
3  4  1  
Higher potential reached: -20
8  2  7  
3  6  5  
0  4  1  
Higher potential reached: -19
2  6  7  
8  5  0  
3  4  1  
Higher potential reached: -18
2  6  0  
8  5  7  
3  4  1  
Higher potential reached: -17
2  0  6  
8  5  7  
3  4  1  
Higher potential reached: -16
0  2  6  
8  5  7  
3  4  1  
Higher potential reached: -15
7  6  1  
3  4  0  
5  2  8  
Higher potential reached: -14
4  3  6  
7  0  8  
5  1  2  
Higher potential reached: -13
4  3  6  
7  8  0  
5  1  2  
Higher potential reached: -12
4  8  3  
7  6  2  
5  1  0  
Higher potential reached: -11
2  3  5  
8  4  6  
1  0  7  
Higher potential reached: -10
2  3  5  
1  8  6  
0  4  7  
Higher potential reached: -9
2  3  5  
1  8  6  
4  0  7  
Higher potential reached: -8
6  2  3  
1  0  5  
7  8  4  
Higher potential reached: -7
6  2  3  
1  5  0  
7  8  4  
Higher potential reached: -6
1  6  3  
4  0  5  
7  2  8  
Higher potential reached: -5
1  6  3  
4  2 

In [39]:
N = 3
max_episodes = 200
max_steps = 2000
learning_rate = 1.0
epsilon_start = 0.1
epsilon_end = 0.1
default_state_value = 0.0
theta = 1e-3

env = Environment(N)

agent = Agent(env)

initial_state_string = utils.get_random_state_string(N)

agent.async_value_iteration(initial_state_string, 
                            max_episodes, 
                            max_steps, 
                            learning_rate, 
                            epsilon_start, 
                            epsilon_end, 
                            default_state_value, 
                            theta)

State is not valid!
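
The difference between the two value-iteration variants described at the top can be sketched on a toy problem. This is only an illustration under stated assumptions: a six-state chain stands in for the puzzle's state graph, and `actions`, `step`, `backup`, `replay_in_reverse`, and `run_episode` are hypothetical names, not the `Agent` API. It shows the one-step Bellman backup on visited states, the `update_on_increase` rule, and the reverse-order replay of the episode's stack.

```python
import random
from collections import defaultdict

GOAL = 5  # toy chain 0..5 standing in for the puzzle's state graph


def actions(state):
    """Both moves are always offered; moves off the chain are clamped in step()."""
    return (-1, +1)


def step(state, action):
    """Deterministic transition: -1 per move, +100 for the move that reaches GOAL."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 100.0 if next_state == GOAL else -1.0
    return reward, next_state


def lookahead(values, state, action):
    """Reward plus the current value estimate of the successor state."""
    reward, next_state = step(state, action)
    return reward + values[next_state]


def backup(values, state, learning_rate=1.0, update_on_increase=True):
    """One-step Bellman backup of a single state."""
    target = max(lookahead(values, state, a) for a in actions(state))
    new_value = values[state] + learning_rate * (target - values[state])
    if not update_on_increase or new_value > values[state]:
        values[state] = new_value


def replay_in_reverse(values, visited):
    """Pop the episode's states off the stack, backing up the most recent first."""
    stack = list(visited)
    while stack:
        backup(values, stack.pop())


def run_episode(values, start, epsilon, max_steps, rng, use_stack=True):
    state, visited = start, []
    for _ in range(max_steps):
        if state == GOAL:
            break
        backup(values, state)              # asynchronous: update the visited state
        visited.append(state)
        if rng.random() < epsilon:         # explore
            action = rng.choice(actions(state))
        else:                              # exploit: greedy one-step lookahead
            action = max(actions(state), key=lambda a: lookahead(values, state, a))
        _, state = step(state, action)
    if use_stack:
        replay_in_reverse(values, visited)
    return values


values = run_episode(defaultdict(float), 0, 1.0, 10000, random.Random(0))
print([values[s] for s in range(GOAL)])
```

Without the stack, a single pass only improves the state next to the goal; replaying the stack in reverse lets the goal reward propagate along the whole path in the same episode.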


In [15]:
N = 7                                       # Size of the puzzle
n = 10                                      # n of n-step TD
max_episodes = 100
max_steps = 100000                          # Maximum number of steps per episode
epsilon_start = 0.65
epsilon_end = 0.65
default_state_value = 0.0
update_on_increase = True                   # Update state values only when the new estimate is greater than the old estimate
epsilon_decay_type = "linear"
plus_value_iteration = True                 # Add one-step Bellman update (the one used in value iteration) to n-step TD
plus_value_iteration_with_stack = True      # Repeat the one-step Bellman update on states visited in the episode in reverse order


env = Environment(N)

agent = Agent(env)

initial_state_string = utils.get_random_state_string(N)

agent.n_step_TD(initial_state_string, 
                n, 
                max_episodes, 
                max_steps, 
                epsilon_start, 
                epsilon_end, 
                default_state_value, 
                update_on_increase, 
                epsilon_decay_type, 
                plus_value_iteration, 
                plus_value_iteration_with_stack)

Higher potential reached: -211
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 35 32 11 
4  18 40 14 28 1  33 
31 3  5  21 9  13 30 
37 36 43 41 12 29 0  
22 46 7  15 48 39 17 
Higher potential reached: -210
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 35 32 11 
4  18 40 14 28 1  33 
31 3  5  21 9  13 0  
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
Higher potential reached: -209
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 35 32 11 
4  18 40 14 28 1  0  
31 3  5  21 9  13 33 
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
Higher potential reached: -208
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 0  32 11 
4  18 40 14 35 28 1  
31 3  5  21 9  13 33 
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
Higher potential reached: -207
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 32 0  11 
4  18 40 14 35 28 1  
31 3  5  21 9  13 33 
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
Higher potential reached: -206
8  23 45 20 16 6  19 
27 34 26 2  10 0  47 


KeyboardInterrupt: 
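
For reference, the plain n-step TD target that `n_step_TD` is named after can be sketched as follows. This is a generic textbook-style sketch, not the notebook's code: `n_step_return` and `n_step_td_update` are hypothetical names, the recorded trajectory is assumed to end in a terminal state, and the extras enabled above (`update_on_increase`, `plus_value_iteration`, `plus_value_iteration_with_stack`) are not reproduced.

```python
def n_step_return(values, states, rewards, t, n, gamma=1.0):
    """n-step return from time t.

    states[k] is the state at time k; rewards[k] is the reward for the move
    from states[k] to states[k + 1]. The last state is assumed to be terminal.
    """
    last = len(rewards)
    end = min(t + n, last)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < last:
        # The episode did not end within n steps: bootstrap from the estimate.
        g += gamma ** (end - t) * values[states[end]]
    return g


def n_step_td_update(values, states, rewards, t, n, alpha=1.0, gamma=1.0):
    """Move the value of states[t] toward its n-step return."""
    target = n_step_return(values, states, rewards, t, n, gamma)
    values[states[t]] += alpha * (target - values[states[t]])
    return values[states[t]]


values = {"a": 0.0, "b": 0.0, "c": 5.0, "goal": 0.0}
states = ["a", "b", "c", "goal"]
rewards = [-1.0, -1.0, 100.0]
print(n_step_return(values, states, rewards, t=0, n=2))   # two rewards, then bootstrap from "c"
print(n_step_return(values, states, rewards, t=0, n=5))   # n exceeds the episode: full return
```

With a large `n` relative to the episode length, the target becomes the full observed return, so a small `n` leans on the current estimates and a large `n` leans on the sampled trajectory.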

In [None]:
if __name__ == "__main__":
    N = 3
    n = 1000
    max_episodes = 80
    max_steps = 1000
    epsilon_start = 0.4
    epsilon_end = 0.01
    default_state_value = 0.0
    update_on_increase = True
    epsilon_decay_type = "linear"
    plus_value_iteration = True
    plus_value_iteration_with_stack = True 
    process_count = 10

    utils.parallel_processing(process_count, 
                              N, 
                              n, 
                              max_episodes, 
                              max_steps, 
                              epsilon_start, 
                              epsilon_end, 
                              default_state_value, 
                              update_on_increase, 
                              epsilon_decay_type, 
                              plus_value_iteration, 
                              plus_value_iteration_with_stack)

In [None]:
N = 3
n = 50
max_episodes = 300
max_steps = 1000
epsilon_start = 0.5
epsilon_end = 0.01
default_state_value = 0.0


env = Environment(N)

agent = Agent(env)

initial_state_string = utils.get_random_state_string(N)

agent.n_step_TD_2(initial_state_string, 
                  n, 
                  max_episodes, 
                  max_steps, 
                  epsilon_start, 
                  epsilon_end, 
                  default_state_value)
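
Some cells in this notebook keep `epsilon_start` equal to `epsilon_end` (constant exploration), while this one lets epsilon fall from 0.5 to 0.01. How the agent interpolates between the two is not shown here, so the following is only a sketch of the two schedule types named by `epsilon_decay_type`. The function `epsilon_at` and its per-episode interpolation are assumptions, not the `Agent` implementation.

```python
def epsilon_at(episode, max_episodes, epsilon_start, epsilon_end, decay_type="linear"):
    """Exploration rate for a 0-based episode index.

    "linear" interpolates in a straight line; "exponential" interpolates
    geometrically and needs both endpoints to be strictly positive.
    """
    if max_episodes <= 1:
        return epsilon_end
    fraction = episode / (max_episodes - 1)   # 0.0 at the first episode, 1.0 at the last
    if decay_type == "linear":
        return epsilon_start + fraction * (epsilon_end - epsilon_start)
    if decay_type == "exponential":
        return epsilon_start * (epsilon_end / epsilon_start) ** fraction
    raise ValueError('decay_type should be "linear" or "exponential"')


for episode in (0, 150, 299):
    print(episode,
          round(epsilon_at(episode, 300, 0.5, 0.01, "linear"), 4),
          round(epsilon_at(episode, 300, 0.5, 0.01, "exponential"), 4))
```

The exponential schedule drops quickly at first and spends most episodes near `epsilon_end`, whereas the linear one explores more for longer; with equal endpoints both reduce to a constant.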

In [16]:
agent.exploit(initial_state_string)

8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 35 32 11 
4  18 40 14 28 1  33 
31 3  5  21 9  13 30 
37 36 43 41 12 29 0  
22 46 7  15 48 39 17 
- step: 1 -
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 35 32 11 
4  18 40 14 28 1  33 
31 3  5  21 9  13 0  
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
- step: 2 -
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 35 32 11 
4  18 40 14 28 1  0  
31 3  5  21 9  13 33 
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
- step: 3 -
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 35 32 11 
4  18 40 14 28 0  1  
31 3  5  21 9  13 33 
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
- step: 4 -
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 35 32 11 
4  18 40 14 0  28 1  
31 3  5  21 9  13 33 
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
- step: 5 -
8  23 45 20 16 6  19 
27 34 26 2  10 24 47 
25 42 38 44 0  32 11 
4  18 40 14 35 28 1  
31 3  5  21 9  13 33 
37 36 43 41 12 29 30 
22 46 7  15 48 39 17 
- step: 6 -
8  2