# Maze Harvest

Maze Harvest is an environment in which an agent is placed in a 2D grid with randomly spawning fruits. The agent can collect two types of fruit: red fruits, which have a power of 10, and green fruits, which have a power of 5.
 
When body growth is enabled (`nds=True` at reset), consuming a red fruit grows the agent's body, up to a maximum size of $\lfloor (X+Y)/2 \rfloor$, where $X$ and $Y$ are the dimensions of the 2D grid.

The agent has limited visibility: it can see fruits and walls only within a window of $n$ units in four directions (up, down, left, and right). It can, however, smell fruits across the grid in four diagonal regions: front left, front right, back left, and back right (like four Venn diagram sets, each intersecting its two adjacent sets but not the center).


The goal of the agent is to eat as many fruits as possible and to survive the maze.

### Environment and its limits
> $Avg = \lfloor (X+Y)/2 \rfloor$


- Environment size limit: $10\le X,Y \le 50$; a size outside this range is set to 10
- Maximum Fruit Spawn: $Avg$
- Number of Walls: limit $\le 30\%$ of the total cells ($20 - 25\%$ is the best range)
- Maximum Body size: $Avg$
- Default Maximum Moves Allowed: 10000 (configurable)
- Action Space: 4 actions (0 left, 1 up, 2 right, 3 down).
- Default Window Size: 10. The window size should be less than or equal to $Avg$; otherwise the default is used.
- During reset, we can set the wall proportion (`walls`, default 0.25) and enable or disable body growth (`nds`, short for "no down syndrome", default `False`).
- State Size: 16. One-hot encoded direction (4), danger and food in each of the 4 directions (8), and smell of fruits (4).
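
The limits above can be sketched as a small validation helper. This is an illustrative sketch, not the code in `maze_harvest.py`: `resolve_limits` and its return keys are hypothetical names, and it assumes that each out-of-range grid side is reset to 10 independently and that an oversized wall proportion is clamped to 0.30.

```python
def resolve_limits(x=10, y=10, window=10, walls=0.25):
    """Apply the documented limits (hypothetical helper, not part of maze_harvest.py)."""
    # Grid sides outside [10, 50] fall back to 10 (assumed to apply per side).
    x = x if 10 <= x <= 50 else 10
    y = y if 10 <= y <= 50 else 10

    # Avg = floor((X + Y) / 2) bounds both fruit spawns and body size.
    avg = (x + y) // 2

    # A window larger than Avg falls back to the default of 10.
    window = window if window <= avg else 10

    # Walls may cover at most 30% of the cells (clamping is an assumption).
    walls = min(walls, 0.30)

    return {
        "x": x,
        "y": y,
        "avg": avg,
        "max_fruits": avg,
        "max_body": avg,
        "window": window,
        "walls": walls,
    }


limits = resolve_limits(20, 20, window=25, walls=0.5)
print(limits)
```

With a 20x20 grid, $Avg$ is 20, so a requested window of 25 is rejected in favor of the default 10.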

### Reward System
- Default: $0 \to Reward$
- If the agent hits its body or a wall, or the maximum number of moves is reached: $-10 \to Reward$
- Otherwise, if the agent ate a fruit: $10 \cdot power + Reward \to Reward$
- The reward is then poisoned by the powers of the $F$ fruits already in the environment: $(Reward - \frac{1}{10}\sum_{i=1}^{F} power_i) \to Reward$

- The game is done when the agent reaches the maximum number of moves or hits its body or a wall.
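
The reward rules can be read as a short function. The sketch below is one interpretation, not the environment's actual code: `step_reward` and its arguments are hypothetical, the poison term is assumed to apply on every step (including terminal ones), and the sum is assumed to run over the fruits still on the grid after the move.

```python
RED_POWER, GREEN_POWER = 10, 5  # fruit powers from the description


def step_reward(terminal, eaten_power, remaining_powers):
    """Reward for one step (hypothetical helper).

    terminal: True if the agent hit its body or a wall, or ran out of moves.
    eaten_power: power of the fruit eaten this step, or None.
    remaining_powers: powers of the fruits still in the environment.
    """
    reward = 0.0
    if terminal:
        reward = -10.0
    elif eaten_power is not None:
        reward += 10 * eaten_power

    # Poison term: one tenth of the total power left on the grid.
    reward -= sum(remaining_powers) / 10
    return reward


ate_red = step_reward(False, RED_POWER, [GREEN_POWER, RED_POWER])
crashed = step_reward(True, None, [GREEN_POWER])
print(ate_red, crashed)
```

Under these assumptions, eating a red fruit while a green and a red fruit remain yields 100 - 1.5 = 98.5, and crashing with one green fruit left yields -10.5.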


**File:** `maze_harvest.py`
- `Environment`: Initialize new environment with given parameters.
- `play_frames`: Requires a lambda function that clears the shell; input: recorded frames.
    - Example: `play_frames(frames,lambda : clear_output(wait=True),sleep=0.3)`
    
 - **Utils**:
     - Class: `ActionSpace`.
     - Functions: `euclidean`,`gaussian_kernel` & `nxt_direction`
     - Variables: `color_map` & `directions`
     
 **File:** `dqn_tf.py`
  - `DQN`: Requires Architecture & Activation Functions (other parameters are set to default values)
  - **Utils**:
      - Classes: `ReplayMemory` & `QNetwork`

In [1]:
from maze_harvest import Environment, play_frames
from dqn_tf import DQN
import numpy as np




In [2]:
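from time import sleep
from IPython.display import clear_output

def play(net,env,slow=0.1,walls=.2,nds=False,record=False,print_now=True):
    """Run one greedy episode with `net`; if `record` is True, return the result of env.record(False)."""
    nxt_state = env.reset(walls=walls,nds=nds)
    done = False
    if record: env.record(True)  # start recording frames
    env.render(print_now)
    while not done:
        state = nxt_state
        sleep(slow)  # pause so the rendering is watchable
        # Greedy action: index of the largest predicted Q-value
        action = np.argmax(net(np.array([state])))
        nxt_state,r,done = env.step(action)
        clear_output(wait=True)
        env.render(print_now)

    if record:
        return env.record(False)  # stop recording and return the recorded frames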

In [3]:
def train(agent,env,num_episodes=100,batch_size=32,C=100,ep=10,walls=.2,nds=False):
    """Train `agent` on `env`; sync the target network every C steps and log every `ep` episodes."""
    steps=0  # total environment steps across all episodes
    for i in range(1,num_episodes+1):
        try:
            episode_loss = 0
            t = 0  # steps taken in the current episode

            # Sample Phase
            agent.decay_epsilon()
            nxt_state = env.reset(walls=walls,nds=nds)
            done = False
            while not done:
                state = nxt_state
                action = agent.e_greedy(state,env)
                nxt_state,reward,done = env.step(action)

                # Learning Phase
                episode_loss += agent.learn((state,action,reward,nxt_state,done),batch_size)
                steps +=1
                t+=1

                # Copy the online network into the target network every C steps
                if steps % C == 0: agent.update_target_network()

            if i%ep==0: print(f"Episode:{i} Score:{env.score} Moves:{env.move_count} Loss:{episode_loss/t}")
        except KeyboardInterrupt:
            print(f"Training Terminated at Episode {i}")
            return
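
The `train` loop refreshes the target network every `C` steps, which is the usual DQN arrangement. The internals of `agent.learn` are not shown in this notebook, so the following is only a generic sketch of the standard DQN target computation in NumPy. The function name `dqn_targets` and the discount `gamma` are assumptions, not values taken from `dqn_tf.py`.

```python
import numpy as np


def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """Standard DQN targets: r + gamma * max_a Q_target(s', a) for non-terminal s'.

    rewards: shape (batch,)
    next_q_target: target-network Q-values for the next states, shape (batch, actions)
    dones: shape (batch,), True where the episode ended on this transition
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    max_next = np.max(np.asarray(next_q_target, dtype=np.float64), axis=1)
    # Terminal transitions get no bootstrap term.
    return rewards + gamma * (1.0 - dones) * max_next


targets = dqn_targets(
    rewards=[1.0, -10.0],
    next_q_target=[[0.0, 2.0, 1.0, 0.0], [5.0, 5.0, 5.0, 5.0]],
    dones=[False, True],
    gamma=0.5,
)
print(targets)
```

Holding the target network fixed between updates keeps these targets from shifting on every gradient step, which is why `C` is exposed as a training parameter.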

## Agent Network Init

Architecture: 16->12($Lin$)->8($ReLU$)->4($Lin$)

HyperParameters: `eta = 5e-4, epsilon=0.7,epsilon_min=0.01`
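
To make the layer shapes concrete, here is a plain NumPy forward pass over the layer sizes and activations that are passed to `DQN` in this notebook (`[16, 12, 8, 4]` with linear, ReLU, linear). It is a sketch only: `init_params` and `forward` are hypothetical helpers, the initialization is arbitrary, and the real `QNetwork` in `dqn_tf.py` may be built differently.

```python
import numpy as np


def init_params(arch, seed=0):
    """Create one (weights, bias) pair per layer; small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(arch[:-1], arch[1:])
    ]


def forward(params, activations, x):
    """Apply each dense layer followed by its activation."""
    for (w, b), name in zip(params, activations):
        x = x @ w + b
        if name == "relu":
            x = np.maximum(x, 0.0)
        # "linear" leaves the pre-activation unchanged
    return x


arch = [16, 12, 8, 4]
af = ["linear", "relu", "linear"]
params = init_params(arch)

state = np.zeros((1, 16))            # one 16-dimensional state
q_values = forward(params, af, state)  # one Q-value per action
n_params = sum(w.size + b.size for w, b in params)
print(q_values.shape, n_params)
```

The network maps the 16-dimensional state to 4 Q-values, one per action, and has 204 + 104 + 36 = 344 trainable parameters.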

In [4]:
arch = [16,12,8,4]
af = ["linear","relu","linear"]
agent = DQN(arch,af,eta=5e-4,epsilon=0.7,epsilon_min=0.01)
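
`DQN` is constructed with `epsilon=0.7` and `epsilon_min=0.01`, and `train` calls `agent.decay_epsilon()` once per episode. The actual schedule inside `dqn_tf.py` is not shown here. The sketch below assumes a multiplicative decay with a hypothetical `rate`, and a hypothetical standalone `e_greedy` that takes Q-values directly, purely to illustrate how the two hyperparameters interact.

```python
import numpy as np


def e_greedy(q_values, epsilon, rng, n_actions=4):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_values))


def decay_epsilon(epsilon, epsilon_min=0.01, rate=0.995):
    """Multiplicative decay floored at epsilon_min (the rate is an assumption)."""
    return max(epsilon_min, epsilon * rate)


rng = np.random.default_rng(0)
epsilon = 0.7
for _ in range(2000):
    epsilon = decay_epsilon(epsilon)

greedy_action = e_greedy([0.1, 0.9, 0.3, 0.2], 0.0, rng)
print(epsilon, greedy_action)
```

With this assumed rate, exploration reaches the 0.01 floor well before 2000 episodes, after which the agent acts almost entirely greedily.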




## Two different Environments

env1 = 10x10

env2 = 20x20

In [5]:
env1 = Environment(max_moves=50)

In [6]:
env2 = Environment(20,20,max_moves=100)

## Agent Training

In [7]:
before_train = play(agent.Q,env1,record=True,walls=.25) # before training

[Output: ANSI-colored rendering of the 10x10 grid before training; frame truncated in this export]

### 1) walls 1% (to give more chance to eat fruits)

In [8]:
train(agent,env1,500,32,C=20,ep=100,walls=.01)

Episode:100 Score:8 Moves:50 Loss:1210.0604010009765
Episode:200 Score:12 Moves:50 Loss:1185.5406982421875
Episode:300 Score:9 Moves:50 Loss:1280.5768978881836
Episode:400 Score:2 Moves:27 Loss:1053.5836351182725
Episode:500 Score:0 Moves:8 Loss:1215.3612632751465


In [9]:
play(agent.Q,env1,walls=0.05)

[Output: ANSI-colored rendering of the 10x10 grid after stage 1 training; frame truncated in this export]

### 2) walls 20% (to learn how to avoid walls and to eat)

In [None]:
train(agent,env1,4000,42,C=50,ep=100,walls=.20)

Episode:100 Score:0 Moves:3 Loss:1129.016845703125
Episode:200 Score:0 Moves:3 Loss:1101.5572102864583
Episode:300 Score:0 Moves:8 Loss:1166.0054626464844
Episode:400 Score:0 Moves:2 Loss:854.2424011230469
Episode:500 Score:0 Moves:3 Loss:1130.4477132161458
Episode:600 Score:0 Moves:4 Loss:886.9776611328125
Episode:700 Score:0 Moves:3 Loss:951.89501953125
Episode:800 Score:0 Moves:2 Loss:1292.8173828125
Episode:900 Score:0 Moves:1 Loss:1298.5020751953125
Episode:1000 Score:0 Moves:1 Loss:856.287109375
Episode:1100 Score:0 Moves:1 Loss:462.6907958984375
Episode:1200 Score:0 Moves:3 Loss:698.6008097330729
Episode:1300 Score:2 Moves:15 Loss:619.6556935628255
Episode:1400 Score:0 Moves:1 Loss:620.2691650390625
Episode:1500 Score:14 Moves:50 Loss:639.981471862793
Episode:1600 Score:1 Moves:13 Loss:516.7033503605769
Episode:1700 Score:0 Moves:1 Loss:728.6744384765625
Episode:1800 Score:3 Moves:13 Loss:632.8446831336388
Episode:1900 Score:2 Moves:13 Loss:563.5427668644832
Episode:2000 Score:0

In [None]:
play(agent.Q,env1,walls=0.2)

In [None]:
# Saving weights

agent.Q.save_weights("networks/maze_harvest/Qc1.h5")
agent.Q_target.save_weights("networks/maze_harvest/Qtc1.h5")

In [None]:
env1.max_moves = 500

In [None]:
play(agent.Q,env1,walls=0.2,nds=True)

### 3) walls 20%, nds=True

In [None]:
train(agent,env1,2000,42,C=50,ep=100,nds=True)

In [None]:
play(agent.Q,env1,walls=0.2,nds=True)

### 4) walls 30% nds= True

In [None]:
train(agent,env1,3000,42,C=50,ep=100,walls=.3,nds=True)

In [None]:
play(agent.Q,env1,walls=0.3,nds=True)

In [None]:
play(agent.Q,env1,walls=0.2,nds=True)

In [None]:
play(agent.Q,env2,walls=0.2,nds=True)

### Loading weights

In [None]:
agent.Q.load_weights("networks/maze_harvest/Qc1.h5")
agent.Q_target.load_weights("networks/maze_harvest/Qtc1.h5")

In [None]:
play(agent.Q,env1,walls=0.2)

Next:
- Train a separate network for nds=True.
- Try a sigmoid activation for the first layer.

Agent Training Notebook V1