# Intro

The plan is to have a player blob (blue) that tries to reach the food blob (green) as quickly as possible while avoiding the enemy blob (red). We could make this super smooth with high definition, but we already know we'd have to break it down into discrete observations anyway, so let's just start in a discrete space. Something between a 10x10 and 20x20 grid should suffice. Do note that the larger you go, the larger your Q-table will be, both in the memory it takes up and in the time it takes the model to actually learn. So our environment will be a 10x10 grid (`SIZE = 10` in the code below) with 1 player, 1 enemy, and 1 food. For now, only the player moves, in an attempt to reach the food, which yields a reward.

## Explanation
### 1. Hyperparameters and Constants
Grid and Episodes:

- `SIZE`: The side length of the square grid; with `SIZE = 10` the environment is 10x10.
- `HM_EPISODES`: The total number of episodes (iterations) for which the agent will be trained.

Rewards and Penalties:

- `MOVE_PENALTY`: The penalty (negative reward) for each move made by the player.
- `ENEMY_PENALTY`: The penalty for the player colliding with the enemy.
- `FOOD_REWARD`: The reward for the player reaching the food.

Exploration-Exploitation Parameters:

- `epsilon`: Initial probability of choosing a random action (exploration).
- `EPS_DECAY`: Factor by which `epsilon` is multiplied after each episode, reducing exploration over time.

Display Control:

- `SHOW_EVERY`: Controls how often (in terms of episodes) the environment is visually displayed.

Q-Learning Parameters:

- `start_q_table`: A filename to load a pre-trained Q-table from, or `None` to start fresh.
- `LEARNING_RATE`: Determines how much newly acquired information overrides old information.
- `DISCOUNT`: Discount factor for future rewards.

Identifiers and Colors:

- `PLAYER_N`, `FOOD_N`, `ENEMY_N`: Numeric identifiers for the player, food, and enemy in the environment.
- `d`: A dictionary mapping these identifiers to BGR color values for visualization.
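
To get a feel for how quickly exploration fades, the following sketch computes epsilon after a given number of episodes, using the values from this notebook (epsilon = 0.9, EPS_DECAY = 0.9998). The helper name `epsilon_after` is made up for illustration and is not part of the notebook code.

```python
# Sketch: how epsilon shrinks when it is multiplied by EPS_DECAY once per episode.
# epsilon_after is a hypothetical helper, not part of the notebook.
START_EPSILON = 0.9
EPS_DECAY = 0.9998


def epsilon_after(episodes, start=START_EPSILON, decay=EPS_DECAY):
    """Return epsilon after `episodes` multiplicative decay steps."""
    return start * decay ** episodes


for n in (0, 5000, 12500, 25000):
    print(f"after {n:>5} episodes: epsilon = {epsilon_after(n):.4f}")
```

With these settings epsilon falls to roughly 0.33 after 5,000 episodes and is below 0.01 by the end of 25,000 episodes, so late training is almost entirely greedy.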

### 2. Blob class
The `Blob` class represents an entity (player, food, or enemy) on the grid.

- Constructor (`__init__`): Initializes the blob at a random position within the grid.
- `__str__` method: Returns a string representation of the blob's coordinates, useful for debugging.
- `__sub__` method: Defines the subtraction operation between two blobs, returning their relative offset as a tuple (dx, dy).
- `action` method: Takes an action (0-3) and moves the blob one step diagonally in one of four directions.
- `move` method: Moves the blob by the provided x and y values, or by a random step of -1, 0, or 1 on any axis where no value is provided, and then clamps the position so the blob remains within the grid boundaries.

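
The offset and clamping behavior described above can be tried in isolation. The sketch below is a stripped-down stand-in for the Blob class (named `MiniBlob` here to make clear it is not the notebook's class), and it takes explicit coordinates instead of random ones so the results are predictable.

```python
# Sketch: relative offsets via __sub__ and clamping to the grid.
# MiniBlob is a hypothetical, simplified stand-in for the notebook's Blob class.
SIZE = 10


class MiniBlob:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __sub__(self, other):
        # Offset from `other` to `self`, as a (dx, dy) tuple
        return (self.x - other.x, self.y - other.y)

    def move(self, dx, dy):
        # Apply the step, then clamp each coordinate to [0, SIZE - 1]
        self.x = min(max(self.x + dx, 0), SIZE - 1)
        self.y = min(max(self.y + dy, 0), SIZE - 1)


player = MiniBlob(9, 0)
food = MiniBlob(3, 4)

print(player - food)   # offset used as half of the observation
player.move(1, -1)     # would leave the grid on both axes
print(player.x, player.y)
```

Because the position is clamped rather than rejected, a diagonal move into a corner leaves the blob where it was, and a diagonal move into a wall slides it along the wall.
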
### 3. Q-table initialization
The Q-table is a dictionary that maps observations (states) to a list of Q-values, one for each possible action.

- Initialization: If `start_q_table` is `None`, the code fills the Q-table with random values drawn uniformly from [-5, 0) for every possible state. Each state is a tuple of two offsets, (player - food, player - enemy), and each entry holds four Q-values, one per action.
- Loading a pre-trained Q-table: If `start_q_table` is not `None`, the code loads an existing Q-table from that file using pickle.
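
As a rough illustration of the initialization step, the sketch below builds the same kind of dictionary for a deliberately tiny grid. The `SIZE = 3` value and the use of the standard-library `random.uniform` in place of `np.random.uniform` are assumptions made to keep the example small and dependency-free.

```python
# Sketch: build a Q-table for a small grid and check how many states it holds.
# A SIZE of 3 is an assumption chosen to keep the example fast.
import itertools
import random

SIZE = 3
N_ACTIONS = 4

offsets = range(-SIZE + 1, SIZE)  # every possible dx or dy between two blobs

q_table = {
    ((x1, y1), (x2, y2)): [random.uniform(-5, 0) for _ in range(N_ACTIONS)]
    for x1, y1, x2, y2 in itertools.product(offsets, repeat=4)
}

print(len(q_table))                # number of states: (2 * SIZE - 1) ** 4
print(q_table[((0, 1), (-2, 2))])  # four Q-values for one observation
```

Each offset component ranges over 2 * SIZE - 1 values, which is why the table includes states (such as very large offsets to both blobs at once) that may rarely or never occur in practice.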

### 4. Main training loop
At the start of each episode, the player, food, and enemy are initialized as Blob objects at random positions on the grid.
Every `SHOW_EVERY` episodes, the code sets `show` to `True` and prints the current episode number, the current `epsilon`, and the mean reward over the last `SHOW_EVERY` episodes.
While `show` is `True`, the environment is rendered on every step of that episode, so you can watch the agent's behavior at regular intervals.

### 5. Episode execution
Observations and Actions:

- `obs`: The current state, represented by the relative positions of the player to the food and enemy.
- The agent selects an action using an epsilon-greedy strategy:
  - With probability `epsilon`, it takes a random action (exploration).
  - Otherwise, it chooses the action with the highest Q-value for the current state (exploitation).

Action Execution:

- The chosen action is executed by calling `player.action(action)`, which moves the player on the grid.
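
The epsilon-greedy choice described above can be sketched as a small standalone function. `choose_action` is a name invented for this example (the notebook inlines the logic), and the standard-library `random` module stands in for `np.random` here.

```python
# Sketch: epsilon-greedy action selection over one row of Q-values.
# choose_action is a hypothetical helper; the notebook inlines this logic.
import random


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() > epsilon:
        # Exploit: index of the largest Q-value
        return max(range(len(q_values)), key=lambda a: q_values[a])
    # Explore: any action, uniformly at random
    return rng.randrange(len(q_values))


q_values = [-3.2, -0.4, -1.7, -4.9]
print(choose_action(q_values, epsilon=0.0))  # effectively always greedy
print(choose_action(q_values, epsilon=1.0))  # always random
```

Passing a seeded `random.Random` instance as `rng` makes the choices reproducible, which can help when debugging.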

# Requirements

In [10]:
import numpy as np
from PIL import Image  # for creating visual env
import cv2  # for showing our visual live
import matplotlib.pyplot as plt
import pickle  # to save/load Q-Tables
from matplotlib import style  # to make pretty charts.
import time  # using this to keep track of our saved Q-Tables.

# Environment size, constants and variables
For example, the Q-table for a 10x10 grid is ~15MB in this case, while the one for a 20x20 grid is ~195MB.
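
The growth comes from the number of states. As a quick sanity check, this sketch counts the entries for both grid sizes, assuming the state layout used in this notebook (two (dx, dy) offsets, four actions). The function name `q_table_entries` is made up for the example, and it counts entries only; it does not estimate memory.

```python
# Sketch: how the number of Q-table entries grows with the grid size.
def q_table_entries(size, n_actions=4):
    """States are two (dx, dy) offsets, each component in [-(size-1), size-1]."""
    n_states = (2 * size - 1) ** 4
    return n_states, n_states * n_actions


for size in (10, 20):
    states, values = q_table_entries(size)
    print(f"{size}x{size}: {states:,} states, {values:,} Q-values")
```

A 10x10 grid gives 130,321 states and a 20x20 grid gives 2,313,441, so doubling the side length multiplies the number of states by roughly 18.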

In [11]:
style.use('ggplot')
SIZE = 10
HM_EPISODES = 25000
MOVE_PENALTY = 1
ENEMY_PENALTY = 300
FOOD_REWARD = 25
epsilon = 0.9
EPS_DECAY = 0.9998
SHOW_EVERY = 3000
# In case you have a q table, load here (filename)
start_q_table = None
LEARNING_RATE = 0.1
DISCOUNT = 0.95
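# Keys into the color dict
PLAYER_N = 1
FOOD_N = 2
ENEMY_N = 3
# Colors in BGR order (the frame is displayed with cv2.imshow)
d = {PLAYER_N: (255, 175, 0),  # player: blue
     FOOD_N: (0, 255, 0),      # food: green
     ENEMY_N: (0, 0, 255)}     # enemy: red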

# Blob

In [12]:
class Blob:
    def __init__(self):
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)
    def __str__(self):
        return f'{self.x}, {self.y}'
    def __sub__(self, other):
        return (self.x - other.x, self.y - other.y)
    def action(self, choice):
        # Four diagonal moves; there is no straight up/down/left/right action
        if choice == 0:
            self.move(x=1, y=1)
        elif choice == 1:
            self.move(x=-1, y=-1)
        elif choice == 2:
            self.move(x=-1, y=1)
        elif choice == 3:
            self.move(x=1, y=-1)
    def move(self, x=False, y=False):
        # A missing (or zero) value falls back to a random step of -1, 0, or 1
        if not x:
            self.x += np.random.randint(-1, 2)
        else:
            self.x += x
        if not y:
            self.y += np.random.randint(-1, 2)
        else:
            self.y += y

        # Clamp the position so the blob stays on the grid
        if self.x < 0:
            self.x = 0
        elif self.x > SIZE-1:
            self.x = SIZE-1
        if self.y < 0:
            self.y = 0
        elif self.y > SIZE-1:
            self.y = SIZE-1

# Q table

In [None]:
if start_q_table is None:
    q_table = {}
    # Keys are ((x1, y1), (x2, y2)): the player-food and player-enemy offsets
    for x1 in range(-SIZE+1, SIZE):
        for y1 in range(-SIZE+1, SIZE):
            for x2 in range(-SIZE+1, SIZE):
                for y2 in range(-SIZE+1, SIZE):
                    # One random starting Q-value per action
                    q_table[((x1, y1), (x2, y2))] = [np.random.uniform(-5, 0) for _ in range(4)]
else:
    with open(start_q_table, 'rb') as f:
        q_table = pickle.load(f)

episode_rewards = []
for episode in range(HM_EPISODES):
    player = Blob()
    food = Blob()
    enemy = Blob()

    if episode % SHOW_EVERY == 0:
        print(f'on # {episode}, epsilon: {epsilon}')
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False

    episode_reward = 0

    for i in range(200):
        obs = (player - food, player - enemy)
        if np.random.random() > epsilon:
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)

        player.action(action)
        # MAYBE: uncomment to make the enemy and food move as well
        # enemy.move()
        # food.move()
        # Rewarding
        if player.x == enemy.x and player.y == enemy.y:
            reward = -ENEMY_PENALTY
        elif player.x == food.x and player.y == food.y:
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY

        # Q values and information
        new_obs = (player - food, player - enemy)
        max_future_q = np.max(q_table[new_obs])
        current_q = q_table[obs][action]

        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        elif reward == -ENEMY_PENALTY:
            new_q = -ENEMY_PENALTY
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        q_table[obs][action] = new_q

        # Accumulate this step's reward into the running total for the episode
        episode_reward += reward

        # Displaying the environment
        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)
            env[food.x][food.y] = d[FOOD_N]
            env[player.x][player.y] = d[PLAYER_N]
            env[enemy.x][enemy.y] = d[ENEMY_N]

            img = Image.fromarray(env, 'RGB')
            img = img.resize((300, 300))
            cv2.imshow('', np.array(img))

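            # Pause longer on the final frame; press 'q' to stop this episode
            if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
                if cv2.waitKey(500) & 0xFF == ord('q'):
                    break
            else:
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break

        # End the episode once the player reaches the food or hits the enemy
        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break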

    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY
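
The non-terminal update in the loop above is the standard tabular Q-learning rule: a weighted blend of the old estimate and the newly observed target. Pulled out into a function (`q_update` is a name chosen here for illustration, not something defined in the notebook), it can be sketched as:

```python
# Sketch: the tabular Q-learning update used for non-terminal steps.
# q_update is a hypothetical helper; the notebook computes new_q inline.
LEARNING_RATE = 0.1
DISCOUNT = 0.95


def q_update(current_q, reward, max_future_q,
             lr=LEARNING_RATE, discount=DISCOUNT):
    """Blend the old estimate with the target reward + discount * max_future_q."""
    target = reward + discount * max_future_q
    return (1 - lr) * current_q + lr * target


# One step with a move penalty of -1 and a best next-state value of -1.0
print(q_update(current_q=-2.0, reward=-1, max_future_q=-1.0))
```

With a learning rate of 0.1, each update moves the stored value 10% of the way toward the target, so a single noisy step changes the table only slightly.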

# Graphs and saving

In [None]:
moving_avg = np.convolve(episode_rewards, np.ones((SHOW_EVERY,)) / SHOW_EVERY, mode='valid')
plt.plot(range(len(moving_avg)), moving_avg)
plt.ylabel(f"Reward {SHOW_EVERY}ma")
plt.xlabel("episode #")
plt.show()

with open(f"qtable-{int(time.time())}.pickle", "wb") as f:
    pickle.dump(q_table, f)
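
To see what the `np.convolve` call does to the reward history, here is a small sketch on a made-up series. The reward values and the window of 3 are assumptions for illustration only, not output from training.

```python
# Sketch: moving average with np.convolve on a short, made-up reward series.
import numpy as np

rewards = [1, 2, 3, 4, 5, 6]  # illustrative values, not real training output
window = 3

# A kernel of equal weights summing to 1 averages each run of `window` values
kernel = np.ones(window) / window
moving_avg = np.convolve(rewards, kernel, mode="valid")

print(moving_avg)       # averages of [1,2,3], [2,3,4], [3,4,5], [4,5,6]
print(len(moving_avg))  # len(rewards) - window + 1
```

Because `mode="valid"` only returns positions where the window fits completely, the plotted curve is `SHOW_EVERY - 1` points shorter than `episode_rewards`.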