# Deep learning of rewards

This notebook demonstrates the **DeepRewardController**, which uses a neural network to predict the action with the highest reward value. A perfectly trained controller would therefore behave identically to the **RewardController**.
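
To make "the action with the highest reward" concrete, here is a minimal sketch rather than the library's code. It assumes a small discrete set of candidate actions and uses hypothetical `step` and `reward` functions on a toy one-dimensional state; none of these names come from the `pod` package.

```python
from typing import Callable, Sequence


def best_action(state: float,
                actions: Sequence[float],
                step: Callable[[float, float], float],
                reward: Callable[[float], float]) -> float:
    """Return the candidate action whose resulting state scores highest."""
    return max(actions, key=lambda a: reward(step(state, a)))


# Toy setup (hypothetical): the state is a position on a line, an action is a
# displacement, and the reward grows as we end up closer to the target.
TARGET = 10.0


def toy_step(pos: float, move: float) -> float:
    return pos + move


def toy_reward(pos: float) -> float:
    return -abs(TARGET - pos)


chosen = best_action(7.0, [-2.0, 0.0, 2.0], toy_step, toy_reward)
print(chosen)  # 2.0, the move that ends closest to the target
```

A search like this has to evaluate every candidate on every turn. The deep variant instead tries to learn a mapping from state to the same choice, so the quality of its play depends on how closely the network reproduces the search result.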

In [None]:
from pod.board import PodBoard
from pod.ai.deep_reward_controller import DeepRewardController
from pod.ai.rewards import speed_reward

board = PodBoard.circle(4).shuffle()
controller = DeepRewardController(board, speed_reward)

### Training

First, we create some training data: a bunch of pods in various states around the target checkpoint.

In [None]:
from pod.ai.ai_utils import gen_pods
from pod.constants import Constants
import math
import numpy as np

pods_everywhere = gen_pods(
    board.checkpoints[0],
    np.arange(Constants.check_radius(), 10000, 1000),
    np.arange(math.pi * -0.9, math.pi * 0.91, math.pi * 0.2),
    np.arange(math.pi * -0.9, math.pi * 0.91, math.pi * 0.2),
    np.arange(0, Constants.max_vel() + 1, Constants.max_vel() / 5)
)

# TODO: training goes much better if I add extra pods pointing towards the check...why?
pods_focused = gen_pods(
    board.checkpoints[0],
    np.arange(Constants.check_radius(), 10000, 1000),
    np.arange(-0.3, 0.3, 0.05),
    np.arange(math.pi * -0.9, math.pi * 0.91, math.pi * 0.2),
    np.arange(0, Constants.max_vel() + 1, Constants.max_vel() / 5)
)

pods = [*pods_everywhere, *pods_focused]

print("{} total states".format(len(pods)))
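
As a rough guide to how large such a training set gets, the sketch below enumerates a full grid over four sampled dimensions. It assumes that `gen_pods` combines every value of each range with every value of the others, which this notebook does not confirm; the function `grid_states`, its parameter names, and the sample values are all hypothetical.

```python
from itertools import product

import numpy as np


def grid_states(distances, angles, headings, speeds):
    """Enumerate every combination of the four sampled dimensions."""
    return list(product(distances, angles, headings, speeds))


# Illustrative sample values only; the real ranges come from Constants.
distances = np.arange(1000, 4000, 1000)   # 3 values
angles = np.linspace(-1.0, 1.0, 5)        # 5 values
headings = np.linspace(-1.0, 1.0, 5)      # 5 values
speeds = np.linspace(0.0, 100.0, 6)       # 6 values

states = grid_states(distances, angles, headings, speeds)
print(len(states))  # 3 * 5 * 5 * 6 = 450
```

Under that assumption the state count is the product of the range lengths, so refining any one range multiplies the total and the training time with it.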

Now that we have a set of pod states, we can train the model. The label for each state (i.e. its target output) is the action that produces the highest reward.

In [None]:
import matplotlib.pyplot as plt

history = controller.train(pods, 50)

plt.plot(history.history['accuracy'])
#plt.plot(history.history['loss'])
plt.legend([
    "Accuracy",
#    "Loss"
])
plt.show()
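
The labelling step and the accuracy metric plotted above can be sketched as follows. This is an illustration under assumptions, not the controller's implementation: it supposes a discrete action set, a hypothetical `rewards` matrix holding one reward per (state, action) pair, and an accuracy defined as the fraction of states where the predicted action matches the label.

```python
import numpy as np


def labels_from_rewards(rewards: np.ndarray) -> np.ndarray:
    """rewards[i, j] is the reward for taking action j in state i."""
    return np.argmax(rewards, axis=1)


def accuracy(predicted: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of states where the predicted action equals the label."""
    return float(np.mean(predicted == labels))


# Made-up rewards for 4 states and 3 candidate actions.
rewards = np.array([
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.4],
    [0.0, 0.5, 0.8],
    [0.6, 0.1, 0.2],
])

labels = labels_from_rewards(rewards)   # best action index per state
predicted = np.array([1, 0, 1, 0])      # pretend network output
print(labels.tolist(), accuracy(predicted, labels))  # [1, 0, 2, 0] 0.75
```

Note that accuracy of this kind treats a near-optimal action the same as a poor one, so a model can play reasonably well even when the curve plateaus below 1.0.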

### Play

Now that the model has been trained, let's see what it can do!

As a comparison, we also add a **SimpleController** (which simply goes full-speed toward the next checkpoint) and **RewardController** (which takes whatever action produces the highest reward).

In [None]:
from pod.drawer import Drawer
from pod.ai.reward_controller import RewardController
from pod.controller import SimpleController

TURNS = 200  # number of turns to animate

# Compare the trained controller against the two baselines
drawer = Drawer(board, controllers=[
    controller,
    RewardController(board, speed_reward),
    SimpleController(board),
])

drawer.animate(TURNS)
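
For reference, the "full speed toward the next checkpoint" baseline described above can be sketched in a few lines. This is a hypothetical stand-in, not the `SimpleController` source: the function name, the `(angle, thrust)` return shape, and the `MAX_THRUST` value are assumptions made for illustration.

```python
import math

MAX_THRUST = 100  # assumed value for illustration


def full_speed_toward(pos, target):
    """Return the heading (radians) toward target and maximum thrust."""
    dx = target[0] - pos[0]
    dy = target[1] - pos[1]
    return math.atan2(dy, dx), MAX_THRUST


angle, thrust = full_speed_toward((0.0, 0.0), (1000.0, 1000.0))
print(round(angle, 4), thrust)  # 0.7854 100
```

A baseline like this ignores the pod's current velocity, which is one reason a reward-driven controller may take a better line through the checkpoints.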

The following chart shows the rewards each controller earned during the run above.

In [None]:
drawer.chart_rewards(speed_reward, TURNS)