# Deep learning of rewards

This notebook illustrates the **DeepRewardController**, which uses a neural network to predict the action with the highest reward value. In other words, a perfectly trained controller would choose the same actions as the **RewardController**.

In [1]:
from pod.board import PodBoard
from pod.ai.deep_reward_controller import DeepRewardController

board = PodBoard()
controller = DeepRewardController(board)

### Training

First, we create some training data: a bunch of pods in various states around the target checkpoint.

In [2]:
from pod.ai.ai_utils import gen_pods, MAX_VEL
from pod.constants import Constants
import math
import numpy as np

pods_everywhere = gen_pods(
    board.checkpoints[0],
    np.arange(Constants.check_radius(), 10000, 1000),
    np.arange(math.pi * -0.9, math.pi * 0.91, math.pi * 0.2),
    np.arange(math.pi * -0.9, math.pi * 0.91, math.pi * 0.2),
    np.arange(0, MAX_VEL + 1, MAX_VEL / 5)
)

# TODO: training goes much better if I add extra pods pointing towards the check...why?
pods_focused = gen_pods(
    board.checkpoints[0],
    np.arange(Constants.check_radius(), 10000, 1000),
    np.arange(-0.3, 0.3, 0.05),
    np.arange(math.pi * -0.9, math.pi * 0.91, math.pi * 0.2),
    np.arange(0, MAX_VEL + 1, MAX_VEL / 5)
)

pods = [*pods_everywhere, *pods_focused]

print("{} total states".format(len(pods)))

60000 pods generated
72000 pods generated
132000 total states
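
The training set above is a grid: every combination of the sampled distances, angles and speeds becomes one state. The sketch below shows that idea with the standard library only. It is not the actual `gen_pods` implementation; `frange` and `make_grid` are hypothetical helpers and the axis values are made up for illustration.

```python
import itertools
import math

def frange(start, stop, step):
    """Evenly spaced floats from start up to (but excluding) stop."""
    count = max(0, math.ceil((stop - start) / step))
    return [start + i * step for i in range(count)]

def make_grid(*axes):
    """Return every combination of one value taken from each axis."""
    return list(itertools.product(*axes))

# Illustrative axes only; the real notebook uses Constants and MAX_VEL.
distances = frange(1000, 4000, 1000)                            # 3 values
angles = frange(-0.9 * math.pi, 0.91 * math.pi, 0.2 * math.pi)  # 10 values
speeds = [0.0, 50.0]                                            # 2 values

grid = make_grid(distances, angles, speeds)
print("{} states in the grid".format(len(grid)))
```

The number of states is the product of the axis lengths, so adding a single value to any axis multiplies the cost of training accordingly.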


Now that we have a bunch of pod states, we can perform the training. The label for each state (i.e. its target output) is the action that produces the highest reward in that state.
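
As a rough illustration of how such labels can be computed, the sketch below scores every action in a small discrete action set and keeps the index of the best one. The action set, the `toy_reward` function and `best_action_index` are all assumptions made for this example; the real reward function and action discretization live in the `pod.ai` package and may differ.

```python
# Hypothetical discrete action set: (thrust, turn angle in radians).
ACTIONS = [(thrust, turn) for thrust in (0, 50, 100) for turn in (-0.3, 0.0, 0.3)]

def toy_reward(angle_to_target, action):
    """Toy reward (an assumption): face the target first, then prefer more thrust."""
    thrust, turn = action
    remaining = abs(angle_to_target - turn)
    return -remaining + 0.001 * thrust

def best_action_index(angle_to_target):
    """Label for one state: the index of the action with the highest reward."""
    rewards = [toy_reward(angle_to_target, action) for action in ACTIONS]
    return max(range(len(ACTIONS)), key=rewards.__getitem__)

# One label per state; here a state is reduced to its angle to the target.
labels = [best_action_index(angle) for angle in (-0.3, 0.0, 0.25)]
print(labels)
```

Framed this way, training is a classification problem: the network sees a state and is asked to output the index of the best action, which is why accuracy is a natural metric to plot.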

In [None]:
import matplotlib.pyplot as plt

history = controller.train(pods, 50)

plt.plot(history.history['accuracy'])
#plt.plot(history.history['loss'])
plt.legend([
    "Accuracy",
#    "Loss"
])
plt.show()

Epoch 1/50
...
Epoch 30/50

### Play

Now that the model has been trained, let's see what it can do!

For comparison, we also add a **SimpleController** (which simply goes full speed toward the next checkpoint) and a **RewardController** (which takes whichever action produces the highest reward).
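
To make the baseline concrete, here is a minimal sketch of the "full speed toward the next checkpoint" rule. It is not the code of `SimpleController`; the function name `simple_action`, the `MAX_THRUST` value and the tuple-based positions are assumptions made for illustration.

```python
import math

MAX_THRUST = 100  # assumed maximum thrust, for illustration only

def simple_action(pod_pos, checkpoint_pos):
    """Return (thrust, heading): full thrust, pointed straight at the checkpoint."""
    dx = checkpoint_pos[0] - pod_pos[0]
    dy = checkpoint_pos[1] - pod_pos[1]
    heading = math.atan2(dy, dx)  # absolute angle from the pod to the checkpoint
    return MAX_THRUST, heading

thrust, heading = simple_action((0.0, 0.0), (0.0, 10.0))
print(thrust, heading)
```

A rule like this ignores the pod's current velocity, so it tends to overshoot checkpoints, which makes it a useful lower bar for the learned controller.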

In [None]:
from IPython.display import Image

from pod.ai.reward_controller import RewardController
from pod.controller import SimpleController
from pod.drawer import Drawer
from pod.game import Player
from pod.util import PodState

# Number of turns to simulate
TURNS = 200

# One player per controller: the trained network and the two baselines
deep_player = Player(controller)
base_player = Player(RewardController(board))
simple_player = Player(SimpleController())

drawer = Drawer(board, [deep_player, base_player, simple_player])

# Put every player back at its starting state before the run
deep_player.reset(board)
base_player.reset(board)
simple_player.reset(board)

# Render the run as an animated GIF and display it inline
file = '/tf/notebooks/pods.gif'
drawer.animate(file, TURNS)
Image(filename=file)

The following chart shows the rewards for the three players, after resetting them and running for the same number of turns.

In [None]:
deep_player.reset(board)
base_player.reset(board)
simple_player.reset(board)

drawer.chart_rewards(TURNS)