# Introduction

Import the Gym library and the game we will play.

In [1]:
%matplotlib inline
from game import *
import gym
from gym.envs.registration import register
from gym import wrappers
import numpy as np

Register the environment and create an instance of it. The environment is a roulette game. You enter the casino with a random number of tokens between $0$ and $10$ and then bet as in ordinary roulette. The roulette wheel has $37$ pockets. If you bet on $0$ and $0$ comes up, you win $35$ tokens. If the parity of your bet matches the parity of the spin, you win $1$ token. Otherwise, you lose $1$ token.

The observation provided by the game consists of two parts:
<ol>
 <li> Your cumulative reward so far.
 <li> The last number that has fallen on the roulette.
</ol>
This representation is not perfect for reinforcement learning settings; however, it illustrates some common problems with reinforcement learning algorithms.

Action $37$ means that you want to cash in your tokens and walk away. Your observation will still contain the number of tokens you held and the next outcome on the roulette. The casino does not allow you to play on credit; therefore, if you own $0$ tokens, the game ends automatically.

The problem is the following: the casino is cheating and the roulette is rigged. When you own ten or more tokens, the dealer secretly decreases your winning probability. Also, if you own more than $20$ tokens, security becomes suspicious and expels you from the casino. Use TD-learning to show that the roulette is not fair.

In [2]:
register(
    id='smu-rl-roulette2019-v0',
    entry_point='game:RouletteEnv'
)

In [3]:
envsimple = gym.make('smu-rl-roulette2019-v0')

<h1> The passive reinforcement learning agent using temporal difference. </h1>

First, we will implement an agent with a fixed learning rate. We will use a fixed policy and observe how well it performs. With probability $20\,\%$, the policy walks away and converts the tokens into reward. (The code below additionally walks away whenever the last roulette number is greater than $27$.) Otherwise, the agent picks a number uniformly at random. The policy is therefore stochastic rather than deterministic; TD-learning works regardless.

In [4]:
def policy(observation):
    if observation[1] > 27 or np.random.random() < 0.2:
        return 37
    return np.random.randint(0, 37)

For now, we will use a fixed learning rate to see how the policy performs.

In [5]:
alpha = 0.1

Now we can implement the agent. The TD method follows the pseudocode below:
<ol> <li> Repeat (for each episode):
     <ol> <li> Initialize $s$ as the start state.
          <li> Repeat (for each step):
          <ol> <li> $a \gets$ action given by $\pi$ for $s$
               <li> Take action $a$; observe reward, $r$, and the next state $s'$
               <li> $U(s) \gets U(s) + \alpha \left( r + \gamma U(s') - U(s) \right)$
               <li> $s \gets s'$
          </ol>
          <li> until $s$ is terminal
      </ol>
</ol>
The pseudocode is taken from <a href="https://mitpress.mit.edu/books/reinforcement-learning">the book by Sutton and Barto, Figure 6.1</a>.
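
The update step of the pseudocode can be sketched in isolation as follows. This is only an illustration, not the reference solution: the helper name <code>td_update</code>, the choice of a <code>defaultdict</code> for $U$, and the convention that a terminal successor contributes no future utility are all assumptions made for this example.

```python
from collections import defaultdict


def td_update(U, s, r, s_next, alpha, gamma, terminal):
    """Apply one TD(0) update to U[s] and return the absolute size of the change."""
    # Assumption: a terminal successor state contributes no future utility.
    target = r + (0.0 if terminal else gamma * U[s_next])
    delta = alpha * (target - U[s])
    U[s] += delta
    return abs(delta)


# Unseen states start with utility 0 (an assumption of this sketch).
U = defaultdict(float)

# One hand-made transition: from state (5, 3) we receive reward 1 and land in (6, 12).
change = td_update(U, s=(5, 3), r=1.0, s_next=(6, 12),
                   alpha=0.1, gamma=0.5, terminal=False)
print(U[(5, 3)], change)
```

Returning the size of the change makes it easy to track the largest update within an episode later on.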

First, we need to initialize the number of episodes and discount factor. Pick your own values.

In [6]:
number_of_epochs = 100  # TODO: pick your own number of episodes
discount_factor = 0.5   # TODO: pick your own discount factor

The template for the code is provided and is similar to the one you have in your project. Modify the code as needed. Documentation is available at <a href="https://gym.openai.com/docs/">https://gym.openai.com/docs/</a>. You have already seen the methods <code>env.reset</code> and <code>env.render</code>. The last important method we will need is <code>env.step</code>.
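
The interaction loop built around <code>env.reset</code> and <code>env.step</code> might look like the sketch below. It assumes the classic Gym interface, where <code>step</code> returns the four values <code>(observation, reward, done, info)</code>; newer Gym releases return five values instead. The class <code>StubEnv</code> and its reward convention are hypothetical stand-ins so that the example runs without <code>game.py</code>; they are not the roulette environment.

```python
import random


class StubEnv:
    """Hypothetical stand-in with the classic Gym interface (not the roulette game)."""

    def reset(self):
        self.tokens = 3
        return (self.tokens, 0)  # observation: (tokens, last number)

    def step(self, action):
        if action == 37:  # walk away; the reward convention here is made up
            return (self.tokens, 0), float(self.tokens), True, {}
        self.tokens += random.choice([-1, 1])
        done = self.tokens == 0  # out of tokens, so the episode ends
        return (self.tokens, 0), 0.0, done, {}


def run_episode(env, policy, max_steps=1000):
    """Play one episode and return the list of observed transitions."""
    observation = env.reset()
    transitions = []
    for _ in range(max_steps):
        action = policy(observation)
        next_observation, reward, done, info = env.step(action)
        transitions.append((observation, action, reward, next_observation, done))
        observation = next_observation
        if done:
            break
    return transitions


random.seed(0)
episode = run_episode(StubEnv(), lambda obs: 37 if random.random() < 0.2 else 0)
print(len(episode), episode[-1])
```

Each recorded transition holds exactly the quantities that the TD update needs: the state, the reward, the next state, and whether the episode ended.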

In [7]:
env = wrappers.Monitor(envsimple, 'smurltutorial', force=True, video_callable=False)
U = None  # TODO: define the utility function, e.g., a dictionary, an array, or anything else

for i in range(number_of_epochs):
    observation = env.reset()
    # TODO: your code here
    # In each step, call env.step() with the action returned by policy() above.

    # env.render() shows the state of the environment

<h1>Convergence of utility values</h1>

Now that the code works, copy it to the cell below and check whether the utility values have converged. In each episode, store the utility of state $(10,1)$ and the maximum update made to $U(s)$ for any $s$.

In [None]:
env = wrappers.Monitor(envsimple, 'smurltutorial', force=True, video_callable=False)
U = None  # TODO: define the utility function
Us101 = np.zeros(number_of_epochs)
maxDelta = np.zeros(number_of_epochs)

for i in range(number_of_epochs):
    observation = env.reset()
    # TODO your code here
    #env.render()
    maxDelta[i] = 0.0  # TODO: replace the placeholder with the maximum update to U(s)
    Us101[i] = 0.0  # TODO: replace the placeholder with the value of a state (you can pick a different one if you want)

Now we will plot the utility over time. First, we need to import matplotlib. Install it with <code>pip install matplotlib</code> (unless you already did so in the last tutorial).

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

We will use this function to plot a vector of numbers.

In [None]:
def plot_series(arr, fileName=None):
    plt.plot(arr)
    if fileName is not None:
        plt.savefig(fileName)

Now plot the $U$ value of the state $(10,1)$.

In [None]:
plot_series(Us101)
# pass a file name as the second argument if you want to save the figure

We would expect the value to be around $10$. The result may depend on the value of the learning rate. If you get a plot like the following one, the values did not converge.
<center>
  <img src="not_converged.png">
</center>
However, if your results look like the one below, the value for this state is correct. Strictly speaking, the values do not converge, because we set a constant learning rate. Their average, however, converges.
<center>
    <img src="converged.png">
</center>

The maximum update to the state utility function is informative as well.

In [None]:
plot_series(maxDelta)

The plot should look like the one below.
<center>
  <img src="delta_constant_alpha.png">
</center>

The utility function did not converge. However, <b>the average values</b> of $U(s)$ converge (optionally, check this yourself).

In [None]:
# TODO: optionally check that the average values of $U(s)$ converge
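
One possible way to make this check is to plot the running average of the recorded values instead of the raw values. The sketch below uses a hypothetical helper <code>running_mean</code> and synthetic noisy data that merely stand in for <code>Us101</code>; the numbers are not results from the roulette game.

```python
from itertools import accumulate
import random


def running_mean(values):
    """Return the mean of values[0..i] for every index i."""
    return [total / (i + 1) for i, total in enumerate(accumulate(values))]


# Synthetic stand-in for Us101: a value that keeps oscillating around 10.
random.seed(0)
noisy = [10.0 + random.gauss(0.0, 1.0) for _ in range(5000)]

avg = running_mean(noisy)
print(avg[-1])  # the running mean settles near 10 although the raw values keep oscillating
```

With the recorded data, <code>plot_series(running_mean(Us101))</code> would show whether the average settles.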

Now we may check the value of the state when we own ten (or nine) tokens:

In [None]:
U101 = None  # TODO: read from U the value of the state with 10 tokens and a one on the roulette
U091 = None  # TODO: the same for 9 tokens and a one on the roulette

print(U101)
print(U091)

Strange, isn't it?

<h1> Decrease the learning rate </h1>
If your state utility function converged in the last section, you are lucky. However, we know the solution from the lecture: the learning rate $\alpha$ should decrease with the number of trials or, more specifically, with the number of visits to the current state. In the last plot, the maximum change was approximately constant over time. As a result, the utility values oscillate.

Therefore, copy the code from the last section to the cell below and choose some function so that the learning rate $\alpha$ decreases with the number of visits to state $s$. [Answer the following question yourself: Why should the learning rate be different for each state $s$?]

In [None]:
env = wrappers.Monitor(envsimple, 'smurltutorial', force=True, video_callable=False)
U = None  # TODO: define the utility function
Ns = None  # TODO: this time you have to store the number of visits to each state
Us101 = np.zeros(number_of_epochs)
maxDelta = np.zeros(number_of_epochs)

for i in range(number_of_epochs):
    observation = env.reset()
    # TODO your code here
    #env.render()
    maxDelta[i] = 0.0  # TODO: replace the placeholder with the maximum update to U(s)
    Us101[i] = 0.0  # TODO: replace the placeholder with the value of a state (you can pick a different one)

Run your code again, this time with $\alpha$ decreasing with the number of visits of a state. Check the result.

In [None]:
plot_series(Us101)

If your state utility function looks like the one below, you won, because the values converged. Also, the value looks reasonable. If not, try again.
<center>
  <img src="converged_decreasing_alpha.png">
</center>
Hint (select the text to read - it is in white): <span style="color:white">In AIMA they use $$\alpha = \frac{c}{c - 1 + \mbox{number of visits}}$$ for a constant $c$

Generally, the $\alpha$ parameter should be selected so that $\sum_t \alpha_t$ diverges and $\sum_t \alpha_t^2$ converges.</span>
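
The schedule mentioned in the hint can be sketched as follows. The function name <code>learning_rate</code>, the value $c = 10$, and the <code>defaultdict</code> used to count visits are assumptions chosen for illustration only. For this schedule, $\alpha$ decays like $1/n$ in the number of visits $n$, so the sum of the rates diverges while the sum of their squares converges.

```python
from collections import defaultdict


def learning_rate(visits, c=10.0):
    """alpha = c / (c - 1 + visits): equals 1 on the first visit, then decays."""
    return c / (c - 1.0 + visits)


# Visit counts are kept per state, so rarely visited states keep a large alpha.
visit_counts = defaultdict(int)
state = (10, 1)

alphas = []
for _ in range(3):
    visit_counts[state] += 1
    alphas.append(learning_rate(visit_counts[state]))
print(alphas)
```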

Plot the maximum change in the state utility function.

In [None]:
plot_series(maxDelta)

We see that the update is decreasing (as expected).
<center>
  <img src="delta_decreasing_alpha.png">
</center>
However, the result is not pretty in this case, because the maximum is not a robust statistic. Why do we see such high peaks even after $1\,000\,000$ games played? Compare with the following picture.
<center>
    <img src="delta_decreasing_alpha_last_year.png">
</center>

<h1> Further work </h1>

If you got to this point, you are able to continue on your own without detailed instructions. Here are some ideas for what you may want to do in the remaining time:
<ul>
  <li> Open the <code>game.py</code> file and change the number of states. Does the TD method scale? What is the maximum number of states?
  <li> On <a href="https://gym.openai.com/">https://gym.openai.com/</a> you may find plenty of environments. Pick one and pick one of the strategies that were submitted to the page. Then estimate the state utility function of the strategy.
  <li> Implement the adaptive dynamic programming algorithm and compare it to TD.
  <li> Whatever you are interested in ...
</ul>