![Logo](../../assets/logo.png)

Made by **Domonkos Nagy**

[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/Fortuz/rl_education/blob/main/3.%20Dynamic%20Programming/Gambler's%20Problem/gamblers_problem.ipynb)

# Gambler's Problem

A gambler has the opportunity to make bets on
the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many
dollars as he has staked on that flip; if it is tails, he loses his stake.

<img src="assets/coinflip.jpg" width="500"/>

The game ends
when the gambler wins by reaching his goal of \$100, or loses by running out of money.
On each flip, the gambler must decide what portion of his capital to stake, in integer
numbers of dollars. This problem can be formulated as an undiscounted, episodic, finite
MDP. The state is the gambler’s capital, $ s \in \{1, 2, \ldots, 99\} $, and the actions
are stakes, $ a \in \{0, 1, \ldots, \min(s, 100 - s)\} $. The reward is zero on all transitions
except those on which the gambler reaches his goal, when it is +1.
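
To make the formulation concrete, here is a minimal sketch of the MDP's dynamics. The helper names `legal_stakes` and `outcomes` are hypothetical (they are not part of this notebook), and the coin bias of 0.4 is assumed to match the `PROB_HEADS` constant defined further down.

```python
GOAL = 100          # target capital from the problem statement
PROB_HEADS = 0.4    # assumed coin bias, same value as the notebook constant

def legal_stakes(s):
    """Stakes available with capital s: 0, 1, ..., min(s, GOAL - s)."""
    return range(0, min(s, GOAL - s) + 1)

def outcomes(s, a):
    """Return (probability, next_state, reward) triples for staking a with capital s."""
    win, lose = s + a, s - a
    return [
        (PROB_HEADS, win, 1.0 if win == GOAL else 0.0),   # heads: gain the stake
        (1.0 - PROB_HEADS, lose, 0.0),                    # tails: lose the stake
    ]

# With a capital of 80 the gambler never needs to stake more than 20.
print(max(legal_stakes(80)))
print(outcomes(50, 50))
```

Every action has exactly two possible successor states, which keeps the expectation in the Bellman equation very cheap to evaluate.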

Our goal is to find the optimal policy for this problem. We will implement a simple
algorithm that uses *value iteration* to solve the Bellman equation for the state-value
function.

- This notebook is based on Chapter 4 of the book *Reinforcement Learning: An Introduction (2nd ed.)* by R. Sutton & A. Barto, available at http://incompleteideas.net/book/the-book-2nd.html

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
PROB_HEADS = 0.4  # Probability that a coin comes up heads
THETA = 1e-12  # Error threshold for the value iteration
GAMMA = 1  # Discount factor

In [None]:
v_estimations = np.zeros(101)  # Value function
policy = np.zeros(101, dtype=int)  # Policy

## Value Iteration

In *value iteration*, we start with an arbitrary initial value function and iteratively improve it until it converges to the optimal value function. Each iteration (or *sweep*) updates the value of every non-terminal state using the Bellman optimality equation: the value of a state is the largest expected return over all available actions, where the expected return of an action is the immediate reward plus the discounted value of the successor state, weighted by the transition probabilities.
The algorithm uses the following update rule:

$$ v_{k+1}(s) = \max_a \sum_{s', r}p(s', r | s, a) [r + \gamma v_k(s')] $$
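
For the gambler's problem, each stake has only two outcomes, so the sum collapses to two terms. Writing $p_h$ for the probability of heads (a symbol introduced here for `PROB_HEADS`) and taking $v_k(0) = v_k(100) = 0$ for the terminal states, the update can be written as:

$$ v_{k+1}(s) = \max_{a} \Big[ p_h \big( \mathbb{1}[s + a = 100] + \gamma v_k(s + a) \big) + (1 - p_h) \, \gamma v_k(s - a) \Big] $$

Here $\mathbb{1}[s + a = 100]$ is the reward: it equals 1 only when a winning flip takes the capital exactly to the goal.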

By repeatedly applying this update, the value estimates converge to the optimal value function. In practice, we stop when the magnitude of the largest update in a sweep, $\delta$, falls below a sufficiently small threshold, $\theta$.
The code cell at the end of this section is where value iteration for this problem is implemented; it plots the value function after each sweep.
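
As a warm-up for the full algorithm, the following sketch performs a single Bellman optimality backup for one state. It is only one possible way to write the update, not the reference solution: the function name `backup` is hypothetical, and the constants are redefined so the snippet runs on its own. The sweep over all states, the tracking of $\delta$, and the policy extraction are deliberately left out.

```python
import numpy as np

PROB_HEADS = 0.4   # same values as the notebook constants
GAMMA = 1

def backup(s, v):
    """One Bellman optimality backup for state s.

    Returns the best action value and the array of all action values,
    indexed by stake (0, 1, ..., min(s, 100 - s)).
    """
    stakes = np.arange(0, min(s, 100 - s) + 1)
    # Reward is +1 only when the winning transition lands exactly on 100.
    rewards = (s + stakes == 100).astype(float)
    q = (PROB_HEADS * (rewards + GAMMA * v[s + stakes])
         + (1 - PROB_HEADS) * GAMMA * v[s - stakes])
    return q.max(), q

v = np.zeros(101)            # all-zero initial estimate; terminal entries stay 0
best, q = backup(50, v)
print(best, int(np.argmax(q)))
```

With an all-zero value function, the only stake with a non-zero action value at a capital of 50 is betting everything, because it is the only one that can reach the goal in a single flip. Since `GAMMA` is 1 here, an equivalent convention is to drop the explicit reward and fix `v[100] = 1` instead.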

***

### **Your Task**

Implement value iteration! The block below only contains code necessary for plotting the state-value function at every iteration. The algorithm itself is up to you! Pseudocode for this algorithm is shown in the box below.

<img src="assets/value_iteration.png" width="700"/>

*Pseudocode from page 83 of the Sutton & Barto book*

#### **Hints:**

- Remember that terminal states are excluded from the update loop!
- When comparing action values, you should ignore differences lower than the threshold ($\theta$).
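
The second hint can be illustrated with a small, self-contained sketch. The helper `smallest_near_optimal` is a hypothetical name, and breaking ties in favour of the smallest stake is an assumed convention, not something the pseudocode prescribes. The action values in `q` are made up to show how floating-point noise can mislead a plain `np.argmax`.

```python
import numpy as np

THETA = 1e-12   # same tolerance as the notebook constant

def smallest_near_optimal(q, tol=THETA):
    """Index of the first action whose value is within tol of the maximum."""
    return int(np.flatnonzero(q >= q.max() - tol)[0])

# Illustrative action values: the last three are equal up to rounding noise.
q = np.array([0.25, 0.4, 0.4 - 1e-15, 0.4 + 1e-15])

print(int(np.argmax(q)))          # picks the entry that is largest only by noise
print(smallest_near_optimal(q))   # treats the near-ties as equal
```

One thing to keep in mind when applying this to the gambler's problem: a stake of 0 leaves the capital unchanged, so its action value equals the current $v(s)$ and it will tie with the best stake once the values have converged. You may want to skip it when extracting the policy.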

***

In [None]:
delta = THETA
i = 0
plt.figure(figsize=(10,7))

while delta >= THETA:

    # TODO: Complete the algorithm!

    # Add the resulting value function to the plot
    plt.plot(np.arange(1, 100), v_estimations[1:100])
    i += 1

# Plot state-value functions
plt.xlabel('Capital')
plt.ylabel('Value')
plt.grid(alpha=0.8, linestyle=':', zorder=0)
plt.title(f'State-value function after {i} sweeps')
plt.show()

## Optimal policy
The resulting policy is shown below. The x-axis corresponds to the state (the current capital), and the y-axis shows the optimal stake in that state. As you can see, the graph has a peculiar, self-similar shape. Can you explain why? Think about how you would approach this problem, and keep the value of `PROB_HEADS` in mind!

In [None]:
# Plot policy
plt.figure(figsize=(10,7))
plt.bar(np.arange(101), policy)
plt.xlabel('Capital')
plt.ylabel('Stake')
plt.title('Policy')
plt.grid(alpha=0.8, linestyle=':', zorder=0)
plt.show()