<a href="https://colab.research.google.com/github/Benned-H/Summer2019/blob/master/Simple_Reinforcement_Learning_with_Tensorflow/Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1: Two-armed Bandit [[Link]](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149)

Reinforcement learning calls for a different mindset than typical supervised learning: the 'answer' space is much broader. There is no single 'correct' action for an agent to take, but we can still find ways to learn.

The **two-armed bandit**, or more broadly the $n$-armed bandit, is one of the simplest RL problems. We have $n$ slot machines, each with some payout probability, and we want to find the best machine and then maximize our reward by always choosing it. With two machines the problem is quite simple: of the aspects below, which show up in many other RL problems, the bandit involves only the first, since its rewards are immediate and there is no state to track. The aspects are:
* Different actions yield different rewards.
* Rewards are delayed over time, so we won't immediately know the value of our actions.
* The reward for an action depends on the current state of the environment.

The goal of learning which actions are best and ensuring we choose such actions is called learning a **policy**. In this section, we'll be using a method called **policy gradients**, where a simple ANN uses gradient descent to learn which actions to pick. An alternative to this would be learning **value functions**, where our agent learns to predict how good a given state or action will be (the value of the state/action).

**Policy Gradient**

In the simplest case, suppose our network produces explicit outputs. We can ask the network for an output weight for each possible arm to pull, and we'll pick the arm with the highest weight. To update the network, we'll try arms using an $\epsilon$-greedy policy. See Part 0 for my notes on this algorithm, but it's quite simple: pick a random arm with probability $\epsilon$, otherwise pick the arm with the highest weight.
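As a rough illustration of the $\epsilon$-greedy rule described above, here is a minimal sketch in plain Python. The helper name `choose_arm` and the example weights are assumptions made for this illustration, not part of the original tutorial code.

```python
import random

def choose_arm(weights, epsilon):
  # Hypothetical helper: epsilon-greedy choice over a list of arm weights.
  if random.random() < epsilon:
    # Explore: pick any arm uniformly at random.
    return random.randrange(len(weights))
  # Exploit: pick the arm with the highest current weight.
  return max(range(len(weights)), key=lambda i: weights[i])

weights = [0.2, 0.8]  # assumed example weights for two arms
print(choose_arm(weights, epsilon=0.0))  # epsilon = 0 always exploits, so this picks arm 1
```

With `epsilon=0.1`, roughly one pull in ten would be a random arm, which keeps the agent trying arms whose weights are currently low.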

We'll give our agent a reward of either -1 or 1, and then update the network with the equation:

$\text{Loss}=-\log(\pi)*A$, where $A$ is the **advantage**. The advantage is central to many RL algorithms and measures how much better an action was than some baseline. For now we assume the baseline is 0, so the advantage will just be the reward we receive. $\pi$ is our policy, which here means the weight of the chosen action.

Consider this loss function (using the natural log). Say we chose a good action with high confidence: reward 1, weight 0.8. Thus $A=1,\pi=0.8\implies\text{Loss}=-\log(0.8)*1\approx 0.22$.

As for high confidence, bad reward: $A=-1,\pi=0.8\implies\text{Loss}=-\log(0.8)*(-1)\approx -0.22$.

For low confidence, good reward: $A=1,\pi=0.1\implies\text{Loss}=-\log(0.1)*1\approx 2.3$.
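The three cases can be checked numerically with a few lines of Python. This is only a sketch: the function name `policy_loss` is made up for this check, and it assumes the natural log.

```python
import math

def policy_loss(pi, advantage):
  # Loss = -log(pi) * A, with log taken as the natural log.
  return -math.log(pi) * advantage

cases = [
  ("high confidence, good reward", 0.8, 1),
  ("high confidence, bad reward", 0.8, -1),
  ("low confidence, good reward", 0.1, 1),
]
for label, pi, advantage in cases:
  print(f"{label}: {policy_loss(pi, advantage):.2f}")
# high confidence, good reward: 0.22
# high confidence, bad reward: -0.22
# low confidence, good reward: 2.30
```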

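Gradient descent lowers the loss, so a positive advantage pushes $\pi$ up and a negative advantage pushes it down. The size of the push scales with $1/\pi$, so a rewarded action with a low weight gets a larger correction. We see that the agent will increase the weight for actions with positive reward, choosing those actions more frequently in the future.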
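To make the direction of the update concrete, here is a small sketch of a single gradient descent step on the loss. It assumes the chosen weight is itself the trainable parameter, so that $\frac{d\,\text{Loss}}{d\pi}=-A/\pi$; the function name `gradient_step` and the learning rate of 0.01 are illustrative assumptions.

```python
def gradient_step(weight, advantage, learning_rate=0.01):
  # d/dw of -log(w) * A is -A / w.
  grad = -advantage / weight
  # Gradient descent moves against the gradient.
  return weight - learning_rate * grad

for weight, advantage in [(0.8, 1), (0.8, -1), (0.1, 1)]:
  new_weight = gradient_step(weight, advantage)
  direction = "up" if new_weight > weight else "down"
  print(f"weight {weight}, advantage {advantage:+d}: moves {direction} by {abs(new_weight - weight):.4f}")
```

The low-confidence, good-reward case moves the furthest, because the gradient magnitude $1/\pi$ is largest when the weight is small.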

Now let's write the code for this problem:

In [0]:
import random

# Payout probability of each arm (four arms here, rather than two).
bandits = [0.1,0.4,0.7,0.99]

def pullBandit(bandit):
  # Returns a reward of 1 with probability `bandit`, and -1 otherwise.
  r = random.random()
  if r < bandit:
    return 1
  else:
    return -1
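As a quick sanity check on the environment, the sketch below pulls each arm many times and compares the average reward with its expected value $2p-1$ (a reward of 1 with probability $p$, and -1 otherwise). The seed and the sample size `num_pulls` are arbitrary assumptions, and `pullBandit` is repeated so the snippet runs on its own.

```python
import random

bandits = [0.1, 0.4, 0.7, 0.99]

def pullBandit(bandit):
  # Same logic as the cell above, repeated so this snippet is self-contained.
  return 1 if random.random() < bandit else -1

random.seed(0)     # fixed seed, only so repeated runs give the same numbers
num_pulls = 10000  # assumed sample size

averages = []
for p in bandits:
  rewards = [pullBandit(p) for _ in range(num_pulls)]
  averages.append(sum(rewards) / num_pulls)

for p, avg in zip(bandits, averages):
  # Expected reward is p * 1 + (1 - p) * (-1) = 2p - 1.
  print(f"p={p}: average reward {avg:+.3f}, expected {2 * p - 1:+.3f}")
```

The last arm should come out clearly ahead, which is the arm we would hope the agent learns to prefer.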