
# Part 1: Two-armed Bandit [[Link]](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149)

Reinforcement learning needs a different mindset than typical supervised learning: the 'answer' space is much broader. There's no single 'correct' action for an agent to take, but we can still find ways to learn.

The **two-armed bandit**, or more broadly the $n$-armed bandit, is one of the simplest RL problems. We have $n$ slot machines, each with some payout probability, so we have to find the best machine and then maximize our reward by always choosing it. With only two machines the problem is quite simple, and only the first of the aspects below really applies to it. Aspects found in many other RL problems include:
* Different actions yield different rewards.
* Rewards are delayed over time, so we won't immediately know the value of our actions.
* The reward for an action depends on the current state of the environment.

Learning which actions are best, and making sure we choose them, is called learning a **policy**. In this section, we'll be using a method called **policy gradients**, where a simple neural network uses gradient descent to learn which actions to pick. An alternative would be learning **value functions**, where our agent learns to predict how good a given state or action will be (the value of the state/action).
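
To make the value-function alternative concrete, here is a minimal sketch that estimates each arm's value as the running average of its rewards. The payout probabilities, the pull count, and the names `estimate_action_values` and `payout_probs` are my own assumptions for illustration, not something from the tutorial.

```python
import random

def estimate_action_values(payout_probs, pulls_per_arm=2000, seed=0):
    """Estimate each arm's value as the running mean of its observed rewards."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    estimates = [0.0] * len(payout_probs)
    for arm, p in enumerate(payout_probs):
        for n in range(1, pulls_per_arm + 1):
            reward = 1 if rng.random() < p else -1  # reward of +1 or -1
            # Incremental mean: Q <- Q + (reward - Q) / n
            estimates[arm] += (reward - estimates[arm]) / n
    return estimates

payout_probs = [0.3, 0.7]  # hypothetical two-armed bandit
values = estimate_action_values(payout_probs)
best_arm = max(range(len(values)), key=lambda i: values[i])
print("estimated values:", values, "-> best arm:", best_arm)
```

A value-based agent would then act on these estimates directly, whereas the policy-gradient approach in these notes adjusts the action weights themselves.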

**Policy Gradient**

In the simplest case, the network outputs one explicit weight for each possible arm, and we pick the arm with the highest weight. To gather experience for updating the network, we'll try arms using an $\epsilon$-greedy policy. See Part 0 for my notes on this algorithm, but it's quite simple: pick a random arm with probability $\epsilon$, otherwise pick the highest-weight arm.
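
The $\epsilon$-greedy rule just described can be sketched in a few lines. The function name `epsilon_greedy` and the example weights are hypothetical, chosen only for illustration.

```python
import random

def epsilon_greedy(weights, epsilon, rng=random):
    """Pick a random arm with probability epsilon, else the highest-weight arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))  # explore: uniform random arm
    # Exploit: index of the largest weight (ties go to the first such arm).
    return max(range(len(weights)), key=lambda i: weights[i])

example_weights = [0.1, 0.9, 0.3]  # made-up weights for three arms
print(epsilon_greedy(example_weights, epsilon=0.0))  # always greedy when epsilon is 0
print(epsilon_greedy(example_weights, epsilon=1.0))  # always random when epsilon is 1
```

With a small $\epsilon$ the agent mostly exploits its current weights but still occasionally samples the other arms.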

We'll give our agent a reward of either -1 or 1, and then update the network with the equation:

$\text{Loss}=-\log(\pi)\cdot A$, where $A$ is the **advantage** and $\log$ is the natural logarithm. The advantage is a central quantity in many RL algorithms; it corresponds to how much better an action was than some baseline. For now we assume the baseline is 0, so the advantage is just the reward we receive. $\pi$ is our policy, which here means the weight of the chosen action.

Consider this loss function. Say we chose a good action with high confidence: reward 1, weight 0.8. Thus $A=1,\pi=0.8\implies\text{Loss}=-\log(0.8)\cdot 1\approx 0.22$.

As for high confidence, bad reward: $A=-1,\pi=0.8\implies\text{Loss}=-\log(0.8)\cdot(-1)\approx -0.22$.

For low confidence, good reward: $A=1,\pi=0.1\implies\text{Loss}=-\log(0.1)\cdot 1\approx 2.3$.
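
As a quick check of the three cases above, the loss can be evaluated directly. This is only a sketch; `policy_loss` is a helper name I made up, and it uses the natural logarithm.

```python
import math

def policy_loss(pi, advantage):
    """Loss = -log(pi) * A, using the natural logarithm."""
    return -math.log(pi) * advantage

cases = [
    ("high confidence, good reward", 0.8, 1),
    ("high confidence, bad reward", 0.8, -1),
    ("low confidence, good reward", 0.1, 1),
]
for label, pi, advantage in cases:
    print(f"{label}: loss = {policy_loss(pi, advantage):.2f}")
```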

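Gradient descent lowers the loss. When the advantage is positive, that means raising the weight (a larger $\pi$ makes $-\log(\pi)$ smaller); when it is negative, it means lowering the weight. The push is strongest when the weight is small. So the agent will increase the weight for actions with positive reward, choosing those actions more frequently in the future.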
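
To see the direction of the update, here is a hand-written gradient-descent step on the loss for a single weight. It relies on $\frac{d}{d\pi}\left(-\log(\pi)\cdot A\right)=-A/\pi$; the name `gradient_step` and the learning rate are assumptions for illustration.

```python
def gradient_step(weight, advantage, learning_rate=0.001):
    """One gradient-descent step on Loss = -log(weight) * advantage."""
    grad = -advantage / weight  # derivative of the loss w.r.t. the weight
    return weight - learning_rate * grad

print(gradient_step(0.8, 1))   # good reward: weight moves up
print(gradient_step(0.8, -1))  # bad reward: weight moves down
print(gradient_step(0.1, 1))   # low weight, good reward: larger move up
```

The step size scales with $1/\pi$, so a rarely trusted action that pays off gets a bigger boost than one the agent already favors.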

# Learning Some TensorFlow

It's at this point that the reinforcement learning tutorial throws some code below the article and calls it a day. I have no idea how to use TensorFlow placeholders or optimizers; my TensorFlow experience is minimal. I'll go through a few sources on TensorFlow basics here before tackling the problem at hand again.

# Back to the RL Tutorial

Now let's write the code for this problem:

In [0]:
import random
import tensorflow as tf
import numpy as np

bandits = [0.1,0.4,0.7,0.99]
num_bandits = len(bandits)

def pullBandit(bandit):
  # Returns +1 with probability `bandit` (the arm's payout probability), otherwise -1.
  r = random.random()
  if r < bandit:
    return 1
  else:
    return -1
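
Since each pull returns +1 with probability $p$ and -1 otherwise, the expected reward of an arm is $p\cdot 1+(1-p)\cdot(-1)=2p-1$. The sketch below checks this empirically. It uses a seeded copy of the reward rule, and the names `pull_bandit` and `mean_reward` are my own, separate from the notebook's `pullBandit`.

```python
import random

def pull_bandit(p, rng):
    """Return +1 with probability p, otherwise -1 (same rule as pullBandit)."""
    return 1 if rng.random() < p else -1

def mean_reward(p, pulls=5000, seed=0):
    """Average reward over many pulls of one arm with payout probability p."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return sum(pull_bandit(p, rng) for _ in range(pulls)) / pulls

for p in [0.1, 0.4, 0.7, 0.99]:
    print(f"p={p}: empirical {mean_reward(p):+.3f}, expected {2 * p - 1:+.3f}")
```

Note that the arms with $p<0.5$ have a negative expected reward, so the agent should learn to avoid them.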

In [22]:
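class PolicyGradient_ep:
  """A policy gradient agent that chooses epsilon-greedy actions."""

  def __init__(self, num_actions, ep):
    # Unfinished stub: only stores its arguments so the cell parses.
    self.num_actions = num_actions  # number of arms
    self.ep = ep                    # exploration probability (epsilon)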

In [0]:
import tensorflow as tf
import numpy as np

tf.reset_default_graph()

# These two lines establish the feed-forward part of the network. This does the actual choosing.
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights,0)

# The next six lines establish the training procedure. We feed the reward and chosen action into the network
# to compute the loss, and use it to update the network.
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)


In [21]:
total_episodes = 1000 #Set total number of episodes to train agent on.
total_reward = np.zeros(num_bandits) #Set scoreboard for bandits to 0.
e = 0.01 #Set the chance of taking a random action.

init = tf.global_variables_initializer()  # replaces the deprecated tf.initialize_all_variables()

# Launch the tensorflow graph
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < total_episodes:
        
        #Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(chosen_action)
        
        reward = pullBandit(bandits[action]) #Get our reward from picking one of the bandits.
        
        #Update the network.
        _,resp,ww = sess.run([update,responsible_weight,weights], feed_dict={reward_holder:[reward],action_holder:[action]})
        
        #Update our running tally of scores.
        total_reward[action] += reward
        if i % 50 == 0:
            print("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
        i+=1
print("The agent thinks bandit " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(np.array(bandits)):
    print("...and it was right!")
else:
    print("...and it was wrong!")

Running reward for the 4 bandits: [-1.  0.  0.  0.]
Running reward for the 4 bandits: [-1. -2. -1. 43.]
Running reward for the 4 bandits: [-1. -2. -1. 89.]
Running reward for the 4 bandits: [ -1.  -2.  -1. 139.]
Running reward for the 4 bandits: [ -1.  -2.  -1. 185.]
Running reward for the 4 bandits: [ -1.  -2.  -1. 233.]
Running reward for the 4 bandits: [ -1.  -2.  -1. 279.]
Running reward for the 4 bandits: [ -1.  -2.  -1. 329.]
Running reward for the 4 bandits: [ -1.  -2.   0. 376.]
Running reward for the 4 bandits: [ -1.  -1.   0. 421.]
Running reward for the 4 bandits: [ -1.  -1.   0. 471.]
Running reward for the 4 bandits: [ -1.   0.   0. 518.]
Running reward for the 4 bandits: [ -1.   0.   0. 564.]
Running reward for the 4 bandits: [ -2.   0.   0. 611.]
Running reward for the 4 bandits: [ -3.   0.   0. 658.]
Running reward for the 4 bandits: [ -3.   0.   0. 706.]
Running reward for the 4 bandits: [ -3.   0.   0. 756.]
Running reward for the 4 bandits: [ -4.   0.   0. 803.]
...
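
To help read the TensorFlow version, here is a rough plain-Python sketch of what I understand the graph and loop above to compute: the gradient of $-\log(w_a)\cdot r$ with respect to the chosen weight $w_a$ is $-r/w_a$, so each step changes only that one weight. The function name `train_bandit_agent` and the seeded generator are my own assumptions; this is not the tutorial's code.

```python
import random

def train_bandit_agent(payout_probs, episodes=1000, epsilon=0.01,
                       learning_rate=0.001, seed=0):
    """Epsilon-greedy policy-gradient loop for the bandit, without TensorFlow."""
    rng = random.Random(seed)
    weights = [1.0] * len(payout_probs)       # mirrors tf.ones([num_bandits])
    total_reward = [0.0] * len(payout_probs)  # scoreboard per arm
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.randrange(len(weights))  # explore
        else:
            action = max(range(len(weights)), key=lambda i: weights[i])  # exploit
        reward = 1 if rng.random() < payout_probs[action] else -1
        # Gradient descent on -log(w[action]) * reward: w <- w + lr * reward / w
        weights[action] += learning_rate * reward / weights[action]
        total_reward[action] += reward
    return weights, total_reward

weights, totals = train_bandit_agent([0.1, 0.4, 0.7, 0.99])
best = max(range(len(weights)), key=lambda i: weights[i])
print("weights:", weights)
print("running reward:", totals, "-> most promising bandit:", best + 1)
```

Because every weight starts at 1 and exploration is rare, a loop like this can settle on any arm that keeps paying out early, not necessarily the best one, so results may vary from seed to seed.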

The above code is from the RL tutorial; I need to learn a few pieces of TensorFlow before reading through it. Links for doing so:
* https://appdividend.com/2019/02/06/tensorflow-variables-and-placeholders-tutorial-with-example/
* https://www.edureka.co/blog/tensorflow-tutorial/

That's next time. Last revised 6/14/2019