# This is Akshay's Beginning RL Tutorial  

The theory isn't covered here, just code and explanation.

To download gym, make a .sh file and put the following:  
git clone https://github.com/openai/gym  
cd gym  
pip install -e . # minimal install

In [None]:
import gym
import tensorflow as tf
import numpy as np
import time


This cell is just to print all the different environments on OpenAI gym  
Cool, so now we can pick whatever we want  

In [None]:
#for envs in gym.envs.registry.all():
#    print(envs)

This cell makes our environment, CartPole-v0, and resets it.
  
If you get an **error** like  
  
**"gym.spaces.Box autodetected dtype as `<class 'numpy.float32'>`. Please provide explicit dtype"**  
  
then just `cd gym/spaces` and, in `box.py`, change the `dtype` argument of `__init__` to `np.float32`.  

In [None]:
env = gym.make('CartPole-v0')
observation=env.reset()

The next cell just prints the number of possible actions and one such action

In [None]:
#print(env.action_space)
#print(env.action_space.sample())

## Don't "RENDER" in Jupyter Notebooks
### Just make a python file and run it

In [None]:
for t in range(100):
    #env.render()
    # Render has been commented out
    print(observation)
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    time.sleep(0.1)
    if done:
        env.close()
        break

# OK, We finished the setup phase
## Now let us move on to making classes to control the RL
  
The basic procedure is the same  
  
**1) First make a Harness class to use once you made your agent.  
2) Next make an Agent class which has parameters and a policy function. (parameters are for your policy)  
3) Finally make an external or internal method to train.**

This Harness class is used to harness the power of our agent and its environment.  
It basically runs one episode when an agent and an env are passed in, and returns the total reward.

In [None]:
class Harness:

    def run_episode(self, env, agent):
        observation = env.reset()
        total_reward = 0
        for _ in range(1000):
            action = agent.next_action(observation)
            observation, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
        return total_reward
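
To see what the harness does without installing gym, here is a minimal sketch that runs it against a toy environment. `ToyEnv` and `RandomAgent` are hypothetical stand-ins made up for this example (they are not part of gym); the only thing they share with the real thing is the `reset()` / `step()` / `next_action()` interface the harness expects. `Harness` is repeated so the snippet runs on its own.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a gym env: every step pays 1.0."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0]  # same length as a CartPole observation

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.episode_length
        # Same 4-tuple the harness unpacks: observation, reward, done, info
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


class RandomAgent:
    """Hypothetical agent that ignores the observation."""

    def next_action(self, observation):
        return random.choice([0, 1])


class Harness:
    # Same logic as the Harness class in this tutorial
    def run_episode(self, env, agent):
        observation = env.reset()
        total_reward = 0
        for _ in range(1000):
            action = agent.next_action(observation)
            observation, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
        return total_reward


total = Harness().run_episode(ToyEnv(episode_length=5), RandomAgent())
print(total)  # 5.0: five steps, each worth 1.0
```

The harness never looks inside the agent or the env, so anything with these three methods can be plugged in.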

## A Linear Agent has a linear function approximation of the Policy
OK, OK, what do we mean by that?  
Well, a policy is something that gives us the action we need to take when we give it our **state**.  
  

$\Pi^*(s) = \arg\max_a E[Q(s,a)]$ = the best action  
  
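The linear agent takes in a vector of size 4 (the observation) and takes its dot product with its own parameter vector of size 4. If the result is negative it picks action 0, otherwise action 1.  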
## Note: Because this is a continuous space problem, we didn't use discrete DP.

In [None]:
class LinearAgent:

    def __init__(self):
        # rand(4) * 2 - 1 draws each parameter uniformly from [-1, 1);
        # we use it because we know that numbers around 1 will do the trick
        self.parameters = np.random.rand(4) * 2 - 1

    def next_action(self, observation):
        return 0 if np.matmul(self.parameters, observation) < 0 else 1
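
Here is a small worked example of that dot product. The weights and the observation are made-up numbers, just to show the arithmetic; `linear_policy` is a hypothetical helper doing the same thing as `next_action`. In CartPole the observation is (cart position, cart velocity, pole angle, pole angular velocity), action 0 pushes the cart left and action 1 pushes it right.

```python
import numpy as np


def linear_policy(parameters, observation):
    """Return 0 if the dot product is negative, else 1."""
    return 0 if np.dot(parameters, observation) < 0 else 1


parameters = np.array([0.5, -0.2, 0.8, 0.1])        # made-up weights
observation = np.array([0.02, -0.30, 0.05, 0.40])   # made-up CartPole state

# 0.5*0.02 + (-0.2)*(-0.30) + 0.8*0.05 + 0.1*0.40
# = 0.01 + 0.06 + 0.04 + 0.04 = 0.15
score = np.dot(parameters, observation)
action = linear_policy(parameters, observation)

print(score, action)  # score is about 0.15, so the action is 1
```

Flipping the sign of every weight flips the sign of the score, so the same state would then give action 0.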

### For any training function
Steps  
**1) Make the env  
2) Init params, rewards, agent, harness  
3) For loop: the harness runs the agent, we look at the reward, and update the params**

In [None]:
def random_search():
    env = gym.make('CartPole-v0')
    best_params = None
    best_reward = 0
    agent = LinearAgent()
    harness = Harness()

    for step in range(10000):
        agent.parameters = np.random.rand(4) * 2 - 1
        reward = harness.run_episode(env, agent)
        if reward > best_reward:
            best_reward = reward
            best_params = agent.parameters
            if reward == 200:
                print('200 achieved on step {}'.format(step))

    print(best_params)

In [None]:
def hill_climbing():
    env = gym.make('CartPole-v0')
    noise_scaling = 0.1
    best_reward = 0
    reward = 0
    agent = LinearAgent()
    harness = Harness()

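    for step in range(10000):
        # Copy the parameters: `old_params = agent.parameters` would alias the
        # same array, and an in-place `+=` would then change both
        old_params = agent.parameters.copy()
        agent.parameters = old_params + noise_scaling * (np.random.rand(4) * 2 - 1)
        reward = harness.run_episode(env, agent)
        if reward > best_reward:
            best_reward = reward
            print('Step: {}, New Record: {}, Policy: {}'.format(step, best_reward, agent.parameters))
        else:
            # No improvement: go back to the previous parameters
            agent.parameters = old_params

        # CartPole-v0 caps an episode at 200 steps, so 200 is the maximum reward
        if best_reward == 200:
            break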
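
Hill climbing only works if we can really go back to the previous parameters when a perturbation doesn't help. With NumPy arrays that needs some care: plain assignment does not copy, and `+=` updates the array in place. A minimal sketch with made-up numbers:

```python
import numpy as np

params = np.array([0.1, 0.2, 0.3, 0.4])

alias = params            # same underlying array, NOT a copy
snapshot = params.copy()  # independent copy

params += 1.0             # in-place update, also visible through `alias`

print(alias is params)    # True: both names point at the same array
print(alias)              # shows the updated values
print(snapshot)           # still the original values
```

So to be able to undo a step, save the old parameters with `.copy()`.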

## Now let us make an MAB environment
Here we will not use the gym library.  
  
**This is to show how we don't need gym to do work**

In [None]:
class MultiArmedBandit:

    def __init__(self):
        self.bandit = [0.2, 0.0, 0.1, -4.0]
        self.num_actions = 4

    def pull(self, arm):
        return 1 if np.random.randn(1) > self.bandit[arm] else -1

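This is an excellent way to introduce probabilities. `np.random.randn` draws X from a standard normal distribution, so an arm with threshold x0 pays +1 with probability P[X > x0]. Basically, to get a certain prob we just have to pick an x0 where P[X > x0] = the value we want. A lower threshold means a better arm.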
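
To make this concrete, here is a small sketch that works out the chance of a +1 reward for each arm of the bandit above. It uses the standard normal tail formula P[X > x0] = 0.5 * erfc(x0 / sqrt(2)); `win_probability` is a hypothetical helper name, not something from the tutorial's code.

```python
import math


def win_probability(threshold):
    """P[X > threshold] for X drawn from a standard normal distribution."""
    return 0.5 * math.erfc(threshold / math.sqrt(2))


bandit = [0.2, 0.0, 0.1, -4.0]  # same thresholds as MultiArmedBandit
probs = [win_probability(t) for t in bandit]

for arm, p in enumerate(probs):
    print('arm {}: P(reward = +1) = {:.4f}'.format(arm, p))

best_arm = probs.index(max(probs))
print('best arm:', best_arm)  # 3, the arm with the lowest threshold
```

The arm with threshold -4.0 pays +1 almost every time, while the others are close to a coin flip or worse.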

## OK, now we will introduce tensorflow into our agent's policy function
  
How are we using tensorflow here?  
We start off by keeping placeholders for the reward at time t-1 and the action we did at t-1, plus a variable for the weights. **We are not multiplying the weight matrix in this example.** Here, we will just pick out the **single number** (the weight of the chosen action) that will be modified by the loss function. We obtain this single number with `tf.slice`.  
  
Instead of doing that slice stuff, you can just use one-hot encoding and matmul, as shown in the contextual bandit.
  
The predict function just returns the best action. `random_or_predict` is $\epsilon$-greedy. The train function runs the optimiser.  
This is like 4 different NNs, where each activates only when we want it to.
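
To see why this loss pushes the weights the right way, here is the gradient worked out by hand. This is only a sketch of the intuition: $w_a$ (the weight of the pulled arm), $r$ (the reward) and $\eta$ (a learning rate) are notation introduced here, and the update shown is a plain gradient descent step, whereas the code uses Adam, which rescales and smooths the step.

$$
L = -r \log w_a
\qquad\Rightarrow\qquad
\frac{\partial L}{\partial w_a} = -\frac{r}{w_a}
\qquad\Rightarrow\qquad
w_a \leftarrow w_a + \eta \, \frac{r}{w_a}
$$

So for $w_a > 0$, a reward of $+1$ raises the weight of the arm we pulled and a reward of $-1$ lowers it. Note that $\log w_a$ is only defined for $w_a > 0$.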

In [None]:
class Agent:

    def __init__(self, actions=4):
        self.num_actions = actions
        self.reward_in = tf.placeholder(tf.float32, [1], name='reward_in')
        self.action_in = tf.placeholder(tf.int32, [1], name='action_in')

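        # Start every weight at 1: the default initializer can produce negative
        # values, and tf.log of a negative number is NaN
        self.W = tf.get_variable('W', [self.num_actions], initializer=tf.ones_initializer())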
        self.best_action = tf.argmax(self.W, axis=0)

        action_weight = tf.slice(self.W, self.action_in, [1])
        policy_loss = -(tf.log(action_weight) * self.reward_in)
        self.optimizer = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(policy_loss)

    def predict(self, sess):
        return sess.run(self.best_action)

    def random_or_predict(self, sess, epsilon):
        if np.random.rand(1) < epsilon:
            return np.random.randint(self.num_actions)
        else:
            return self.predict(sess)

    def train(self, sess, action, reward):
        # The optimiser will calculate the gradient and do one update.
        # This is like mini batch training with a batch of size 1
        sess.run(self.optimizer, {
            self.action_in: [action],
            self.reward_in: [reward]
            })


Now we need to write the training loop. 

In [None]:
env = MultiArmedBandit()
agent = Agent()
EPSILON = 0.1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer()) # This is init function of TF
    for _ in range(100000):
        action = agent.random_or_predict(sess, EPSILON)
        reward = env.pull(action)
        agent.train(sess, action, reward)
    
    # results time
    # The best arm is the one with the lowest threshold, hence argmin
    print(np.argmin(np.array(env.bandit)))
    # The line below prints the arm our agent predicts.
    # If it matches the one above then cool
    print(agent.predict(sess))

I observed that even if we run the code for 50,000 iterations, it doesn't converge. We get outputs like `3 2`, meaning the true best arm is 3 but the agent predicts 2.

## Now for Contextual Bandit
This now has states and a `get_bandit` function. We need this function because our RL agent needs to sense its environment. It needs to know which state it is in.

In [None]:
class ContextualBandit:

    def __init__(self):
        self.active_bandit = 0  # state
        self.bandits = np.array([
            [0.2, 0.0, 0.1, -4.0],  # 4th arm best
            [0.1, -5.0, 1.0, 0.25],  # 2nd arm best
            [-3.5, 2.0, 3.2, 6.4]  # 1st arm best
        ])
        self.num_bandits, self.num_actions = self.bandits.shape

    
    def get_bandit(self):
        self.active_bandit = np.random.randint(0, self.num_bandits)
        return self.active_bandit

    def pull(self, arm):
        bandit = self.bandits[self.active_bandit, arm]
        return 1 if np.random.randn(1) > bandit else -1
 

Now for the agent.  
Instead of doing the slice thing as before, we just do 
```python
tf.matmul(context_one_hot, W)
```
which is followed by the sigmoid and then the argmax function.  
OK, but why the extra sigmoid before argmax? Can't we just argmax without the sigmoid?  
Yeah, you can: sigmoid is monotonic, so it doesn't change which entry is the largest. We just wanted it to look like an NN. The sigmoid does matter for the loss, though, because it keeps the value passed to `tf.log` positive.
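
Both points can be checked with plain NumPy, no TensorFlow needed. This is just a sketch: the weight values are made up, and `sigmoid` is written by hand here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Made-up weights, shape (contexts, actions) = (3, 4)
W = np.array([
    [0.3, -1.2, 0.8, 2.0],
    [1.5, 0.1, -0.4, 0.0],
    [-0.7, 0.9, 0.2, -2.0],
])

context = 1
context_one_hot = np.eye(3)[[context]]  # shape (1, 3), a batch of one

# Multiplying a one-hot row vector by W just picks out row `context` of W
row = np.matmul(context_one_hot, W)     # shape (1, 4)
print(np.allclose(row[0], W[context]))  # True

# Sigmoid is monotonic, so argmax is the same with or without it
best_raw = np.argmax(row, axis=1)[0]
best_sigmoid = np.argmax(sigmoid(row), axis=1)[0]
print(best_raw, best_sigmoid)           # 0 0
```

So the one-hot matmul is the slice from before in disguise, and the sigmoid leaves the chosen action unchanged.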

## Don't simply run the code below. Take it from the github link of the original code.

In [None]:
class Agent1:

    def __init__(self, learning_rate=1e-3, contexts=3, actions=4):
        self.num_actions = actions
        
        self.reward_in = tf.placeholder(tf.float32, [1], name='reward_in')
        self.context_in = tf.placeholder(tf.int32, [1], name='context_in')
        self.action_in = tf.placeholder(tf.int32, [1], name='action_in')

        # sess.run(best_action) to calculate the best action
        context_one_hot = tf.one_hot(self.context_in, contexts)
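        # Use a different variable name than Agent above: calling
        # tf.get_variable('W', ...) twice in the same graph raises a ValueError
        W = tf.get_variable('W_contextual', [contexts, actions])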
        
        self.output = tf.nn.sigmoid(tf.matmul(context_one_hot, W))
        self.best_action = tf.argmax(self.output, axis=1)

        # sess.run(optimizer) to update the best action
        a_ = tf.reduce_sum(self.output * tf.one_hot(self.action_in, actions))
        self.loss = -(tf.log(a_) * self.reward_in)
        self.optimizer = tf.train.AdamOptimizer(
            learning_rate=learning_rate).minimize(self.loss)

    def predict(self, sess, context):
        # sess.run returns a one-element array here, so we take [0]
        # at the end to get the value itself
        # ex. if a=[1] then a[0]=1
        return sess.run(self.best_action, {self.context_in: [context]})[0]

    def random_or_predict(self, sess, epsilon, context):
        if np.random.rand(1) < epsilon:
            return np.random.randint(self.num_actions)
        else:
            return self.predict(sess, context)

    def train(self, sess, context, action, reward):
        sess.run(self.optimizer, {
            self.action_in: [action],
            self.reward_in: [reward],
            self.context_in: [context]
        })


The rest is pretty straightforward.  

In [None]:
env1 = ContextualBandit()
agent1 = Agent1()
num_episodes = 300000
epsilon1 = 0.1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for ep in range(num_episodes):
        context = env1.get_bandit()
        action = agent1.random_or_predict(sess, epsilon1, context)
        reward = env1.pull(action)
        # feed state, action, reward back to the policy network
        agent1.train(sess, context, action, reward)
        if ep % 500 == 0:
            loss = sess.run(agent1.loss, {
                agent1.action_in: [action],
                agent1.reward_in: [reward],
                agent1.context_in: [context]
            })
            print('Step {}, Loss={}'.format(ep, loss))
    
    # results time
    print(np.argmin(env1.bandits, axis=1))
    print('Best arm for Bandit 1:')
    print(agent1.predict(sess, 0))

    print('Best arm for Bandit 2:')
    print(agent1.predict(sess, 1))

    print('Best arm for Bandit 3:')
    print(agent1.predict(sess, 2))
