# VTHacks 8: Deep Reinforcement Learning
### GitHub Repo: https://github.com/AndrewAF1/vthacks-rl-workshop
### Presenters: Andrew Farabow (aafarabow@vt.edu), Patrick Riley (rileyp@vt.edu)

## What is reinforcement learning?

### Short Answer
Using intelligent trial and error to solve problems

### Long Answer 
A reinforcement learning problem consists of an agent that interacts with an environment in discrete timesteps. At each timestep the agent is given an observation $s$, takes an action $a$ based on its policy $\pi$, and receives a reward $r$. The policy, which defines the agent's behavior, is a function that maps the observed state to a probability distribution over the (discrete or continuous) set of possible actions. The defining characteristic of deep reinforcement learning is that the policy is represented by a neural network, which is generally trained via some variant of stochastic gradient descent to maximize expected reward.

To make a challenge learnable via RL, one must define an observation space and an action space. These specify the shape and range of the values that can make up an observation and an action. The observation space consists of the features of the game the agent learns to make decisions from, and the action space is determined by which parts of the environment we let the agent control.
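The interaction loop described above can be sketched in a few lines of plain Python. This is only an illustration, not the environment used in this workshop: `ToyEnv`, `random_policy`, and `run_episode` are hypothetical names made up here, and the "game" is an invented one-dimensional walk.

```python
import random


class ToyEnv:
    """Hypothetical 1-D walk: start at 0, the episode ends when |position| == 3."""

    def reset(self):
        self.pos = 0
        return self.pos  # the observation s

    def step(self, action):
        # Action 0 moves left, action 1 moves right
        self.pos += 1 if action == 1 else -1
        done = abs(self.pos) == 3
        reward = 1.0 if self.pos == 3 else 0.0  # only the right edge pays off
        return self.pos, reward, done


def random_policy(s):
    """A policy maps a state to a probability distribution over actions."""
    return [0.5, 0.5]  # uniform over {left, right}; this one ignores s


def run_episode(env, policy, rng):
    s = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        probs = policy(s)
        a = rng.choices([0, 1], weights=probs)[0]  # sample an action from the policy
        s, r, done = env.step(a)                   # environment returns s, r
        total_reward += r
        steps += 1
    return total_reward, steps


total, steps = run_episode(ToyEnv(), random_policy, random.Random(0))
print(total, steps)
```

Learning would mean replacing `random_policy` with something that shifts probability toward moving right, since that is the only way to collect reward here.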

### Translation
RL involves giving an "agent" a goal (along with a way of measuring progress toward that goal) and allowing it to figure out how to reach it on its own. It does this by playing the game repeatedly and trying to improve its score by combining exploration (trying new things) with exploitation (doing what it has learned works well).
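One common way to mix exploration and exploitation is epsilon-greedy action selection, which is also what the training loop later in this notebook does. Here is a minimal sketch in plain Python; the function name `epsilon_greedy` and the example value estimates are assumptions made up for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon try a random action, otherwise take the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: any action
    # exploit: index of the largest estimated value
    return max(range(len(q_values)), key=lambda i: q_values[i])


q_values = [0.2, 1.5]  # hypothetical value estimates for two actions
always_greedy = epsilon_greedy(q_values, epsilon=0.0)  # never explores, picks action 1
always_random = epsilon_greedy(q_values, epsilon=1.0)  # always explores
```

A small epsilon means the agent mostly exploits what it knows; a large one means it spends more time trying things out.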


## What does the "deep" part mean?
There are many ways to solve RL problems. "Deep" means that a neural network is used to represent the "brain" of the algorithm - it encodes the method by which the agent makes decisions and allows that method to be updated as new data becomes available.

## What we are gonna get done today
* solve a classic reinforcement learning problem, Cartpole
* explore a basic RL algorithm, Deep Q-Learning
* tune various parameters and see how that affects learning

## Tools we are using to get there
* Python (NOTE: ADD LINKS LATER)
* PyTorch
* OpenAI Gym
* Visdom

![Cartpole](https://i.redd.it/sqjzj2cgnpt21.gif)

## Let's start the code!

Before we get to the cool stuff, let's give ourselves the ability to visualize our model's improvement with a tool called Visdom, from Facebook Research. If you are following along, you need to open a terminal and run `visdom` (NOTE: ADD INSTRUCTIONS FOR WINDOWS AND PIPENV)

In [None]:
import numpy as np
from visdom import Visdom

viz = Visdom()

win = None

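def update_viz(ep, ep_reward, algo):
    #Plots the reward of episode `ep` on a line chart named after `algo`
    global win

    if win is None:
        #First call: create the window and set the title and axis labels
        win = viz.line(
            X=np.array([ep]),
            Y=np.array([ep_reward]),
            win=algo,
            opts=dict(
                title=algo,
                fillarea=False,
                xlabel='Episodes',
                ylabel='Reward'
            )
        )
    else:
        #Later calls: append the new point to the existing line
        viz.line(
            X=np.array([ep]),
            Y=np.array([ep_reward]),
            win=win,
            update='append'
        )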

The code below creates a neural network in PyTorch. The `nn.Linear(...)` lines are fully connected layers and the `nn.ReLU()` lines are activation functions, which allow the network to represent non-linear functions. The network takes an observation as input and outputs one value per possible action. The `forward` function gets called when you want the output of the model for some data (called a forward pass).

In [None]:
import torch
import torch.nn as nn

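class Q(nn.Module):
    def __init__(self, env):
        super().__init__()

        #Input: one value per observation feature; output: one Q-value per action
        self.main = nn.Sequential(
            nn.Linear(env.observation_space.shape[0], 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, env.action_space.n)
        )

    def forward(self, s):
        #Accepts NumPy arrays (from the environment) or tensors (from the replay buffer)
        return self.main(torch.as_tensor(s, dtype=torch.float32))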
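
To see why the activation functions matter, here is a tiny worked example using scalar "layers" in plain Python instead of PyTorch. The helper names (`linear`, `relu`, `stacked_linear`, `with_relu`) and the weights are made up for illustration. Two linear layers stacked directly still give a straight line, while putting a ReLU between them produces a bend.

```python
def linear(w, b, x):
    """A one-input, one-output 'fully connected layer': y = w*x + b."""
    return w * x + b


def relu(x):
    """ReLU activation: negative inputs become 0, positive inputs pass through."""
    return max(0.0, x)


def stacked_linear(x):
    # Two linear layers with nothing in between collapse to -2x + 1
    return linear(-1.0, 1.0, linear(2.0, 0.0, x))


def with_relu(x):
    # Same two layers, but with a ReLU between them
    return linear(-1.0, 1.0, relu(linear(2.0, 0.0, x)))


xs = [-1.0, 0.0, 1.0]
print([stacked_linear(x) for x in xs])  # [3.0, 1.0, -1.0], a straight line
print([with_relu(x) for x in xs])       # [1.0, 1.0, -1.0], flat then sloped
```

The second output cannot come from any single line `w*x + b`, which is the kind of extra flexibility the `nn.ReLU()` layers give the network above.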

In [1]:
import gym
import numpy as np
import torch.nn.functional as F
import random
from copy import deepcopy

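The next cell creates the CartPole environment and sets the hyperparameters. `epsilon` is the probability of taking a random action during training, `gamma` is the discount factor applied to future rewards, and `tau` controls how slowly the target network follows the main network.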

In [None]:
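algo_name = 'DQN'
env = gym.make('CartPole-v0')
#Probability of taking a random action (epsilon-greedy exploration)
epsilon = .01
#Discount factor for future rewards
gamma = .99
#Fraction of the target network's weights kept at each soft update
tau = .995
#Seeds Python's random module only
random.seed(666)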
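
The effect of a discount factor like `gamma` is easiest to see with numbers. The sketch below is plain Python and `discounted_return` is a hypothetical helper, not part of the workshop code; it adds up a list of rewards where each later reward counts `gamma` times as much as the one before it.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75

# With gamma = 0.99, as in this notebook, distant rewards still count for a lot
long_horizon = discounted_return([1.0] * 200, 0.99)
```

A `gamma` close to 1 makes the agent care about the long run, which suits a task where the goal is to keep the pole up for as long as possible.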

NOTE: REMOVE TARGET NETWORK AND EXPLAIN

In [None]:
#Performs one gradient step on q using a mini-batch from the replay buffer
def update():
    #m is 1 for non-terminal transitions and 0 for terminal ones
    s, a, r, s2, m = rb.sample(batch_size)

    #Target: reward plus the discounted best Q-value of the next state
    with torch.no_grad():
        max_next_q, _ = q_target(s2).max(dim=1, keepdim=True)
        y = r + m*gamma*max_next_q
    #Compare the target with the Q-value of the action that was actually taken
    loss = F.mse_loss(torch.gather(q(s), 1, a.long()), y)

    #Update q
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    #Update q_target: keep a fraction tau of it and move the rest toward q
    for param, target_param in zip(q.parameters(), q_target.parameters()):
        target_param.data = target_param.data*tau + param.data*(1-tau)
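
The two formulas inside `update` can be checked by hand on single numbers. The sketch below uses plain Python floats instead of tensors, and `td_target` and `soft_update` are hypothetical helper names that mirror the lines `y = r + m*gamma*max_next_q` and `target_param.data*tau + param.data*(1-tau)`.

```python
def td_target(r, not_done, gamma, next_q_values):
    """Reward plus the discounted best next-state value (zeroed at episode end)."""
    return r + not_done * gamma * max(next_q_values)


def soft_update(target_weights, online_weights, tau):
    """Keep a fraction tau of each target weight and blend in (1 - tau) of the online one."""
    return [tau * t + (1 - tau) * w for t, w in zip(target_weights, online_weights)]


# Mid-episode transition: the future counts, 1 + 0.99 * 3 is about 3.97
mid_episode = td_target(1.0, 1.0, 0.99, [2.0, 3.0])

# Terminal transition: the mask removes the future term, leaving only the reward
terminal = td_target(1.0, 0.0, 0.99, [2.0, 3.0])

# With tau = 0.995 the target weight barely moves toward the online weight
new_target = soft_update([1.0], [0.0], 0.995)
print(mid_episode, terminal, new_target)
```

Because the target network changes this slowly, the value `y` that the main network is trained toward stays fairly stable from one update to the next.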


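Before training starts, we fill the replay buffer with some experience. The `explore` function in the next cell plays episodes using randomly sampled actions and stores every transition until it has collected the requested number of timesteps.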

In [2]:
#Fills the replay buffer by taking random actions for the specified number of timesteps
def explore(timestep):
    ts = 0
    while ts < timestep:
        s = env.reset()
        while True:
            a = env.action_space.sample()
            s2, r, done, _ = env.step(int(a))
            rb.store(s, a, r, s2, done)
            ts += 1
            if done:
                break
            else:
                s = s2

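The replay buffer in the next cell stores past transitions (state, action, reward, next state, and a not-done flag) in a fixed-size queue and returns random mini-batches of them as tensors for training. Once the buffer is full, the oldest transitions are discarded.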

In [None]:
import random
from collections import deque

class ReplayBuffer():
    def __init__(self, size):
        self.buffer = deque(maxlen=int(size))
        self.maxSize = size
        self.len = 0

    def sample(self, count):
        count = min(count, self.len)
        batch = random.sample(self.buffer, count)

        s_arr = torch.FloatTensor(np.array([arr[0] for arr in batch]))
        a_arr = torch.FloatTensor(np.array([arr[1] for arr in batch]))
        r_arr = torch.FloatTensor(np.array([arr[2] for arr in batch]))
        s2_arr = torch.FloatTensor(np.array([arr[3] for arr in batch]))
        m_arr = torch.FloatTensor(np.array([arr[4] for arr in batch]))

        return s_arr, a_arr.unsqueeze(1), r_arr.unsqueeze(1), s2_arr, m_arr.unsqueeze(1)

    #Named __len__ because the self.len attribute would shadow a method called len
    def __len__(self):
        return self.len

    def store(self, s, a, r, s2, d):
        #Convert scalars to NumPy arrays so every field has a uniform type
        def fix(x):
            if not isinstance(x, np.ndarray):
                return np.array(x)
            return x

        #Store 1 - d so that terminal transitions get a mask of 0
        data = [s, np.array(a,dtype=np.float64), r, s2, 1 - d]
        transition = tuple(fix(x) for x in data)
        self.len = min(self.len + 1, self.maxSize)
        self.buffer.append(transition)
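
The `ReplayBuffer` above relies on two standard-library behaviors: a `deque` with `maxlen` silently drops its oldest item when full, and `random.sample` draws items without repeats. A minimal sketch with made-up transitions (the strings and numbers here are placeholders, not real CartPole data):

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # room for only three transitions

for t in range(5):
    buffer.append(("s%d" % t, t))  # placeholder (state, reward) pairs

# The two oldest entries (t = 0 and t = 1) have been pushed out
print(list(buffer))  # [('s2', 2), ('s3', 3), ('s4', 4)]

# Uniform random mini-batch without replacement, as in ReplayBuffer.sample
batch = random.sample(buffer, 2)
```

Sampling at random rather than in order breaks up the strong correlation between consecutive timesteps, which is the usual motivation for using a replay buffer.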

NOTE: BREAK INTO MANY CELLS, ADD RENDER TO LOOP, AND EXPLAIN

In [None]:
q = Q(env)
q_target = deepcopy(q)

optimizer = torch.optim.Adam(q.parameters(), lr=1e-3)
max_ep = 1000

batch_size = 128
rb = ReplayBuffer(1e6)

explore(10000)
ep = 0
while ep < max_ep:
    s = env.reset()
    ep_r = 0
    while True:
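        with torch.no_grad():
            #Epsilon-greedy exploration: random action with probability epsilon
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                #Otherwise pick the action with the highest predicted Q-value
                a = int(torch.argmax(q(s)))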
        #Get the next state, reward, and info based on the chosen action
        s2, r, done, _ = env.step(int(a))
        rb.store(s, a, r, s2, done)
        ep_r += r

        #If it reaches a terminal state then break the loop and begin again, otherwise continue
        if done:
            update_viz(ep, ep_r, algo_name)
            ep += 1
            break
        else:
            s = s2

        update()

NOTE: CONCLUSION HERE