# Deep Reinforcement Learning
In this lab we will implement and train an agent that uses deep learning to balance the pole in `CartPole-v1`.

## Setup
----
We import the packages we need: `gym`, several `torch` modules, and a few helpers.

Imports:

In [12]:
import gym
import random
import numpy

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from collections import deque  # for memory
from tqdm import tqdm          # for progress bar

How the game looks (without our agent):

In [13]:
def rnd_game():
    env = gym.make('CartPole-v1', render_mode='human')
    for _ in tqdm(range(10)):
        state, _ = env.reset()
        done = False
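        while not done:
            action = env.action_space.sample()
            # step returns two end-of-episode flags: terminated (failure) and truncated (time limit)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated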
    env.close()

# rnd_game()

## DQN Algorithm
-------------
We train a policy that tries to maximize the discounted,
cumulative reward
$R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$, where
$R_{t_0}$ is the *return*. The discount factor $\gamma$ is a constant between $0$ and $1$ that makes rewards in the far future count less than rewards in the near future.
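
As a small worked example of the return defined above, the sketch below sums a finite list of rewards with a discount. The helper name `discounted_return` is hypothetical and is not used elsewhere in the lab; it assumes the episode has already ended, so the infinite sum reduces to a finite one.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards."""
    total = 0.0
    # Walk backwards so that each step applies R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# CartPole gives a reward of 1 per step; three steps with gamma = 0.9
# give 1 + 0.9 + 0.81, which is approximately 2.71.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```

With `gamma` close to 1 the agent values long episodes much more than short ones; with `gamma` close to 0 it mostly cares about the immediate reward.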


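Q-learning tries to find a function
$Q^*: State \times Action \rightarrow \mathbb{R}$ that gives the return we would obtain by taking a given action in a given state and acting optimally afterwards. With such a function, the policy that maximizes rewards is: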

$ \begin{align}\pi^*(s) = \arg\!\max_a \ Q^*(s, a)\end{align} $
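
To make the arg max concrete, here is a minimal sketch that picks the greedy action from the Q-values of a single state. The function name `greedy_action` and the numbers are made up for illustration; only the action meaning (0 pushes the cart left, 1 pushes it right) comes from CartPole itself.

```python
def greedy_action(q_values):
    """Index of the largest Q-value, i.e. argmax_a Q(s, a) for one state."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-values for CartPole's two actions (0 = push left, 1 = push right).
q_for_state = [0.37, 1.25]
print(greedy_action(q_for_state))  # 1
```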

However, we don't know $ Q^* $. Since neural networks are universal function approximators, we can simply create one and train it to resemble $ Q^* $.

For our training update rule, we'll use the fact that every $ Q $
function for some policy obeys the Bellman equation:

$ \begin{align}Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))\end{align} $

The difference between the two sides of the equality is known as the temporal difference error, $ \delta $. For the greedy policy, $ \pi(s') = \arg\!\max_{a'} Q(s', a') $, so it becomes:

$ \begin{align}\delta = Q(s, a) - (r + \gamma \max_{a'} Q(s', a'))\end{align} $
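
The temporal difference error can be checked by hand on one transition. The sketch below is only an illustration with made-up numbers; `td_error` is a hypothetical helper, and the rule that a terminal transition uses the reward alone as its target mirrors what the agent's training code does further down.

```python
def td_error(q_sa, reward, next_q_values, gamma, done):
    """delta = Q(s, a) - (r + gamma * max_a' Q(s', a')) for one transition."""
    # On a terminal transition there is no next state to bootstrap from.
    target = reward if done else reward + gamma * max(next_q_values)
    return q_sa - target


# Target = 1 + 0.5 * 3 = 2.5, so delta = 2.0 - 2.5 = -0.5.
print(td_error(q_sa=2.0, reward=1.0, next_q_values=[1.5, 3.0], gamma=0.5, done=False))  # -0.5

# Terminal transition: target = 1, so delta = 2.0 - 1.0 = 1.0.
print(td_error(q_sa=2.0, reward=1.0, next_q_values=[1.5, 3.0], gamma=0.5, done=True))  # 1.0
```

Training pushes this error towards zero, which is why the loss used later is the mean squared difference between the prediction and the target.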

### Model
---
Make a deep-learning-based policy model that takes in a state and outputs one Q-value per action; the chosen action is the one with the highest Q-value.
This model will be an attribute of the Agent we make next.

In [46]:
class Model(nn.Module):
    def __init__(self, observation_size, action_size):
        super(Model, self).__init__()
        # initialise layers here
        self.layer1 = nn.Linear(observation_size, 128)    # create dense layer 1
        self.layer2 = nn.Linear(128, 128)                 # create dense layer 2
        self.layer3 = nn.Linear(128, action_size)         # create dense layer 3
 
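    # x is one state (1-D) or a batch of states (2-D), given as a numpy array or a tensor
    def forward(self, x):
        # send x through the network

        # as_tensor converts numpy arrays and passes existing tensors through unchanged
        x = torch.as_tensor(x, dtype=torch.float32)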
        x = self.layer1(x)
        x = F.relu(x)
        x = self.layer2(x)
        x = F.relu(x)
        x = self.layer3(x)
        
        return x

    def predictBestOutValue(self, x):
        x = self.forward(x)                   # send x through neural net
        # reduce over the last (action) dimension so single states and batches both work
        return torch.max(x, dim=-1).values    # highest Q-value of each state

    def predictBestAction(self, x):
        x = self.forward(x)                   # send x through neural net
        return torch.argmax(x, dim=-1)        # index of the best action of each state


### DQN Agent
----
We will be using experience replay memory for training our model.
An Agent's memory is as important as its model, and will be another attribute of our agent.
Other appropriate attributes are the hyperparameters (gamma, lr, etc.).
Give the agent a replay method that trains on a batch from its memory.
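
A minimal sketch of the replay-memory idea, separate from the agent, is shown below. The class `ReplayMemory` is a hypothetical helper and not part of the lab code; it assumes a fixed capacity and uniform random sampling, and relies on `collections.deque` with `maxlen` to discard the oldest transition once the buffer is full.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of transitions (hypothetical helper for illustration)."""

    def __init__(self, capacity):
        # A deque with maxlen evicts the oldest item on append once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push(step, 0, 1.0, step + 1, False)

print(len(memory))                     # 3
print([t[0] for t in memory.buffer])   # [2, 3, 4]
```

Sampling random batches instead of training on the latest transitions reduces the correlation between consecutive training examples.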


In [45]:
class Agent:
    def __init__(self, observation_size, action_size):

        self.observation_size=observation_size
        self.action_size = action_size

        self.criterion = nn.MSELoss()
        self.model = Model(observation_size, action_size)
        self.optimizer = optim.Adam(self.model.parameters(), lr=1e-3)

        # replay memory that keeps only the N most recent transitions;
        # a deque with maxlen drops the oldest entry automatically once it is full
        self.memory_size = 1024
        self.memory = deque(maxlen=self.memory_size)
        
        # good place to store hyperparameters as attributes
        self.gamma = 0.9        # discount: weight of the best next-state Q-value in the target y
        self.epsilon = 0.8      # probability of a random action while training

    def remember(self, state, action, reward, next_state, done):
        sars = {"state": state,
                "action": action,
                "reward": reward,
                "next_state": next_state,
                "done": done}
        # add to memory; the oldest transition is evicted once memory_size is reached
        self.memory.append(sars)

    def act(self, state):
        return self.model.predictBestAction(state)

    def replay(self, batch_size):
        # update model based on replay memory
        batch = (random.sample(self.memory, batch_size))
        self.train(batch)

    def train(self, batch):
        # stack the sampled transitions into arrays, one row per transition
        states = numpy.array([sars["state"] for sars in batch])
        new_states = numpy.array([sars["next_state"] for sars in batch])
        actions = torch.as_tensor(numpy.array([sars["action"] for sars in batch]), dtype=torch.int64)
        rewards = torch.as_tensor(numpy.array([sars["reward"] for sars in batch]), dtype=torch.float32)
        dones = torch.as_tensor(numpy.array([sars["done"] for sars in batch]), dtype=torch.bool)

        self.optimizer.zero_grad()                                      # clean gradients of parameters
        q_values = self.model.forward(states)                           # shape (batch, actions)
        pred = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)      # take the Q value of the action

        # target y = r + gamma * max_a' Q(s', a'); terminal transitions use the reward alone
        with torch.no_grad():                                           # the target is treated as a constant
            best_next = self.model.forward(new_states).max(dim=1).values
        y = torch.where(dones, rewards, rewards + self.gamma * best_next)

        loss = self.criterion(pred, y)          # calculate loss with respect to prediction
        loss.backward()                         # calculate gradients of model.parameters() with respect to loss
        self.optimizer.step()                   # update parameters with respect to gradients



### Main Training loop
---
Make a function that takes an environment and an agent, and runs through $ n $ episodes.
Remember to call that agent's replay function to learn from its past (once it has a past).


In [57]:
def train(env, agent: Agent, episodes=1000, batch_size=64):
    epsilon = agent.epsilon + 0.1
    gamma = agent.gamma
    for i in tqdm(range(episodes)):
        epsilon -= agent.epsilon/episodes
        agent.gamma = gamma - gamma/(i+1)
        state, _ = env.reset()
        done = False
        count_direction = 0
        while not done:
            # 1. make a move in game.
            tradeoff = random.uniform(0,1)
            if tradeoff > epsilon:
                action = agent.model.predictBestAction(state).item()
            else:
                action = env.action_space.sample()
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated   # the episode ends on failure or on the time limit
            
            # if the pole angle changes sign, reset the counter
            if state[2] * new_state[2] < 0:
                count_direction = 0
            count_direction += 1
            # if the pole keeps a positive or negative angle for too long, remove the reward
            if count_direction > 5:
                reward = 0

            # 2. have the agent remember stuff; store only `terminated` so that
            #    time-limit cut-offs still bootstrap from the next state
            agent.remember(state, action, reward, new_state, terminated)

            # 3. update state
            state = new_state

            # 4. if we have enough experiences in our memory, learn from a batch with replay.
            if len(agent.memory) >= batch_size:
                agent.replay(batch_size)
            
            
    env.close()
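
The exploration schedule used in the loop above can be looked at in isolation. The sketch below reproduces the same arithmetic (start at `agent.epsilon + 0.1`, subtract `agent.epsilon / episodes` at the start of every episode) together with the epsilon-greedy choice; the function names `epsilon_for_episode` and `choose_action` and their default arguments are hypothetical and only serve this illustration.

```python
import random


def epsilon_for_episode(i, episodes, base=0.8, offset=0.1):
    """Exploration probability during 0-based episode i, as in the loop above."""
    return base + offset - base * (i + 1) / episodes


def choose_action(q_values, epsilon, n_actions, rng=random):
    """Epsilon-greedy: exploit when the random draw exceeds epsilon, else explore."""
    if rng.random() > epsilon:
        return max(range(n_actions), key=lambda a: q_values[a])
    return rng.randrange(n_actions)


# Epsilon decays linearly from just under 0.9 down to 0.1 over the run.
for i in (0, 249, 499):
    print(i, round(epsilon_for_episode(i, episodes=500), 4))
```

Early episodes are therefore mostly random, while the last episodes follow the model about 90% of the time.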

### Putting it together
---
We train an agent on the environment:


In [58]:
env = gym.make('CartPole-v1', render_mode='human')
agent = Agent(env.observation_space.shape[0], env.action_space.n)
train(env, agent, 500, 64)
torch.save(agent.model.state_dict(), 'modelCartPole.pth')

100%|██████████| 500/500 [39:25<00:00,  4.73s/it]


Test its performance

In [55]:
import time

def test(env, agent: Agent, episodes = 10):
    start = time.time()
    states = 0
    for _ in tqdm(range(episodes)):
        state, _ = env.reset()
        done = False
        while not done:
            # Make a move in game
            action = agent.model.predictBestAction(state).item()
            new_state, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated   # also stop when the time limit is reached
            # Update state
            state = new_state
            states += 1

    end = time.time()
    avg_time = (end - start) / episodes
    avg_states = (states) / episodes
    env.close()
    return avg_time, avg_states



In [None]:
env = gym.make('CartPole-v1', render_mode='human')
agent = Agent(env.observation_space.shape[0], env.action_space.n)
agent.model.load_state_dict(torch.load('modelCartPole.pth'))
agent.model.eval()
avg_test, avg_states = test(env, agent, 10)
print("Average time for this model is:", avg_test)
print("Average states for this model is:", avg_states)

# stopped after 10 minutes
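
One thing worth checking when an evaluation loop runs for a very long time is how the end of an episode is detected. In gym 0.26 and later, `env.step` returns two separate flags, `terminated` (the task failed or succeeded) and `truncated` (a time limit was hit; `CartPole-v1` truncates after 500 steps). A loop that only watches `terminated` may keep going if the pole never falls. The sketch below illustrates the pattern with `ToyEnv`, a hypothetical stand-in environment that never terminates on its own; it is not a real gym class.

```python
class ToyEnv:
    """Hypothetical stand-in for a gym environment that only ends by time limit."""

    def __init__(self, max_steps):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0.0, {}

    def step(self, action):
        self.steps += 1
        terminated = False                         # the "pole" never falls
        truncated = self.steps >= self.max_steps   # the time limit ends the episode
        return float(self.steps), 1.0, terminated, truncated, {}


def run_episode(env):
    """Play one episode and return its length in steps."""
    state, _ = env.reset()
    done = False
    steps = 0
    while not done:
        state, reward, terminated, truncated, _ = env.step(0)
        done = terminated or truncated             # stop on either flag
        steps += 1
    return steps


print(run_episode(ToyEnv(max_steps=500)))  # 500
```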

Compare with random play

In [9]:
def rnd_game_test(env, episodes=10):
    start = time.time()
    for _ in tqdm(range(episodes)):
        state, _ = env.reset()
        done = False
        while not done:
            # Make a move in game
            action = env.action_space.sample()
            new_state, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated   # also stop when the time limit is reached
            # Update state
            state = new_state

    end = time.time()
    avg_time = (end - start) / episodes
    env.close()
    return avg_time

env = gym.make('CartPole-v1', render_mode='human')
rnd_test = rnd_game_test(env, 10)
print("Average time for this model is:", avg_test)
print("Average time for random play is:", rnd_test)

100%|██████████| 10/10 [00:04<00:00,  2.39it/s]

Average time for this model is: 1.0760772705078125
Average time for random play is: 0.4184708833694458





## Optional (highly recommended): Atari
Adapt your agent to play an Atari game of your choice.
https://www.gymlibrary.dev/environments/atari/air_raid/