## The Origins of Reinforcement Learning

Modern reinforcement learning grew out of three separate threads that eventually merged into the field we know today: optimal control, trial-and-error learning as studied in animal psychology, and, less prominently, temporal-difference methods. We start with optimal control.

In the mid-1950s, Richard Bellman and others tackled the problem of "optimal control", described as minimizing a measure of a constantly changing system's behavior over time. By combining the system's state with a value function that captures the return being optimized, they arrived at a functional equation that is now known as the Bellman equation.

This marks the beginning of what we now know as dynamic programming: solving a complex problem by breaking it down into subproblems and building on the solution of each smaller one. Bellman is also credited with introducing Markovian decision processes (MDPs), and Ronald Howard extended this work by devising the policy iteration method for them.
 
The next major thread is trial-and-error learning from the study of animal learning, an idea that, according to American psychologist R. S. Woodworth, goes back as far as the 1850s. One of the first to clearly articulate the concept of trial and error was Edward Thorndike, an American psychologist who worked extensively on comparative psychology and the learning process. He stated what is now known as the "Law of Effect", which describes how reinforcing events influence the tendency to select actions. Over time, the theory was adapted by, and laid the foundations for, many researchers in the field, such as Pavlov and B. F. Skinner.

In 1948, Alan Turing described a "pleasure-pain system," which was expanded on and became the basis for the work of animal psychology and reinforcement learning. 

However, in the preceding decades people often used "reinforcement learning" as a synonym for other types of learning (such as perceptual and supervised learning), and the resulting confusion led to a period in which the field developed slowly. There were some exceptions to this trend. During this period, the terms "reinforcement" and "reinforcement learning" were used in the scientific literature for the first time. It was also when Minsky's paper "Steps Toward Artificial Intelligence" raised the credit-assignment problem: "How do you distribute credit for success among the many decisions that may have been involved in producing it?" Many topics in this paper are still relevant today. Other examples are the system STeLLA by John Andreae and MENACE by Donald Michie.

One person in particular who is credited with reviving the field is Harry Klopf, who recognized that essential characteristics of "adaptive behavior" were being ignored. What he proposed was the drive to reach a goal in the environment: to steer toward a clearly desired outcome and away from an undesired one. Eventually, this push evolved into the clear distinction between supervised and reinforcement learning.

The third and last thread in the origins of reinforcement learning is temporal-difference learning. The origins of this concept lie in animal learning psychology, specifically in the idea of secondary reinforcers. A secondary reinforcer is a stimulus that has been paired with a primary reinforcer and, as a result, has a similar reinforcing effect.

More information can be found in:
- **Reinforcement Learning: An Introduction** 
2nd Edition Completed Draft, by Richard S. Sutton and Andrew G. Barto

In 1989, Chris Watkins's thesis brought together the threads discussed above in the development of Q-learning.



## States and Actions

![alt text](images/states_actions.png)

The first core concept we will cover is states and actions. Reinforcement learning is an agent-oriented type of machine learning: the agent learns from interacting with its environment rather than from a teacher in order to achieve its goal. This is similar to how humans learn through trial and error.

Take, for example, a person learning to navigate a maze. A state can be any crossroad they encounter, an action is the direction they choose to go from there, and the goal (reward) is reaching the end of the maze.

As the person navigates the maze, they will naturally discover that some paths are worse than others and that some never reach the end. Ideally, over time, they would learn to take the optimal path every time. This is the behavior we are trying to achieve.

## Markov Decision Process 

The next step, building on states and actions, is the Markov Decision Process (MDP). An MDP can be described as a tuple of five parts:
   
S - set of states   
A - set of actions   
P - probability that taking action *a* in state *s* at time *t* leads to state *s'* at time *t + 1*   
R - reward received after moving from state *s* to state *s'*   
$\gamma$ - discount factor that weights future rewards against present rewards
   
Each of these plays a role in determining a final "policy" $\pi$: a rule that says which action *a* will be taken in a given state *s*.
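
To make the tuple concrete, here is a minimal sketch of a two-state MDP written as plain Python data. The state names, action names, probabilities, and rewards are all invented for illustration and do not come from any library or from the figure below.

```python
import random

# Hypothetical two-state MDP, invented for illustration.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9  # discount factor

# P[(s, a)] maps each possible next state s' to its probability.
P = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# R[(s, a, s')] is the reward received for that transition.
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "hot"): 2.0,
    ("hot", "slow", "cool"): 1.0,
    ("hot", "slow", "hot"): 1.0,
    ("hot", "fast", "hot"): -10.0,
}

# A deterministic policy: one action per state.
policy = {"cool": "fast", "hot": "slow"}


def step(state, action, rng=random):
    """Sample (next_state, reward) according to P and R."""
    next_states = list(P[(state, action)])
    weights = [P[(state, action)][s] for s in next_states]
    next_state = rng.choices(next_states, weights=weights)[0]
    return next_state, R[(state, action, next_state)]


def rollout(start, n_steps, seed=0):
    """Follow the policy for n_steps and return the discounted return."""
    rng = random.Random(seed)
    state, total = start, 0.0
    for t in range(n_steps):
        next_state, reward = step(state, policy[state], rng)
        total += (GAMMA ** t) * reward
        state = next_state
    return total
```

Calling `rollout("cool", 20)` follows the policy for twenty steps; changing the entries of `policy` is the simplest way to see how a different rule changes the return.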

![alt text](images/markov.png)

This is the standard relationship between the agent and the environment in an MDP. The agent is the one who learns and makes decisions, while the environment is everything outside of the agent. The two interact continually: the agent selects actions, and the environment responds with rewards and new states.

*(Not sure if I want to include this or not, talks about why the distribution of S and R is only dependent on the prior state and action values).*
$$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$
   
What this equation states is that in a *finite* MDP there are a limited number of states, actions, and rewards. Because of this, the random variables $R_t$ and $S_t$ have a well-defined discrete probability distribution that depends only on the preceding state and action.
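
As a small check of what this dynamics function means, the sketch below stores $p(s', r \mid s, a)$ for a made-up finite MDP as a dictionary, verifies that the probabilities for each state-action pair sum to one, and derives two quantities from it. The states, actions, and numbers are hypothetical.

```python
# Hypothetical dynamics table: p[(s, a)][(s_next, r)] = probability.
p = {
    ("s0", "left"): {("s0", 0.0): 0.9, ("s1", 1.0): 0.1},
    ("s0", "right"): {("s1", 1.0): 1.0},
    ("s1", "left"): {("s0", 0.0): 1.0},
    ("s1", "right"): {("s1", 0.0): 0.5, ("s0", 5.0): 0.5},
}


def is_valid_dynamics(p, tol=1e-9):
    """Each (s, a) pair must define a probability distribution over (s', r)."""
    return all(abs(sum(dist.values()) - 1.0) < tol for dist in p.values())


def state_transition_prob(p, s, a, s_next):
    """Sum out the reward: p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(prob for (nxt, _r), prob in p[(s, a)].items() if nxt == s_next)


def expected_reward(p, s, a):
    """Expected reward: sum over (s', r) of r * p(s', r | s, a)."""
    return sum(r * prob for (_nxt, r), prob in p[(s, a)].items())
```

Nothing in the table refers to earlier history: the pair `(s, a)` is the only key, which is exactly the Markov property.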

# Cross Entropy Method

https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter04/01_cartpole.py

In [1]:
import gym
from collections import namedtuple
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim


HIDDEN_SIZE = 128
BATCH_SIZE = 16
PERCENTILE = 70


class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions)
        )

    def forward(self, x):
        return self.net(x)


Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

In [8]:
def iterate_batches(env, net, batch_size):
    # Generator that produces training batches.
    # It plays episodes by sampling actions from the network's current
    # output probabilities and records each episode's total reward and steps.
    # Once batch_size episodes have been collected,
    #   we yield them for training (in a for loop below)
    
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    sm = nn.Softmax(dim=1)
    while True:
        # cast to tensor
        obs_v = torch.FloatTensor([obs])
        # get action probabilities from the network
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        # generate an action based on probability
        action = np.random.choice(len(act_probs), p=act_probs)
        # take action in the environment and save obs, rewards, action
        next_obs, reward, is_done, _ = env.step(action)
        episode_reward += reward
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        
        if is_done:
            # at the end of the episode, save the model and response
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            # reset parameters for next episode
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            
            # if we have generated enough episodes for a batch, 
            #  yield them to an iterator
            if len(batch) == batch_size:
                yield batch
                batch = []
        obs = next_obs


def filter_batch(batch, percentile):
    # for each episode, get the reward
    rewards = list(map(lambda s: s.reward, batch))
    # get value of the best rewards
    reward_bound = np.percentile(rewards, percentile)
    reward_mean = float(np.mean(rewards))

    # for the best episodes, add actions/observations as training data
    train_obs = []
    train_act = []
    for example in batch:
        if example.reward >= reward_bound:
            # extend data arrays with obs and desired actions
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))

    train_obs_v = torch.FloatTensor(train_obs)
    train_act_v = torch.LongTensor(train_act)
    
    return train_obs_v, train_act_v, reward_bound, reward_mean

In [9]:
env = gym.make("CartPole-v0")

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)

In [10]:
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    # from yielded batch, get the best and use as training data
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    
    # reset gradient calculations in graph
    optimizer.zero_grad()
    action_scores_v = net(obs_v) # get what the network does
    loss_v = objective(action_scores_v, acts_v) # use CE to define best action
    # now back prop the gradient and update
    loss_v.backward()
    optimizer.step()
    
    if iter_no %5==0:
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))

    if reward_m > 199:
        print("Solved!")
        break


0: loss=0.677, reward_mean=16.7, reward_bound=17.5
5: loss=0.628, reward_mean=35.5, reward_bound=45.5
10: loss=0.612, reward_mean=60.4, reward_bound=75.0
15: loss=0.590, reward_mean=95.9, reward_bound=119.0
20: loss=0.564, reward_mean=108.0, reward_bound=122.5
25: loss=0.556, reward_mean=171.6, reward_bound=200.0
30: loss=0.553, reward_mean=192.2, reward_bound=200.0
35: loss=0.536, reward_mean=198.8, reward_bound=200.0
Solved!


In [12]:
from IPython.display import clear_output, display

#env = gym.make("CartPole-v0")

obs = env.reset()
sm = nn.Softmax(dim=1)
is_done = False
while not is_done:
    # convert to tensor
    obs_v = torch.FloatTensor([obs])
    # run through network to action probabilities
    act_probs_v = sm(net(obs_v))
    act_probs = act_probs_v.data.numpy()[0] # convert to numpy
    # sample action according to probabilities
    action = np.random.choice(len(act_probs), p=act_probs)
    # take the action
    obs, reward, is_done, _ = env.step(action)
    
    # display the cart
    clear_output(wait=True)
    result = env.render(mode="not_human")
    display(result)
        

True

In [6]:
env.close()  # calling this will end the current environment

# Using Cross Entropy on the Frozen Lake

In [13]:
# Wrap the environment so that we can change default observation
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        # change observation space to one hot encoded version 
        # we do this so that our neural network can stay the same
        # this defines the vector of length N, with values from 0.0 to 1.0
        # In the gym a box is like a tensor (ugh)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), 
                                                dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res

In [14]:
# Does it work on the frozen lake problem?
env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.01)

print(obs_size,n_actions)

16 4


In [15]:
env.reset()
env.render()
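# S: start, F: frozen, H: hole, G: goal; the agent's current cell is shown in brackets
# The lake is slippery: the agent moves in the intended direction with
# probability 1/3 and slips to each perpendicular direction with probability 1/3


[S]FFF
FHFH
FFFH
HFFG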


In [17]:
# same code as before, but with a different environment
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    
    obs_v, acts_v, reward_b, reward_m = filter_batch(batch, PERCENTILE)
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()
    
    if iter_no %5==0:
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))

    if reward_m > 0.8:
        print("Solved!")
        break
        
    if iter_no > 100:
        print("Failed to converge, reached max number of iterations")
        break

0: loss=1.261, reward_mean=0.1, reward_bound=0.0
5: loss=1.330, reward_mean=0.1, reward_bound=0.0
10: loss=1.376, reward_mean=0.0, reward_bound=0.0
15: loss=1.334, reward_mean=0.1, reward_bound=0.0
20: loss=1.334, reward_mean=0.0, reward_bound=0.0
25: loss=1.421, reward_mean=0.0, reward_bound=0.0
30: loss=1.302, reward_mean=0.0, reward_bound=0.0
35: loss=1.380, reward_mean=0.0, reward_bound=0.0
40: loss=1.283, reward_mean=0.0, reward_bound=0.0
45: loss=1.379, reward_mean=0.0, reward_bound=0.0
50: loss=1.352, reward_mean=0.0, reward_bound=0.0
55: loss=1.325, reward_mean=0.0, reward_bound=0.0
60: loss=1.336, reward_mean=0.0, reward_bound=0.0
65: loss=1.296, reward_mean=0.0, reward_bound=0.0
70: loss=1.293, reward_mean=0.0, reward_bound=0.0
75: loss=1.256, reward_mean=0.0, reward_bound=0.0
80: loss=1.134, reward_mean=0.1, reward_bound=0.0
85: loss=1.247, reward_mean=0.1, reward_bound=0.0
90: loss=1.265, reward_mean=0.1, reward_bound=0.0
95: loss=1.261, reward_mean=0.0, reward_bound=0.0
10

____

**Why was this not working?**

What can be done to solve this? Let's try using experience replay to get more stable estimates.

In [11]:
# what can be done???

# Let's make gamma not 1
GAMMA = 0.9 
def filter_batch_gamma(batch, percentile):
    disc_rewards = list(map(lambda s: s.reward * (GAMMA ** len(s.steps)), batch))
    reward_bound = np.percentile(disc_rewards, percentile)

    train_obs = []
    train_act = []
    elite_batch = []
    for example, discounted_reward in zip(batch, disc_rewards):
        if discounted_reward > reward_bound:
            train_obs.extend(map(lambda step: step.observation, example.steps))
            train_act.extend(map(lambda step: step.action, example.steps))
            elite_batch.append(example)

    return elite_batch, train_obs, train_act, reward_bound

In [21]:
# Also add some older batches into the mix (experience replay)

HIDDEN_SIZE = 128
BATCH_SIZE = 100 # was 16
PERCENTILE = 30 # was 70

net = Net(obs_size, HIDDEN_SIZE, n_actions)
objective = nn.CrossEntropyLoss()
optimizer = optim.Adam(params=net.parameters(), lr=0.0001)

full_batch = []  # elite episodes carried over from previous iterations
for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
    reward_m = float(np.mean(list(map(lambda s: s.reward, batch))))
    full_batch, obs, acts, reward_b = filter_batch_gamma(full_batch + batch, PERCENTILE)
    
    if not full_batch:
        continue
        
    if len(full_batch)>500:    
        full_batch = full_batch[-500:]
    
    obs_v = torch.FloatTensor(obs)
    acts_v = torch.LongTensor(acts)

    
    optimizer.zero_grad()
    action_scores_v = net(obs_v)
    loss_v = objective(action_scores_v, acts_v)
    loss_v.backward()
    optimizer.step()
    
    if iter_no % 200 ==0:
        print("%d: loss=%.3f, reward_mean=%.1f, reward_bound=%.1f" % (
            iter_no, loss_v.item(), reward_m, reward_b))
    

    if reward_m > 0.8:
        print("Solved!")
        break
        
    if iter_no > 10000:
        print("Failed to converge, reached max number of iterations")
        break

0: loss=1.365, reward_mean=0.0, reward_bound=0.0
200: loss=1.347, reward_mean=0.0, reward_bound=0.2
400: loss=1.288, reward_mean=0.0, reward_bound=0.3
600: loss=1.226, reward_mean=0.0, reward_bound=0.0
800: loss=1.172, reward_mean=0.0, reward_bound=0.2
1000: loss=1.104, reward_mean=0.0, reward_bound=0.3
1200: loss=1.077, reward_mean=0.1, reward_bound=0.4
1401: loss=1.052, reward_mean=0.1, reward_bound=0.2
1601: loss=1.027, reward_mean=0.0, reward_bound=0.3
1801: loss=0.987, reward_mean=0.1, reward_bound=0.4
2002: loss=0.961, reward_mean=0.1, reward_bound=0.3
2202: loss=0.962, reward_mean=0.0, reward_bound=0.0
2402: loss=0.933, reward_mean=0.1, reward_bound=0.4
2603: loss=0.908, reward_mean=0.1, reward_bound=0.0
2803: loss=0.865, reward_mean=0.1, reward_bound=0.2
3003: loss=0.862, reward_mean=0.1, reward_bound=0.4
3204: loss=0.855, reward_mean=0.1, reward_bound=0.4
3404: loss=0.839, reward_mean=0.1, reward_bound=0.2
3604: loss=0.843, reward_mean=0.1, reward_bound=0.4
3805: loss=0.773, r

KeyboardInterrupt: 

Hmm... It seems like even this simple problem is hard for cross entropy to solve. Perhaps we should go back to the basics of learning optimal policies? Yes! Let's see about value iteration.

**[Back to Slides]**

# Basics of Value Iteration

https://github.com/Shmuma/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter05/01_frozenlake_v_iteration.py

In [19]:
import gym
import collections

ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9
TEST_EPISODES = 20


class Val_Agent:
    def __init__(self, env):
        self.env = env
        # init reward, transitions, and value function
        self.state = self.env.reset()
        
        # we will use dictionaries to be efficient
        self.rewards = collections.defaultdict(float)
        self.transits = collections.defaultdict(collections.Counter)
        self.values = collections.defaultdict(float)

    def play_n_random_steps(self, count):
        # play this and save the observed rewards and actions
        for _ in range(count):
            # randomly sample the space
            # can get computational here, especially if we keep failing
            action = self.env.action_space.sample()  
            new_state, reward, is_done, _ = self.env.step(action)
            # track the reward from this action and states
            self.rewards[(self.state, action, new_state)] = reward
            # count observed transitions to estimate p_{a,s\rightarrow s'}
            self.transits[(self.state, action)][new_state] += 1
            # reset the environment if the episode ended
            self.state = self.env.reset() if is_done else new_state

    def calc_action_value(self, state, action):
        # estimate the value of taking `action` in `state` from the current Values
        target_counts = self.transits[(state, action)]
        total = sum(target_counts.values())
        action_value = 0.0
        for tgt_state, count in target_counts.items():
            reward = self.rewards[(state, action, tgt_state)]
            # action=\sum p_{a,s\rightarrow s'}(r+\gamma V(s'))
            action_value += (count / total) * (reward + GAMMA * self.values[tgt_state])
        return action_value

    def select_action(self, state):
        # for each action, get Value of next state and reward, then choose the best
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.calc_action_value(state, action)
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action

    def play_episode(self, render=False):
        total_reward = 0.0
        state = self.env.reset()
        while True:
            # follow our policy based on Value
            action = self.select_action(state)
            new_state, reward, is_done, _ = self.env.step(action)
            self.rewards[(state, action, new_state)] = reward
            self.transits[(state, action)][new_state] += 1
            total_reward += reward
            if render:
                self.env.render()
            if is_done:
                break
            state = new_state
        return total_reward

    def value_iteration(self):
        # update all the values
        for state in range(self.env.observation_space.n):
            state_values = [self.calc_action_value(state, action)
                            for action in range(self.env.action_space.n)]
            self.values[state] = max(state_values)

In [20]:
env = gym.make(ENV_NAME)
agent = Val_Agent(env)

iter_no = 0
best_reward = 0.0
while True:
    iter_no += 1
    agent.play_n_random_steps(100)
    agent.value_iteration()

    reward = 0.0
    for _ in range(TEST_EPISODES):
        reward += agent.play_episode()
    reward /= TEST_EPISODES
    if reward > best_reward:
        print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
        best_reward = reward
    if reward > 0.80:
        print("Solved in %d iterations!" % iter_no)
        break


Best reward updated 0.000 -> 0.100
Best reward updated 0.100 -> 0.300
Best reward updated 0.300 -> 0.350
Best reward updated 0.350 -> 0.500
Best reward updated 0.500 -> 0.600
Best reward updated 0.600 -> 0.650
Best reward updated 0.650 -> 0.700
Best reward updated 0.700 -> 0.850
Solved in 112 iterations!


In [22]:
agent.play_episode(render=True)

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FF[41mF[0mH
HFFG
  (Down)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FF[41mF[0mH
HFFG
  (Down)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH

1.0

# Basics of Value Iteration with Q-Function


In [24]:
class QAgent(Val_Agent):

    def select_action(self, state):
        best_action, best_value = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_action
    
    def value_iteration(self):
        for state in range(self.env.observation_space.n):
            for action in range(self.env.action_space.n):
                action_value = 0.0
                target_counts = self.transits[(state, action)]
                total = sum(target_counts.values())
                for tgt_state, count in target_counts.items():
                    reward = self.rewards[(state, action, tgt_state)]
                    best_action = self.select_action(tgt_state)
                    action_value += (count / total) * (reward + GAMMA * self.values[(tgt_state, best_action)])
                self.values[(state, action)] = action_value

In [26]:
agent = QAgent(env)

iter_no = 0
best_reward = 0.0
while True:
    iter_no += 1
    agent.play_n_random_steps(100)
    agent.value_iteration()

    reward = 0.0
    for _ in range(TEST_EPISODES):
        reward += agent.play_episode()
    reward /= TEST_EPISODES
    if reward > best_reward:
        print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
        best_reward = reward
    if reward > 0.8:
        print("Solved in %d iterations!" % iter_no)
        break



Best reward updated 0.000 -> 0.350
Best reward updated 0.350 -> 0.400
Best reward updated 0.400 -> 0.450
Best reward updated 0.450 -> 0.650
Best reward updated 0.650 -> 0.800
Best reward updated 0.800 -> 0.850
Solved in 22 iterations!


In [31]:
agent.play_episode(render=True)

  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[F]FFH
HFFG
  (Up)
SFFF
FHFH
[F]FFH
HFFG
  (Up)
SFFF
FHFH
[F]FFH
HFFG
  (Up)
SFFF
FHFH
F[F]FH
HFFG
  (Down)
SFFF
FHFH
FFFH
H[F]FG
  (Right)
SFFF
FHFH
FFFH
H[F]FG
  (Right)
SFFF
FHFH
FFFH
HF[F]G
  (Down)
SFFF
FHFH
FFFH
H[F]FG
  (Right)
SFFF
FHFH
FFFH
H[F]FG
  (Right)
SFFF
FHFH
FFFH
H[F]FG
  (Right)
SFFF
FHFH
FFFH
HF[F]G
  (Down)
SFFF
FHFH
FFFH
HF[F]G
  (Down)
SFFF
FHFH
FFFH
HFF[G]


1.0

# Basics of Tabular Q-learning

## Q-Learning

This brings us to building our first algorithm, Q-Learning. Given a state s, and an action a, the Q function returns an estimate of the total reward starting from s and taking a.

Let's go over the formula:   
 
$$Q(s_t, a_t) \leftarrow (1-\alpha)\cdot Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma\max_a Q(s_{t+1}, a)\right]$$
   
$\alpha$ - the learning rate, a value between 0 and 1 that indicates how much we update our values every time we take an action. Typically this value is kept small so that no single action is overrepresented. However, it can also be 1, so that the old $Q(s_t, a_t)$ term drops out (this is done in DQN).
    
$\gamma$ - the discount factor, which encourages an agent to seek a reward sooner rather than later; it is typically set between 0.9 and 0.99. A reward received *k* steps in the future is scaled by $\gamma^k$, so the same reward is worth less the longer the agent takes to reach it. The effect of the discount factor can be seen when the Bellman equation is expanded with $\alpha = 1$:

$$Q(s_t, a_t) = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \dots$$
$$Q(s_t, a_t) = r_{t+1} + \gamma(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \dots) = r_{t+1} + \gamma\max_a Q(s_{t+1}, a)$$
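
The two equations above write the same discounted return once as a direct sum and once as a recursion. A minimal sketch, assuming a fixed and entirely made-up reward sequence, shows that both forms give the same number:

```python
GAMMA = 0.9


def discounted_return(rewards, gamma=GAMMA):
    """Direct sum: first reward + gamma * second + gamma^2 * third + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def discounted_return_recursive(rewards, gamma=GAMMA):
    """Recursive form: first reward plus gamma times the return of the rest."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)


# Hypothetical episode: the only reward arrives on the third step.
rewards = [0.0, 0.0, 1.0]
direct = discounted_return(rewards)
recursive = discounted_return_recursive(rewards)
```

Here the reward of 1 arrives two steps late, so it is scaled by $\gamma^2$; the same reward on the first step would count in full.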

<img src="images/q.png" width="400">

Given this formula, apply it using the following steps:
1. Initialize *Q(s, a)* to arbitrary values for all states and actions.   
2. Make sure that, in the limit, every action *a* is tried in every state *s*.
3. At each time *t*, update one element, *Q(s_t, a_t)*.
4. Optionally reduce $\alpha$ over time to help the estimates settle.   
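
The steps above can be sketched without any external library on a toy problem. The four-state corridor, its reward of 1 at the right end, and the purely random behaviour policy are assumptions made for this illustration; step 4 (decaying $\alpha$) is left out to keep the sketch short.

```python
import random
from collections import defaultdict

GAMMA = 0.9
ALPHA = 0.2
N_STATES = 4       # hypothetical corridor: states 0..3, state 3 is the goal
ACTIONS = (0, 1)   # 0 = left, 1 = right


def env_step(state, action):
    """Deterministic toy dynamics; returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(n_steps=2000, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)             # step 1: initial values (zeros here)
    state = 0
    for _ in range(n_steps):
        action = rng.choice(ACTIONS)   # step 2: random actions visit every pair
        next_state, reward, done = env_step(state, action)
        # A terminal state has no future value.
        best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
        target = reward + GAMMA * best_next
        # step 3: update the single entry Q(s_t, a_t)
        q[(state, action)] = (1 - ALPHA) * q[(state, action)] + ALPHA * target
        state = 0 if done else next_state
    return q


def greedy_action(q, state):
    """Pick the action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


q_table = train()
```

After training, `greedy_action(q_table, s)` chooses "right" in every non-terminal state, and the learned values shrink by a factor of $\gamma$ for each extra step away from the goal.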



In [8]:
import gym
import collections

GAMMA = 0.9
ALPHA = 0.2
TEST_EPISODES = 20


class QLearningAgent:
    def __init__(self,env):
        self.env = env
        self.state = self.env.reset()
        self.values = collections.defaultdict(float)

    def sample_env(self):
        action = self.env.action_space.sample()
        old_state = self.state
        new_state, reward, is_done, _ = self.env.step(action)
        self.state = self.env.reset() if is_done else new_state
        return (old_state, action, reward, new_state)

    def best_value_and_action(self, state):
        best_value, best_action = None, None
        for action in range(self.env.action_space.n):
            action_value = self.values[(state, action)]
            if best_value is None or best_value < action_value:
                best_value = action_value
                best_action = action
        return best_value, best_action

    def value_update(self, s, a, r, next_s):
        best_v, _ = self.best_value_and_action(next_s)
        new_val = r + GAMMA * best_v
        old_val = self.values[(s, a)]
        self.values[(s, a)] = old_val * (1-ALPHA) + new_val * ALPHA

    def play_episode(self, env, render=False):
        total_reward = 0.0
        state = env.reset()
        while True:
            _, action = self.best_value_and_action(state)
            new_state, reward, is_done, _ = env.step(action)
            total_reward += reward
            
            if render:
                env.render()
                
            if is_done:
                break
            state = new_state
        return total_reward

In [9]:
ENV_NAME = "FrozenLake-v0"
test_env = gym.make(ENV_NAME)
train_env = gym.make(ENV_NAME)
agent = QLearningAgent(train_env)

iter_no = 0
best_reward = 0.0
while True:
    iter_no += 1
    s, a, r, next_s = agent.sample_env()
    agent.value_update(s, a, r, next_s)

    reward = 0.0
    for _ in range(TEST_EPISODES):
        reward += agent.play_episode(test_env)
        
    reward /= TEST_EPISODES
    if reward > best_reward:
        print("Best reward updated %.3f -> %.3f" % (best_reward, reward))
        best_reward = reward
        
    if reward > 0.80:
        print("Solved in %d iterations!" % iter_no)
        break

Best reward updated 0.000 -> 0.150
Best reward updated 0.150 -> 0.350
Best reward updated 0.350 -> 0.400
Best reward updated 0.400 -> 0.450
Best reward updated 0.450 -> 0.500
Best reward updated 0.500 -> 0.550
Best reward updated 0.550 -> 0.750
Best reward updated 0.750 -> 0.800
Best reward updated 0.800 -> 0.850
Solved in 3188 iterations!


In [10]:
agent.play_episode(test_env, render=True)

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FF

1.0

**[Back To Slides]**


# DQN (Deep Q Network)

In 2013, researchers at DeepMind presented one of the first models combining reinforcement learning with a convolutional neural network. They used the network to approximate the Q function, with the state given by raw pixels from Atari 2600 games. This model outperformed all previous approaches on six of the games and outperformed human experts on three of them.


## Network Training
Frames are cropped to 84x84 regions that capture the game playing area and converted to grayscale, then 4 frames are stacked to capture movement at each step. The resulting stack of frames is used as the state at each step.
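
A minimal sketch of the grayscale-and-stack idea, written in plain Python on tiny frames so that it runs without extra dependencies. The luminance weights (ITU-R BT.601) and the `FrameStack` helper are assumptions for illustration, not the exact preprocessing used by the authors, and cropping is omitted.

```python
from collections import deque

STACK_SIZE = 4


def to_grayscale(frame_rgb):
    """Convert rows of (r, g, b) pixels to luminance values (BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame_rgb]


class FrameStack:
    """Hypothetical helper: keeps the most recent frames as the state."""

    def __init__(self, size=STACK_SIZE):
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        gray = to_grayscale(first_frame)
        # Repeat the first frame so the state has a fixed shape from the start.
        for _ in range(self.frames.maxlen):
            self.frames.append(gray)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(to_grayscale(frame))
        return list(self.frames)


# Tiny 2x2 frames stand in for real game screens.
black = [[(0, 0, 0)] * 2 for _ in range(2)]
white = [[(255, 255, 255)] * 2 for _ in range(2)]
stack = FrameStack()
state = stack.reset(black)
state = stack.push(white)
```

After the `push`, `state` holds three black frames followed by one white frame, so the difference between consecutive frames (the movement) is visible to the network. In practice the same logic would usually be written with NumPy arrays.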

The network is then trained with RMSprop on the mean squared error between the $Q(s, a)$ computed by the network and a target built from the reward actually received, $r + \gamma\max_{a'} Q(s', a')$. 10,000,000 frames were used to train for each game.
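
A minimal sketch of how such a squared-error loss can be formed for a batch of transitions, assuming the usual one-step target $r + \gamma\max_{a'} Q(s', a')$ and treating terminal transitions as having no future value. The numbers and function names are made up for illustration; a real implementation would operate on tensors.

```python
GAMMA = 0.9  # hypothetical value for this sketch


def td_targets(rewards, dones, next_q_values, gamma=GAMMA):
    """One-step targets: r if the episode ended, else r + gamma * max Q(s', a')."""
    return [r if done else r + gamma * max(q_next)
            for r, done, q_next in zip(rewards, dones, next_q_values)]


def mse_loss(predicted, targets):
    """Mean squared error between predicted Q(s, a) and the targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)


# Made-up batch of two transitions; the second one ends its episode.
rewards = [0.0, 1.0]
dones = [False, True]
next_q_values = [[0.5, 1.0], [0.3, 0.2]]   # Q(s', a') for every action a'
predicted = [0.4, 1.0]                     # Q(s, a) for the actions taken

targets = td_targets(rewards, dones, next_q_values)
loss = mse_loss(predicted, targets)
```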

## Experience Replay
Each state, action, reward, and new state (together known as a transition) is saved in a "replay memory", and at each step a random sample of transitions is drawn to train the network. This is known as experience replay, and it has a few benefits, including greater data efficiency (each transition can be used more than once) and more efficient learning (randomly sampled transitions are less correlated than sequential ones). It also helps avoid oscillation and divergence, because the training data no longer depends entirely on the model's parameters at that time. The replay memory holds the last 1,000,000 frames.

## Target Network
The authors of DQN followed up with another technique in which two separate Q networks are used: one that is trained, and one that computes the target values during training. Every 10,000 steps, the parameters of the trained network are copied to the target network. Holding the targets fixed between copies further reduces oscillation and divergence.
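
The periodic copy can be illustrated without any neural network at all. In this sketch a dictionary with a single number stands in for a network's weights, and `SYNC_EVERY = 3` is a hypothetical value chosen so the effect shows up within a few steps.

```python
import copy

SYNC_EVERY = 3  # hypothetical; the text above copies every 10,000 steps

online_params = {"w": 0.0}                    # stand-in for the trained network
target_params = copy.deepcopy(online_params)  # the target starts as a copy

history = []
for step in range(1, 8):
    online_params["w"] += 1.0                 # stand-in for one gradient update
    if step % SYNC_EVERY == 0:
        # Hard update: overwrite the target with the current trained weights.
        target_params = copy.deepcopy(online_params)
    history.append((step, online_params["w"], target_params["w"]))

for row in history:
    print("step %d: online %.1f, target %.1f" % row)
```

Between copies the target value stays frozen while the trained value keeps moving, which is what keeps the regression target from shifting on every update.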

Let's try a similar setup on a toy problem: Frozen Lake.

In [1]:
import time
import numpy as np
import collections

import torch
import torch.nn as nn
import torch.optim as optim
import gym

GAMMA = 0.9


Experience = collections.namedtuple('Experience', field_names=['state', 'action', 'reward', 'done', 'new_state'])


class ExperienceBuffer:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions), np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), np.array(next_states)


class Agent:
    def __init__(self, env, exp_buffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self._reset()

    def _reset(self):
        # use the agent's own environment rather than the global `env`
        self.state = self.env.reset()
        self.total_reward = 0.0

    def play_step(self, net, epsilon=0.0, device="cpu"):
        """Take one epsilon-greedy step and store the transition.

        Returns the episode's total reward if the episode just ended, else None.
        """
        done_reward = None

        if np.random.random() < epsilon:
            # explore: pick a random action
            action = self.env.action_space.sample()
        else:
            # exploit: pick the action with the highest predicted Q-value
            state_a = np.asarray([self.state])
            state_v = torch.tensor(state_a).to(device)
            q_vals_v = net(state_v)
            _, act_v = torch.max(q_vals_v, dim=1)
            action = int(act_v.item())

        # do step in the environment
        new_state, reward, is_done, _ = self.env.step(action)
        self.total_reward += reward

        exp = Experience(self.state, action, reward, is_done, new_state)
        self.exp_buffer.append(exp)
        self.state = new_state
        if is_done:
            done_reward = self.total_reward
            self._reset()
        return done_reward


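def calc_loss(batch, net, tgt_net, device="cpu"):
    states, actions, rewards, dones, next_states = batch

    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    # boolean mask; indexing with a uint8 (Byte) tensor is deprecated in recent PyTorch
    done_mask = torch.tensor(dones, dtype=torch.bool).to(device)

    # Q(s, a) for the actions that were actually taken
    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    # max over a' of Q(s', a') from the target network; terminal states have no future value
    next_state_values = tgt_net(next_states_v).max(1)[0]
    next_state_values[done_mask] = 0.0
    # the target is a constant: no gradient flows into the target network
    next_state_values = next_state_values.detach()

    # TD target: r + gamma * max over a' of Q(s', a')
    expected_state_action_values = next_state_values * GAMMA + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)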



    

In [2]:
# Same as wrapper code from above
# Wrap the environment so that we can change default observation
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        # change the observation space to a one-hot encoded version
        # we do this so that our neural network always receives a fixed-length float vector
        # this defines a vector of length N, with values from 0.0 to 1.0
        # (in gym, a Box is an n-dimensional array of bounded values, much like a tensor)
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), 
                                                dtype=np.float32)

    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        return res
    


In [3]:
DEFAULT_ENV_NAME = "FrozenLake-v0"

env = DiscreteOneHotWrapper(gym.make(DEFAULT_ENV_NAME))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n

print(obs_size,n_actions)

16 4


In [4]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, int(hidden_size/2)),
            nn.ReLU(),
            nn.Linear(int(hidden_size/2), n_actions)
        )

    def forward(self, x):
        return self.net(x)



HIDDEN_SIZE = 256

net = Net(obs_size, HIDDEN_SIZE, n_actions)
tgt_net = Net(obs_size, HIDDEN_SIZE, n_actions)
print(net)


print(obs_size,n_actions)

device = "cpu"

EPSILON_DECAY_LAST_FRAME = 10**5
EPSILON_START = 1.0
EPSILON_FINAL = 0.0

MEAN_REWARD_BOUND = 0.8
SYNC_TARGET_FRAMES = 50
BATCH_SIZE = 16
REPLAY_SIZE = 500
REPLAY_START_SIZE = 500
LEARNING_RATE = 1e-4

buffer = ExperienceBuffer(REPLAY_SIZE)
agent = Agent(env, buffer)
epsilon = EPSILON_START

optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
total_rewards = []
frame_idx = 0
ts_frame = 0
ts = time.time()
best_mean_reward = None

while True:
    frame_idx += 1
    epsilon = max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)

    reward = agent.play_step(net, epsilon, device=device)
    if reward is not None:
        total_rewards.append(reward)
        ts_frame = frame_idx
        
        mean_reward = np.mean(total_rewards[-100:])
        if frame_idx % 100==0:
            print("%d: done %d iterations, mean reward %.3f, eps %.2f" % (
                frame_idx, len(total_rewards), mean_reward, epsilon
            ))
        
        if best_mean_reward is None or best_mean_reward < mean_reward:
            torch.save(net.state_dict(), "model-best.dat")
            if best_mean_reward is not None:
                print("Best mean reward updated %.3f -> %.3f, model saved" % (best_mean_reward, mean_reward))
            best_mean_reward = mean_reward
        if mean_reward > MEAN_REWARD_BOUND:
            print("Solved in %d frames!" % frame_idx)
            break

    if len(buffer) < REPLAY_START_SIZE:
        continue

    if frame_idx % SYNC_TARGET_FRAMES == 0:
        tgt_net.load_state_dict(net.state_dict())

    optimizer.zero_grad()
    batch = buffer.sample(BATCH_SIZE)
    loss_t = calc_loss(batch, net, tgt_net, device=device)
    loss_t.backward()
    optimizer.step()


Net(
  (net): Sequential(
    (0): Linear(in_features=16, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=128, bias=True)
    (3): ReLU()
    (4): Linear(in_features=128, out_features=4, bias=True)
  )
)
16 4
Best mean reward updated 0.000 -> 0.077, model saved
300: done 35 iterations, mean reward 0.029, eps 1.00
400: done 51 iterations, mean reward 0.039, eps 1.00
3000: done 392 iterations, mean reward 0.030, eps 0.97
4000: done 529 iterations, mean reward 0.030, eps 0.96
6700: done 858 iterations, mean reward 0.010, eps 0.93
7100: done 912 iterations, mean reward 0.020, eps 0.93
7900: done 1021 iterations, mean reward 0.000, eps 0.92
8000: done 1037 iterations, mean reward 0.000, eps 0.92
8200: done 1063 iterations, mean reward 0.010, eps 0.92
8900: done 1145 iterations, mean reward 0.010, eps 0.91
9000: done 1161 iterations, mean reward 0.000, eps 0.91
9500: done 1227 iterations, mean reward 0.010, eps 0.91
9800: done 1269 iterations, mean 

Best mean reward updated 0.580 -> 0.590, model saved
Best mean reward updated 0.590 -> 0.600, model saved
Best mean reward updated 0.600 -> 0.610, model saved
Best mean reward updated 0.610 -> 0.620, model saved
103300: done 8920 iterations, mean reward 0.620, eps 0.00
Best mean reward updated 0.620 -> 0.630, model saved
Best mean reward updated 0.630 -> 0.640, model saved
Best mean reward updated 0.640 -> 0.650, model saved
Best mean reward updated 0.650 -> 0.660, model saved
Best mean reward updated 0.660 -> 0.670, model saved
Best mean reward updated 0.670 -> 0.680, model saved
Best mean reward updated 0.680 -> 0.690, model saved
Best mean reward updated 0.690 -> 0.700, model saved
Best mean reward updated 0.700 -> 0.710, model saved
Best mean reward updated 0.710 -> 0.720, model saved
Best mean reward updated 0.720 -> 0.730, model saved
Best mean reward updated 0.730 -> 0.740, model saved
108400: done 9053 iterations, mean reward 0.690, eps 0.00
112700: done 9163 iterations, mean r
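
The `eps` column in the log above follows the linear decay computed at the top of the training loop. Restated as a small function (the name `epsilon_at` is hypothetical; the constants are the ones defined in the cell), the schedule can be tabulated for a few frame counts:

```python
EPSILON_START = 1.0
EPSILON_FINAL = 0.0
EPSILON_DECAY_LAST_FRAME = 10**5


def epsilon_at(frame_idx):
    """Linear decay from EPSILON_START, clipped below at EPSILON_FINAL."""
    return max(EPSILON_FINAL,
               EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)


for frame in (0, 3000, 50_000, 100_000, 103_300):
    print("%6d -> eps %.2f" % (frame, epsilon_at(frame)))
```

With these settings exploration is switched off entirely from frame 100,000 onward. In the log, the run of "Best mean reward updated" lines around frame 103,300 is recorded after epsilon has already reached 0.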

# DQNs with Larger State Spaces

It does seem like the Deep Q Network learned, but it required much more hyperparameter tuning than our previous work and seemed very brittle with respect to the epsilon parameter and the replay buffer size.

In fact, these are all significant downsides of the Deep Q Network. For small state spaces, DQNs are not terribly advantageous. They really shine when the state space becomes intractable for a table, for example when it is continuous.

In [5]:
env = gym.make("CartPole-v0")

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n


In [7]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, int(hidden_size/2)),
            nn.ReLU(),
            nn.Linear(int(hidden_size/2), n_actions)
        )

    def forward(self, x):
        return self.net(x)



HIDDEN_SIZE = 64

net = Net(obs_size, HIDDEN_SIZE, n_actions)
tgt_net = Net(obs_size, HIDDEN_SIZE, n_actions)
print(net)


print(obs_size,n_actions)

device = "cpu"

EPSILON_DECAY_LAST_FRAME = 10**5
EPSILON_START = 1.0
EPSILON_FINAL = 0.0

MEAN_REWARD_BOUND = 195  # CartPole-v0 caps episodes at 200 steps and counts as solved at a mean of 195
SYNC_TARGET_FRAMES = 50
BATCH_SIZE = 16
REPLAY_SIZE = 500
REPLAY_START_SIZE = 500
LEARNING_RATE = 1e-4

buffer = ExperienceBuffer(REPLAY_SIZE)
agent = Agent(env, buffer)
epsilon = EPSILON_START

optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
total_rewards = []
frame_idx = 0
ts_frame = 0
ts = time.time()
best_mean_reward = None

while True:
    frame_idx += 1
    epsilon = max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)

    reward = agent.play_step(net, epsilon, device=device)
    if reward is not None:
        total_rewards.append(reward)
        ts_frame = frame_idx
        
        mean_reward = np.mean(total_rewards[-100:])
        if frame_idx % 100==0:
            print("%d: done %d iterations, mean reward %.3f, eps %.2f" % (
                frame_idx, len(total_rewards), mean_reward, epsilon
            ))
        
        if best_mean_reward is None or best_mean_reward < mean_reward:
            torch.save(net.state_dict(), "model-best.dat")
            if best_mean_reward is not None:
                print("Best mean reward updated %.3f -> %.3f, model saved" % (best_mean_reward, mean_reward))
            best_mean_reward = mean_reward
        if mean_reward >= MEAN_REWARD_BOUND:  # a strict "> 200" can never be met when episodes cap at 200
            print("Solved in %d frames!" % frame_idx)
            break

    if len(buffer) < REPLAY_START_SIZE:
        continue

    if frame_idx % SYNC_TARGET_FRAMES == 0:
        tgt_net.load_state_dict(net.state_dict())

    optimizer.zero_grad()
    batch = buffer.sample(BATCH_SIZE)
    loss_t = calc_loss(batch, net, tgt_net, device=device)
    loss_t.backward()
    optimizer.step()



Net(
  (net): Sequential(
    (0): Linear(in_features=4, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=2, bias=True)
  )
)
4 2
Best mean reward updated 17.000 -> 17.286, model saved
Best mean reward updated 17.286 -> 17.300, model saved
Best mean reward updated 17.300 -> 20.455, model saved
Best mean reward updated 20.455 -> 20.889, model saved


RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #4 'mat1'
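
The traceback above points at a dtype mismatch: `nn.Linear` layers hold float32 weights by default, while the CartPole observations evidently arrive as float64, which `torch.tensor` turns into a Double tensor. One possible fix, sketched here with NumPy only, is to cast every observation to float32 before it reaches the replay buffer or the network. The helper name `to_float32` and the sample observation are hypothetical. In the notebook, the same cast could live in a `gym.ObservationWrapper`, the way `DiscreteOneHotWrapper` converts the Frozen Lake observations.

```python
import numpy as np


def to_float32(observation):
    """Return the observation as a float32 array."""
    return np.asarray(observation, dtype=np.float32)


# Hypothetical CartPole-like observation: 4 values stored as float64.
obs = np.array([0.01, -0.02, 0.03, 0.04])
obs32 = to_float32(obs)
print(obs.dtype, "->", obs32.dtype)
```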

# Deep Q Learning with Atari Games

https://github.com/Shmuma/Deep-Reinforcement-Learning-Hands-On/blob/master/Chapter06/02_dqn_pong.py

In [None]:
from q_learn_utils import make_env
from q_learn_utils import DQN

import time
import numpy as np
import collections

import torch
import torch.nn as nn
import torch.optim as optim


DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
MEAN_REWARD_BOUND = 19.5

GAMMA = 0.99
BATCH_SIZE = 32
REPLAY_SIZE = 10000
LEARNING_RATE = 1e-4
SYNC_TARGET_FRAMES = 1000
REPLAY_START_SIZE = 10000

EPSILON_DECAY_LAST_FRAME = 10**5
EPSILON_START = 1.0
EPSILON_FINAL = 0.02


Experience = collections.namedtuple('Experience', 
                                    field_names=['state', 'action', 'reward', 'done', 'new_state'])


class ExperienceBuffer:
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])
        return np.array(states), np.array(actions), np.array(rewards, dtype=np.float32), \
               np.array(dones, dtype=np.uint8), np.array(next_states)


class Agent:
    def __init__(self, env, exp_buffer):
        self.env = env
        self.exp_buffer = exp_buffer
        self._reset()

    def _reset(self):
        # use the agent's own environment rather than the global `env`
        self.state = self.env.reset()
        self.total_reward = 0.0

    def play_step(self, net, epsilon=0.0, device="cpu"):
        done_reward = None

        if np.random.random() < epsilon:
            # explore: pick a random action
            action = self.env.action_space.sample()
        else:
            # exploit: pick the action with the highest predicted Q-value
            state_a = np.asarray([self.state])
            state_v = torch.tensor(state_a).to(device)
            q_vals_v = net(state_v)
            _, act_v = torch.max(q_vals_v, dim=1)
            action = int(act_v.item())

        # do step in the environment
        new_state, reward, is_done, _ = self.env.step(action)
        self.total_reward += reward

        exp = Experience(self.state, action, reward, is_done, new_state)
        self.exp_buffer.append(exp)
        self.state = new_state
        if is_done:
            done_reward = self.total_reward
            self._reset()
        return done_reward


def calc_loss(batch, net, tgt_net, device="cpu"):
    states, actions, rewards, dones, next_states = batch

    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    # boolean mask; indexing with a uint8 (Byte) tensor is deprecated in recent PyTorch
    done_mask = torch.tensor(dones, dtype=torch.bool).to(device)

    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    next_state_values = tgt_net(next_states_v).max(1)[0]
    next_state_values[done_mask] = 0.0
    next_state_values = next_state_values.detach()

    expected_state_action_values = next_state_values * GAMMA + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)



device = "cpu"

env = make_env(DEFAULT_ENV_NAME)

net = DQN(env.observation_space.shape, env.action_space.n).to(device)
tgt_net = DQN(env.observation_space.shape, env.action_space.n).to(device)
print(net)

buffer = ExperienceBuffer(REPLAY_SIZE)
agent = Agent(env, buffer)
epsilon = EPSILON_START

optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
total_rewards = []
frame_idx = 0
ts_frame = 0
ts = time.time()
best_mean_reward = None


In [3]:
#continue training (no resets of the Agent or training values)
while True:
    frame_idx += 1
    epsilon = max(EPSILON_FINAL, EPSILON_START - frame_idx / EPSILON_DECAY_LAST_FRAME)

    reward = agent.play_step(net, epsilon, device=device)
    if reward is not None:
        total_rewards.append(reward)
        speed = (frame_idx - ts_frame) / (time.time() - ts)
        ts_frame = frame_idx
        ts = time.time()
        mean_reward = np.mean(total_rewards[-100:])
        print("%d: done %d games, mean reward %.3f, eps %.2f, speed %.2f f/s" % (
            frame_idx, len(total_rewards), mean_reward, epsilon,
            speed
        ))
        
        if best_mean_reward is None or best_mean_reward < mean_reward:
            torch.save(net.state_dict(), DEFAULT_ENV_NAME + "-best.dat")
            if best_mean_reward is not None:
                print("Best mean reward updated %.3f -> %.3f, model saved" % (best_mean_reward, mean_reward))
            best_mean_reward = mean_reward
        if mean_reward > MEAN_REWARD_BOUND:
            print("Solved in %d frames!" % frame_idx)
            break

    if len(buffer) < REPLAY_START_SIZE:
        continue

    if frame_idx % SYNC_TARGET_FRAMES == 0:
        tgt_net.load_state_dict(net.state_dict())

    optimizer.zero_grad()
    batch = buffer.sample(BATCH_SIZE)
    loss_t = calc_loss(batch, net, tgt_net, device=device)
    loss_t.backward()
    optimizer.step()

420406: done 235 games, mean reward 12.430, eps 0.02, speed 0.72 f/s
Best mean reward updated 12.220 -> 12.430, model saved
422326: done 236 games, mean reward 12.630, eps 0.02, speed 5.20 f/s
Best mean reward updated 12.430 -> 12.630, model saved
424148: done 237 games, mean reward 12.860, eps 0.02, speed 5.54 f/s
Best mean reward updated 12.630 -> 12.860, model saved
426031: done 238 games, mean reward 13.110, eps 0.02, speed 5.52 f/s
Best mean reward updated 12.860 -> 13.110, model saved
428313: done 239 games, mean reward 13.280, eps 0.02, speed 5.49 f/s
Best mean reward updated 13.110 -> 13.280, model saved
430085: done 240 games, mean reward 13.540, eps 0.02, speed 5.61 f/s
Best mean reward updated 13.280 -> 13.540, model saved
432531: done 241 games, mean reward 13.710, eps 0.02, speed 5.34 f/s
Best mean reward updated 13.540 -> 13.710, model saved
434197: done 242 games, mean reward 13.960, eps 0.02, speed 5.68 f/s
Best mean reward updated 13.710 -> 13.960, model saved
435898: 

KeyboardInterrupt: 

In [1]:
import gym
import time
import numpy as np

import torch

from q_learn_utils import make_env
from q_learn_utils import DQN

import collections

DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
RENDER = True

env = make_env(DEFAULT_ENV_NAME)

test_net = DQN(env.observation_space.shape, env.action_space.n)
test_net.load_state_dict(torch.load(DEFAULT_ENV_NAME + "-best_Spring2019.dat", map_location=lambda storage, loc: storage))






In [4]:
state = env.reset()
total_reward = 0.0
c = collections.Counter()

while True:
    start_ts = time.time()
    if RENDER:
        env.render()
    state_v = torch.tensor(np.asarray([state]))
    # greedy action: no exploration at evaluation time
    q_vals = test_net(state_v).detach().numpy()[0]
    action = int(np.argmax(q_vals))
    c[action] += 1
    state, reward, done, _ = env.step(action)
    total_reward += reward
    if done:
        break
    if RENDER:
        delta = 1/60 - (time.time() - start_ts)
        if delta > 0:
            time.sleep(delta)
print("Total reward: %.2f" % total_reward)
print("Action counts:", c)

Total reward: 20.00
Action counts: Counter({2: 428, 5: 390, 3: 331, 4: 228, 0: 198, 1: 191})


In [5]:
env.close()