
ExperienceReplayBuffer stores second-last transition twice #25

Closed
Phlogiston90 opened this issue May 5, 2019 · 2 comments

@Phlogiston90

Hi Maxim,
first of all, thank you so much for the book! It is helping me a lot with my thesis!

Second, I think the ExperienceReplayBuffer stores the second-last transition of each episode twice, which could bias training in an environment with only a few steps per episode (like mine), since the duplicated transition would be sampled more often than the others.
Maybe I have overlooked something, but here is a minimal example showing the described behaviour:

import ptan
import torch
import torch.nn as nn
import gym

EPSILON_START = 1.0
GAMMA = 1.0
REWARD_STEPS = 1
MAX_STEPS = 20
REPLAY_SIZE = 10
device = torch.device("cpu")

class Environment(gym.Env):
    """Toy chain environment: states 0 -> 1 -> 2 -> 3 -> 4, terminating
    at state 4 regardless of the action; the reward is the new state."""
    def __init__(self):
        self.state = 0
        self.observation_space = gym.spaces.Discrete(5)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += 1
        done = self.state == 4
        reward = self.state
        # gym's API expects the info value to be a dict, not None
        return self.state, reward, done, {}

class DQN(nn.Module):
    """Dummy network returning random Q-values; the values themselves
    are irrelevant for reproducing the buffer behaviour."""
    def __init__(self, input_shape, n_actions):
        super(DQN, self).__init__()
        self.n_actions = n_actions

    def forward(self, x):
        # one row of random Q-values per state in the batch
        return torch.rand(x.size(0), self.n_actions)

env = Environment()
net = DQN(env.observation_space.shape, env.action_space.n).to(device)
selector = ptan.actions.EpsilonGreedyActionSelector(EPSILON_START)
agent = ptan.agent.DQNAgent(net, selector, device=device)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, GAMMA, steps_count=REWARD_STEPS)
buffer = ptan.experience.ExperienceReplayBuffer(exp_source, REPLAY_SIZE)

step_idx = 0

while step_idx < MAX_STEPS:
    step_idx += 1
    buffer.populate(1)
    new_rewards = exp_source.pop_rewards_steps()
    
    if new_rewards:
        print("episode over: step {}: (total_reward, steps) = {}".format(step_idx, new_rewards[0]))
        
print()
print(*buffer.buffer, sep='\n')

The output is:

episode over: step 6: (total_reward, steps) = (10.0, 4)
episode over: step 11: (total_reward, steps) = (10.0, 4)
episode over: step 16: (total_reward, steps) = (10.0, 4)

ExperienceFirstLast(state=0, action=0, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=0, reward=2.0, last_state=2)
ExperienceFirstLast(state=2, action=0, reward=3.0, last_state=3)
ExperienceFirstLast(state=2, action=0, reward=3.0, last_state=3)
ExperienceFirstLast(state=3, action=0, reward=4.0, last_state=None)
ExperienceFirstLast(state=0, action=0, reward=1.0, last_state=1)
ExperienceFirstLast(state=1, action=0, reward=2.0, last_state=2)
ExperienceFirstLast(state=2, action=0, reward=3.0, last_state=3)
ExperienceFirstLast(state=2, action=0, reward=3.0, last_state=3)
ExperienceFirstLast(state=3, action=1, reward=4.0, last_state=None)

Notice how the transition from state 2 to state 3 is stored twice in each episode. I am using the ptan version that you can get via pip today. Could you take a look at this?
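
In the meantime, I am working around it by dropping immediate repeats before using the buffer contents. This is just my own sketch (drop_consecutive_duplicates is not part of ptan), and it relies on the states being plain ints as in this toy environment, so the namedtuple equality check is reliable:

def drop_consecutive_duplicates(experiences):
    # keep an experience only if it differs from the one right before it;
    # ExperienceFirstLast is a namedtuple, so != compares all fields
    cleaned = []
    for exp in experiences:
        if not cleaned or exp != cleaned[-1]:
            cleaned.append(exp)
    return cleaned

print(*drop_consecutive_duplicates(buffer.buffer), sep='\n')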
Best regards!

@Shmuma
Owner

Shmuma commented May 6, 2019

Hi!

Thanks for reporting! I'll take a look at this.

Shmuma added a commit that referenced this issue May 6, 2019
@Shmuma
Copy link
Owner

Shmuma commented May 6, 2019

That's indeed a bug; it is fixed in my dev branch: https://github.com/Shmuma/ptan/tree/torch-1.0.1
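
If you want to try it before the fix lands in a pip release, you should be able to install ptan directly from that branch (assuming you have git available):

pip install git+https://github.com/Shmuma/ptan@torch-1.0.1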
