
td3_implementation analysis #10

Open
CUN-bjy opened this issue Jan 28, 2021 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@CUN-bjy
Owner

CUN-bjy commented Jan 28, 2021

The first TD3 implementation does not work well..

So I have to analyze each part that differs from ddpg.

@CUN-bjy CUN-bjy created this issue from a note in gym-td3-keras (In progress) Jan 28, 2021
@CUN-bjy CUN-bjy self-assigned this Jan 28, 2021
@CUN-bjy CUN-bjy added the bug Something isn't working label Jan 28, 2021
@CUN-bjy
Owner Author

CUN-bjy commented Jan 28, 2021

[TEST 1]

  • rebase to ddpg
  • add target policy smoothing term (see the sketch after this test).
  • test on RoboschoolInvertedPendulum-v1.
  • batch_size -> 64, hidden_layers -> 24, 16 (for both the actor and the critic)
  • lr -> 1e-4, 1e-3, tau -> 1e-3, 1e-3 (for the actor and the critic, respectively)
  • it works. (because of an exploration problem, I tested 5 more times; the success rate is under 30%.)
    ✔️
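
For reference, here is a minimal NumPy sketch of the target policy smoothing term as I understand it from the TD3 paper; sigma, noise_clip and action_bound are illustrative values, not necessarily the ones used in this repo:

	import numpy as np

	def smoothed_target_action(target_actor, next_obs, action_bound=1.0,
	                           sigma=0.2, noise_clip=0.5):
	    """Target policy smoothing: add clipped Gaussian noise to the
	    target actor's action before feeding it into the target critic(s)."""
	    a_next = target_actor(next_obs)  # mu'(s')
	    noise = np.clip(np.random.normal(0.0, sigma, size=np.shape(a_next)),
	                    -noise_clip, noise_clip)
	    return np.clip(a_next + noise, -action_bound, action_bound)

	# usage with any callable target actor, e.g. a Keras model:
	# a_tilde = smoothed_target_action(lambda s: actor_target.predict(s), next_obs_batch)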

[TEST 2]

  • basically tested on InvertedPendulum, with almost the same parameters as above, plus the target policy smoothing term.
  • changed parameters: lr -> 3e-4, 3e-4, tau -> 5e-3, 5e-3 (same as the original td3 code)
  • it works. (it also has the exploration problem; the success rate is not high.)
    ✔️

[TEST 3]

  • add delayed policy update term (see the sketch below).
  • delayed update period -> 2
  • it works. (it has the exploration problem, the success rate is too low, and learning a good policy feels delayed)
    ✔️
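
A rough sketch of how the delayed policy update could sit inside the training loop; update_interval, train_critic, train_actor and soft_update are hypothetical names used only for illustration:

	update_interval = 2  # the delayed update period used in TEST 3

	def td3_update(step, batch, train_critic, train_actor, soft_update):
	    # the critic(s) are trained at every step
	    train_critic(batch)
	    # the actor and the target networks are updated only every `update_interval` steps
	    if step % update_interval == 0:
	        train_actor(batch)
	        soft_update()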

@CUN-bjy
Owner Author

CUN-bjy commented Jan 28, 2021

[TEST 4]

  • rebase to ddpg
  • add target policy smoothing term.
  • test on RoboschoolInvertedPendulum-v1.
  • batch_size -> 64, hidden_layers -> 24, 16 (for both the actor and the critic)
  • lr -> 3e-4, 3e-4, tau -> 5e-3, 5e-3 (for the actor and the critic, respectively)
  • (reset policy update interval to 1)
  • add double clipped Q update term
  • but only use Q1 for the target update
  • it works.
    ✔️

[TEST 5]

  • same as the above experiment.
  • use both Q1 and Q2 for the target update
  • it doesn't work well..

(Screenshots from 2021-01-28 19-19-59 and 19-19-07.)

[TEST 6]

  • test on RoboschoolInvertedPendulum-v1.
  • full TD3 set:
  • add target policy smoothing term.
  • add delayed policy update term. update_interval -> 2
  • add double clipped Q update term
  • batch_size -> 64, hidden_layers -> 24, 16 (for both the actor and the critic)
  • lr -> 3e-4, 3e-4, tau -> 5e-3, 5e-3 (for the actor and the critic, respectively)
  • it also doesn't work well.. (it takes a long time to learn the policy, and then it gets worse)

👀 I think my implementation of the double clipped Q update has some problems (see the sketch below for reference).
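
For comparison, this is the clipped double Q target from the TD3 paper written as a NumPy sketch; critic1_target and critic2_target stand in for the two target critics and are not the repo's actual function names:

	import numpy as np

	def clipped_double_q_target(rewards, dones, next_obs, a_tilde,
	                            critic1_target, critic2_target, gamma=0.99):
	    """y = r + gamma * (1 - done) * min(Q1'(s', a~), Q2'(s', a~))"""
	    q1 = critic1_target(next_obs, a_tilde)
	    q2 = critic2_target(next_obs, a_tilde)
	    q_min = np.minimum(q1, q2)  # take the smaller estimate to fight overestimation
	    return rewards + gamma * (1.0 - dones) * q_min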

@CUN-bjy
Owner Author

CUN-bjy commented Jan 30, 2021

[TEST 7]

  • test on RoboschoolInvertedPendulum-v1.
  • add double clipped Q update term (only)
  • batch_size -> 64, hidden_layers -> 24, 16 (for both the actor and the critic)
  • lr -> 3e-4, 3e-4, tau -> 5e-3, 5e-3 (for the actor and the critic, respectively)
  • yes.. it doesn't work.. A catastrophic forgetting problem happens, even within a single task.

[TEST 8]

  • same conditions as above.
  • changed some code so that each critic has its own independent optimizer (see the sketch below)
    (Screenshots from 2021-02-06 21-47-36 and 21-47-41.)
  • I think it is working, but it is very slow compared to simple ddpg.
  • the double clipped Q update fixes the overestimation problem, but it makes the policy update very slow.
  • I should try the integrated system and then change some parameters..
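
A sketch of what building each critic with its own optimizer could look like in Keras (tf.keras names assumed here; the 24/16 hidden sizes just mirror the settings above and are not taken from the actual code):

	import tensorflow as tf
	from tensorflow.keras import layers, Model, optimizers

	def build_critic(obs_dim, act_dim, lr=3e-4):
	    """One Q-network with its own Adam optimizer (24/16 hidden units)."""
	    obs_in = layers.Input(shape=(obs_dim,))
	    act_in = layers.Input(shape=(act_dim,))
	    x = layers.Concatenate()([obs_in, act_in])
	    x = layers.Dense(24, activation='relu')(x)
	    x = layers.Dense(16, activation='relu')(x)
	    q = layers.Dense(1)(x)
	    model = Model([obs_in, act_in], q)
	    # each critic gets its own optimizer instance, so Q1 and Q2 share no optimizer state
	    model.compile(optimizer=optimizers.Adam(learning_rate=lr), loss='mse')
	    return model

	# critic_1 = build_critic(obs_dim, act_dim)
	# critic_2 = build_critic(obs_dim, act_dim)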

@CUN-bjy
Owner Author

CUN-bjy commented Feb 7, 2021

[TEST 9]

  • integrated all of it (see the combined sketch below).. -> doesn't work..
    (Plots: reward and critic_loss curves.)
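
For reference, a rough sketch of how the three components could fit together in one training step. It reuses the hypothetical helpers from the sketches above (smoothed_target_action, clipped_double_q_target, build_critic) and the other names are also assumptions, not the repo's actual train loop:

	def td3_train_step(step, batch, actor_target,
	                   critic_1, critic_2, critic1_target, critic2_target,
	                   train_actor, soft_update_targets,
	                   gamma=0.99, update_interval=2):
	    obs, actions, rewards, dones, next_obs = batch

	    # 1) target policy smoothing (sketch under TEST 1)
	    a_tilde = smoothed_target_action(actor_target, next_obs)

	    # 2) clipped double Q target, shared by both critics (sketch under TEST 6)
	    y = clipped_double_q_target(rewards, dones, next_obs, a_tilde,
	                                critic1_target, critic2_target, gamma)
	    critic_1.train_on_batch([obs, actions], y)
	    critic_2.train_on_batch([obs, actions], y)

	    # 3) delayed policy and target updates (sketch under TEST 3)
	    if step % update_interval == 0:
	        train_actor(obs)         # e.g. gradient ascent on Q1(s, actor(s))
	        soft_update_targets()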

@CUN-bjy
Owner Author

CUN-bjy commented Feb 7, 2021

[TEST 10]

  • added an initial random policy for exploration (see the sketch at the end of this comment)
  • and there was a mistake in the code..

before

		a = agent.make_action(obs,t)
		action = np.argmax(a) if is_discrete else a

		# do step on gym at t-step
		new_obs, reward, done, info = env.step(action)

		# store the results to buffer
		agent.memorize(obs, a, reward, done, new_obs)
		# should've memorized the action w/ noise!!

after

		a = agent.make_action(obs,t)
		action = np.argmax(a) if is_discrete else a

		# do step on gym at t-step
		new_obs, reward, done, info = env.step(action) 

		# store the results to buffer	
		agent.memorize(obs, action, reward, done, new_obs)

but, consequently, it doesn't work..
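
A sketch of the initial-random-policy idea from the first bullet: take uniformly random actions for the first warmup_steps environment steps before switching to the agent's policy. warmup_steps is an illustrative value and select_action is a hypothetical helper, not the repo's API:

	warmup_steps = 1000  # illustrative value

	def select_action(agent, env, obs, t):
	    """Uniform random actions during warm-up, then the agent's (noisy) policy."""
	    if t < warmup_steps:
	        return env.action_space.sample()  # pure exploration at the start
	    return agent.make_action(obs, t)      # the agent's own exploration afterwards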

@CUN-bjy
Owner Author

CUN-bjy commented Feb 24, 2021

[TEST 11]

  • use an OU noise process as the off-policy exploration strategy (see the sketch below)
    (Plots: reward and critic_loss curves.)

It also doesn't work.
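
For reference, a minimal Ornstein-Uhlenbeck noise process in NumPy; mu, theta, sigma and dt are common default values, not necessarily the ones used in this repo:

	import numpy as np

	class OUNoise:
	    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
	    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
	        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
	        self.x = np.full(dim, mu, dtype=np.float64)

	    def sample(self):
	        dx = (self.theta * (self.mu - self.x) * self.dt
	              + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.x.shape))
	        self.x = self.x + dx
	        return self.x

	# noise = OUNoise(dim=action_dim)
	# noisy_action = np.clip(actor_output + noise.sample(), -1.0, 1.0)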

@CUN-bjy
Owner Author

CUN-bjy commented Feb 25, 2021

[TEST 12]

  • use an OU noise process as the off-policy exploration strategy
  • without BatchNormalization, weight regularization, or the Glorot initializer on the actor and critic (see the sketch below).
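
A sketch of what such a stripped-down actor might look like in Keras (tf.keras names assumed; the 24/16 layer sizes just mirror the earlier tests): no BatchNormalization layers, no kernel regularizers, and no explicitly specified initializers.

	import tensorflow as tf
	from tensorflow.keras import layers, Model

	def build_plain_actor(obs_dim, act_dim, action_bound=1.0):
	    """Plain actor: no BatchNormalization, no kernel_regularizer,
	    and no explicit kernel_initializer arguments."""
	    obs_in = layers.Input(shape=(obs_dim,))
	    x = layers.Dense(24, activation='relu')(obs_in)
	    x = layers.Dense(16, activation='relu')(x)
	    raw = layers.Dense(act_dim, activation='tanh')(x)
	    scaled = layers.Lambda(lambda a: a * action_bound)(raw)
	    return Model(obs_in, scaled)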
