Curriculum learning for grasping environment #62

Closed
10 of 23 tasks
AndrejOrsula opened this issue Mar 10, 2021 · 1 comment
Labels
EPIC 🦄 As epic as it can ever get

Comments

@AndrejOrsula
Owner

AndrejOrsula commented Mar 10, 2021

Idea/topic 1: Decouple the entire Grasp task into sub-routines (primitive tasks) and train them individually until fully mastered (success rate and/or reward above a certain value)

Here is a list of examples that could serve as such sub-routines. For each of them, there is a corresponding termination condition for success, as well as sparse and dense reward options (and their alternatives). A minimal sketch of two of these dense rewards is given after the list.

  1. Robot must approach the object
    • Termination (success):
      • Distance between robot tool centre point (TCP) and the closest object is less than a threshold
      • Robot finger(s) collide with the object
    • Reward
      • Sparse reward:
        • Constant positive reward once episode is terminated due to success
      • Dense reward:
        • (relative +-) Positive/negative reward based on how much closer/further the robot TCP is to the closest object compared to the previous step
        • (absolute) Negative distance between robot TCP and closest object
  2. Robot must touch the object
    • Termination (success):
      • Any finger must be in contact with an object
    • Reward
      • Sparse reward:
        • Constant positive reward once episode is terminated due to success
      • Dense reward:
        • ?
  3. Robot must grasp the object
    • Termination (success):
      • Fingers must be in contact with an object (contact normals cannot point in the same direction, i.e. no pushing of object)
      • Same as above, but for X number of consecutive steps
        • This would be preferred; however, the agent currently has no temporal information in its observations
    • Reward
      • Sparse reward:
        • Constant positive reward once episode is terminated due to success
      • Dense reward:
        • ? Reward engineering for this one is quite difficult, not to speak of the quality of the grasp.
          • One could give a small reward if a part of the object's geometry is located between the fingers. Alternatively, one could look at the distance to the object along each finger's actuation direction. A negative reward could also be given for each step the gripper is closed and not contacting any object.
          • For now, sparse reward might be much more descriptive.
  4. Robot must lift the grasped object
    • Termination (success):
      • An object is lifted above a certain height threshold while being in contact with the fingers (contact normals cannot point in the same direction, i.e. no pushing of object)
    • Reward
      • Sparse reward:
        • Constant positive reward once episode is terminated due to success
      • Dense reward:
        • (relative +-) Positive/negative reward based on how much higher/lower an object is compared to the previous step
          • The object must be in contact with fingers.
        • (relative +[-]) Positive/negative reward based on how much higher/lower an object is compared to the previous step
          • The object must be in contact with fingers if it is higher. No negative reward will be given if the object is falling (with no contact).
        • (relative +) Only positive reward based on how much higher an object is compared to the previous step
        • (absolute) Negative distance between robot TCP and closest object
  5. One could continue from this (or the previous) step to other actions, e.g. placing. I am not looking into that within the scope of this project.
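
As referenced above, here is a minimal sketch of how the dense rewards for the approach (1) and lift (4) sub-tasks could be computed. The state container and all names are illustrative only and do not correspond to this project's actual task API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GraspStateSketch:
    """Illustrative per-step state; the real task exposes this information differently."""
    tcp_position: np.ndarray      # (3,) robot TCP position
    object_position: np.ndarray   # (3,) position of the closest object
    object_in_contact: bool       # whether any finger is in contact with the object


def approach_reward(prev: GraspStateSketch, curr: GraspStateSketch,
                    relative: bool = True, scale: float = 1.0) -> float:
    """Dense reward for sub-task 1 (approach the object)."""
    curr_dist = float(np.linalg.norm(curr.tcp_position - curr.object_position))
    if relative:
        # (relative +-): reward the decrease in TCP-to-object distance since the previous step
        prev_dist = float(np.linalg.norm(prev.tcp_position - prev.object_position))
        return scale * (prev_dist - curr_dist)
    # (absolute): negative distance between the TCP and the closest object
    return -scale * curr_dist


def lift_reward(prev: GraspStateSketch, curr: GraspStateSketch,
                scale: float = 1.0) -> float:
    """Dense reward for sub-task 4 (lift the grasped object), (relative +[-]) variant."""
    delta_height = float(curr.object_position[2] - prev.object_position[2])
    if curr.object_in_contact:
        # Reward/penalize the height gained/lost while the object is held
        return scale * delta_height
    # No negative reward if the object falls without being in contact
    return 0.0
```

The same relative/absolute split would apply to the touch and grasp sub-tasks, where the sparse success bonus is likely the safer starting point.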

Variant a (Selected): Update termination state and reward function to the next sub-task only after the current sub-task performs well

Start training from the first sub-task, including only the termination and reward for that task. After a certain reward is accumulated (or success rate is achieved), update the termination and reward function to take the next sub-task into account. For the reward, this could be done in a few ways (see the sketch after this list):

  • Use only the reward for the latest objective and rely on the replay buffer and current policy so that the agent keeps reaching the goals of the previous sub-tasks
  • Keep both the previous and the new reward function
  • Keep both the previous and the new reward function, but down-scale the reward of the previous sub-task by some constant factor
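
Below is a minimal sketch of Variant a, assuming per-sub-task reward callables such as the `approach_reward`/`lift_reward` above; the threshold, window and down-scaling values are placeholders rather than settings from this project.

```python
from collections import deque


class RewardCurriculumSketch:
    """Switch to the next sub-task's reward once the current sub-task performs well."""

    def __init__(self, stage_rewards, success_threshold=0.8, window=100,
                 previous_scale=0.1):
        self.stage_rewards = stage_rewards        # one reward callable per sub-task, in order
        self.success_threshold = success_threshold
        self.previous_scale = previous_scale      # 0.0 keeps only the latest reward, 1.0 keeps all
        self.successes = deque(maxlen=window)     # rolling window of episode outcomes
        self.stage = 0

    def report_episode(self, success: bool) -> None:
        """Record an episode outcome and advance to the next sub-task once mastered."""
        self.successes.append(success)
        if (len(self.successes) == self.successes.maxlen
                and sum(self.successes) / len(self.successes) >= self.success_threshold
                and self.stage < len(self.stage_rewards) - 1):
            self.stage += 1
            self.successes.clear()

    def reward(self, prev_state, curr_state) -> float:
        """Reward of the current sub-task plus down-scaled rewards of previous sub-tasks."""
        total = self.stage_rewards[self.stage](prev_state, curr_state)
        for previous in self.stage_rewards[:self.stage]:
            total += self.previous_scale * previous(prev_state, curr_state)
        return total
```

Setting `previous_scale` to 0.0 corresponds to the first option above, 1.0 to the second, and anything in between to the third.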

Variant b: Design separate tasks (environments) - not really curriculum learning, but it might be of interest

In this variant, each sub-task would have its own task (environment) that provides an adequate starting position and termination condition. Once each sub-task reaches a relatively high success rate, save transitions from all sub-tasks into a single replay buffer and use them as demonstrations of each step to train the agent from start to finish. This could also be done in an offline fashion.

The advantage of this approach is the clear separation of the goals that the agent should reach. The disadvantage is that the agent might not be as robust and might not learn the connections/transitions between the sub-tasks correctly.
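
A rough sketch of Variant b under the usual Gym-style step/reset interface; the environment list, the per-task policies and the replay buffer's `add()` signature are all assumptions, not this project's actual classes.

```python
def collect_sub_task_demonstrations(envs, policies, episodes_per_env, replay_buffer):
    """Roll out a (near-)mastered policy in each sub-task environment and pool the transitions.

    The pooled buffer can then serve as demonstrations for training the full grasping
    task from start to finish, or for offline training.
    """
    for env, policy in zip(envs, policies):
        for _ in range(episodes_per_env):
            observation = env.reset()
            done = False
            while not done:
                action = policy(observation)
                next_observation, reward, done, info = env.step(action)
                replay_buffer.add(observation, action, reward, next_observation, done)
                observation = next_observation
```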

Idea/topic 2: Restrict action space (and workspace) of robot and progressively increase it

Restricting the action space could aid exploration because less randomness would be required to reach a rewarding state. The workspace can grow with the success rate / average reward, or simply with the number of environment steps.

The action space can be restricted by:

  • Position goal: using a growing axis-aligned bounding box should do the job
    • Volume in which objects are spawned must be adjusted accordingly
  • Orientation goal (only for full 3D): a top-down orientation is more likely to result in a successful grasp than a bottom-up one

Restricting the gripper action probably does not make much sense; I am not sure what the point would be.
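
A small sketch of the growing axis-aligned bounding box for the position goal; the linear interpolation by a progress value in [0, 1] (driven by success rate, average reward, or step count) is only one assumed way of scheduling the growth.

```python
import numpy as np


def workspace_bounds(centre, initial_half_extents, full_half_extents, progress):
    """Axis-aligned bounding box that grows linearly from the initial to the full workspace."""
    progress = float(np.clip(progress, 0.0, 1.0))
    half_extents = ((1.0 - progress) * np.asarray(initial_half_extents)
                    + progress * np.asarray(full_half_extents))
    return np.asarray(centre) - half_extents, np.asarray(centre) + half_extents


def clamp_position_goal(goal, lower, upper):
    """Clamp the commanded position goal into the current workspace bounds.

    The volume in which objects are spawned should be adjusted with the same bounds.
    """
    return np.clip(np.asarray(goal), lower, upper)
```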

Idea/topic 3: Make the environment progressively more difficult

Apply the randomizer (random objects and ground plane) once the simpler scenario is solved.
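
For illustration only, one way to gate the randomization on performance; the configuration keys are hypothetical and do not correspond to the project's actual randomizer options.

```python
def randomizer_config(success_rate, thresholds=(0.6, 0.8)):
    """Enable more randomization as the success rate on the simpler scenario improves."""
    level = sum(success_rate >= threshold for threshold in thresholds)
    return {
        "random_object_models": level >= 1,  # multiple, random objects
        "random_ground_plane": level >= 2,   # randomized ground plane
    }
```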

@AndrejOrsula
Owner Author

The first two points are addressed by #65.

Making the environment progressively more difficult (multiple, random objects) can be done manually, and it also seems to be the easiest way (better performance, and it is easier to determine when to change).
