
In-hand manipulation with proximal policy optimization (PPO)

This project trains the Allegro hand for in-hand manipulation in simulation. The robotic Allegro hand is modeled on human anatomy, and its fingertips measure haptic pressure via tactile sensing. The real hand is actuated with fluids, which are hard to model; this motivates model-free reinforcement learning. In this project the Allegro hand learns to find a stabilizing grasp for a small bar. The project defines two experimental setups. The first one aims to find a reward function definition that enables training progression. The second setup simulates a trajectory optimization in the form of noise.

framework overview

  • tensorflow 1.x for machine learning
  • pybullet for the Allegro hand simulation
  • gym(.Env) as the superclass for the hand environments (a minimal skeleton is sketched below)
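As a rough, illustrative skeleton (not the project's actual class), a hand environment built on gym.Env and pybullet could look like this; the observation size is a placeholder and only the 16-joint action dimension reflects the Allegro hand:

```python
import gym
import numpy as np
from gym import spaces


class AllegroHandSketch(gym.Env):
    """Illustrative gym.Env skeleton; allegro_env defines the real spaces and pybullet wiring."""

    def __init__(self):
        # 16 actuated joints on the Allegro hand; the observation size is a placeholder.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(16,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(32,), dtype=np.float32)

    def reset(self):
        # Would reset the pybullet world and sample from the initial state distribution.
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        # Would apply joint targets in pybullet and compute the grasp reward.
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info
```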

packages overview

ausy_base:

  1. Learning script
  2. PPO implementation (its clipped objective is sketched after this list)
  3. Policy and value function model
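For reference, the heart of PPO is the clipped surrogate objective. Below is a minimal TF 1.x-style sketch of that loss; the placeholder tensors and the clipping range of 0.2 are assumptions, and the ausy_base implementation may structure this differently (the log-probabilities would normally come from the policy model rather than placeholders).

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

clip_eps = 0.2  # assumed clipping range

advantage = tf.placeholder(tf.float32, [None], name="advantage")
log_prob_new = tf.placeholder(tf.float32, [None], name="log_prob_new")  # from the current policy
log_prob_old = tf.placeholder(tf.float32, [None], name="log_prob_old")  # from the data-gathering policy

ratio = tf.exp(log_prob_new - log_prob_old)
clipped_ratio = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
# PPO maximizes the elementwise minimum of the clipped and unclipped terms.
surrogate_loss = -tf.reduce_mean(tf.minimum(ratio * advantage, clipped_ratio * advantage))
```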

allegro_pybullet:

  1. Meshes for the hand and the grasping object
  2. Pybullet interface and manipulation interface for the hand (loading is sketched after this list)
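As a hedged illustration, loading a hand model into pybullet follows the usual URDF workflow; the file name, pose, and control targets below are assumptions rather than the package's actual interface.

```python
import pybullet as p

client = p.connect(p.DIRECT)  # headless; use p.GUI for a rendered window
p.setGravity(0, 0, -9.81)

# Hypothetical URDF path; allegro_pybullet ships its own meshes.
hand_id = p.loadURDF("allegro_hand.urdf", basePosition=[0, 0, 0.2], useFixedBase=True)

num_joints = p.getNumJoints(hand_id)
# Drive all joints towards a target configuration and advance the simulation.
p.setJointMotorControlArray(hand_id, list(range(num_joints)), p.POSITION_CONTROL,
                            targetPositions=[0.0] * num_joints)
p.stepSimulation()
```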

hand_env:

  1. Gym classes for the Allegro hand (they define the state and action spaces; a rollout sketch follows this list)
    1. Grasp-learning hand (allegro_env)
    2. Grasp-learning hand under the influence of noise (noisy_hand)
    3. Grasp-learning hand which stacks a previously trained policy (trained_env)
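The environments follow the standard gym rollout interface; a sketch (the import path and class name are assumptions about the package layout):

```python
from hand_env.allegro_env import AllegroEnv  # hypothetical import path

env = AllegroEnv()                              # grasp-learning hand
obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()          # random policy as a stand-in
    obs, reward, done, info = env.step(action)  # old gym API: 4-tuple return
    if done:
        obs = env.reset()
env.close()
```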

setting_utils:

  1. controls the experiment parameters
  2. holds the positions of the initial state distribution
  3. defines the reward functions (a sketch in the spirit of weighted_or_dist follows this list)
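Purely as an illustration of a weighted orientation/distance reward in the spirit of weighted_or_dist (the terms and weights are assumptions; the real definitions live in setting_utils):

```python
import numpy as np


def weighted_or_dist_sketch(bar_pos, bar_orn, target_pos, target_orn,
                            w_dist=1.0, w_orient=0.5):
    """Penalize the bar drifting or rotating away from a target pose."""
    dist_err = np.linalg.norm(np.asarray(bar_pos) - np.asarray(target_pos))
    # Quaternion similarity: |<q1, q2>| equals 1 when the orientations match.
    orient_err = 1.0 - abs(float(np.dot(np.asarray(bar_orn), np.asarray(target_orn))))
    return -(w_dist * dist_err + w_orient * orient_err)
```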

performance_analysis:

  1. holds code to create graphs of the training progression, etc.

quick start (currently unavailable; I do not have any license information about the Allegro hand implementation)

  1. install the requirements (e.g. via pip from the requirements file)
  2. adapt the hard-coded logging folder in setting_utils.paramhandler
  3. run learn_lower with gui=True
  4. wait for learning progression
    1. You can measure the learning progression by the "number of trajectories" per training-data gathering cycle. You can inspect this variable via TensorBoard (see the logging sketch after this list).
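A minimal sketch of how such a per-cycle scalar can be written for TensorBoard with TF 1.x summaries (the tag, values, and output folder are assumptions; the project wires this up through setting_utils):

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

writer = tf.summary.FileWriter("documents/tb/experiment_0")  # one folder per experiment
for iteration in range(3):
    n_trajectories = 42  # would be counted during the data-gathering cycle
    summary = tf.Summary(value=[tf.Summary.Value(tag="number_of_trajectories",
                                                 simple_value=n_trajectories)])
    writer.add_summary(summary, global_step=iteration)
writer.flush()
writer.close()
```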

implementation notes

  • setting_utils sets up the TensorBoard logging
  • the project only ran on local machines -> no CLI arguments; alter the source code directly
  • TensorFlow 1.x requires messy model saving/loading handling (see the sketch after this list)
  • files saved per run (default output: documents/tb, with a new folder per experiment):
    1. progress.csv with the training progress
    2. a config file with the parameter settings
  • pybullet runs in a single thread; the easiest way to speed up the experiments is to run several of them in parallel
  • the show_finger option of the envs visualizes the tactile pressure, but slows the simulation down considerably
  • because of pybullet, you either render every time step or none at all
  • many TensorFlow warnings are thrown because of the project's age
  • policy loading and saving is mostly managed via naming conventions
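For context, the TF 1.x save/restore pattern that those naming conventions have to work around looks roughly like this; the scope names and checkpoint path are assumptions:

```python
import os

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

with tf.variable_scope("policy"):
    obs = tf.placeholder(tf.float32, [None, 32], name="obs")
    action = tf.layers.dense(obs, 16, name="action_head")

saver = tf.train.Saver()  # collects all variables of the current graph

os.makedirs("documents/tb/experiment_0", exist_ok=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "documents/tb/experiment_0/policy.ckpt")
    # Restoring later requires rebuilding the exact same graph (same scopes and
    # shapes), which is why consistent naming conventions matter here.
    saver.restore(sess, "documents/tb/experiment_0/policy.ckpt")
```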

model notes

A 2-hidden-layer MLP (64 units per layer), used for both the value function and the policy.
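A minimal TF 1.x sketch of such an MLP (the tanh activation, scope names, and input/output sizes are assumptions):

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()


def mlp(inputs, output_dim, scope):
    """Two hidden layers of 64 units each, as described above."""
    with tf.variable_scope(scope):
        h1 = tf.layers.dense(inputs, 64, activation=tf.tanh, name="hidden1")
        h2 = tf.layers.dense(h1, 64, activation=tf.tanh, name="hidden2")
        return tf.layers.dense(h2, output_dim, activation=None, name="out")


obs = tf.placeholder(tf.float32, [None, 32], name="obs")  # placeholder observation size
policy_out = mlp(obs, output_dim=16, scope="policy")      # e.g. one output per joint
value_out = mlp(obs, output_dim=1, scope="value_function")
```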

How to enable logging in TensorBoard

Run in your command line: tensorboard --logdir=path/to/tensorboard/data
With the fixed default output folder, "path/to/tensorboard/data" is "~/documents/tb".

Results

After around 600 training and data gathering iterations the policy learns a stabilizing grasp. The following images show a policy trained with the reward function weighted_or_dist.

[Images: stable position b, stable position c, stable position d]

The second experiment shows how a simulation of the upper trajectory optimization influences the ability to learn the stabilizing grasp. The graph shows the training progression under the influence of noise, across 20 training sessions, each of which has a different noise input (the simulated upper trajectory optimization signal). The noise magnitudes are spaced between 0.1 and 20 in a geometric progression. The color of the lines indicates the noise magnitude: the higher the noise, the darker the line, so a gradient from bright (at the bottom) to dark (at the top) appears in the graph.
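Assuming the 20 magnitudes include both endpoints, the geometric spacing can be reproduced with numpy:

```python
import numpy as np

# 20 noise magnitudes spaced geometrically between 0.1 and 20
noise_levels = np.geomspace(0.1, 20.0, num=20)
```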

Each iteration has a fixed number of simulation steps. The lower the number of trajectories, the more stable the grasp of the bar. The graph indicates the relation between the noise magnitude and the agent's capability to learn a stabilizing policy.
