This is an application of the Deep Deterministic Policy Gradient (DDPG) reinforcement learning algorithm to learn voltage-level control of a two-wheeled differential-drive mobile robot. Normally, voltage-level control is handled by dedicated controllers running at kilohertz rates; higher-level path planners then command the vehicle using velocity or position setpoints.
In this project, I show that a simple MLP with 2 hidden layers, running at 20 Hz, can learn control of a highly nonlinear vehicle. The model operates on a stack of 4 binary 85x48 images (state space dimensionality: 16,320) and outputs a continuous voltage for each motor. The model is tasked with facing a square target visible in the binary image. The reward is the change in heading error: how much closer (or farther) the robot's heading is to pointing exactly at the target compared to the previous step.
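The reward described above can be sketched as follows. This is a minimal illustration, not the repo's actual code; the function names and the angle-wrapping convention are assumptions:

```python
import numpy as np

def heading_error(robot_heading, robot_pos, target_pos):
    """Angle between the robot's heading and the direction to the target."""
    to_target = np.arctan2(target_pos[1] - robot_pos[1],
                           target_pos[0] - robot_pos[0])
    err = to_target - robot_heading
    # Wrap to [-pi, pi] so turning the short way is always preferred
    return (err + np.pi) % (2 * np.pi) - np.pi

def reward(prev_err, curr_err):
    """Positive when the heading error shrinks, negative when it grows."""
    return abs(prev_err) - abs(curr_err)
```

With this shaping, the agent is rewarded each step for reducing its angular offset from the target, and penalized for turning away.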
This repo has 2 main sections:
- DDPG Folder: An existing PyTorch implementation of DDPG, with slight modifications, from this repo.
- Simulator Folder: A custom simulator combining a differential-drive vehicle model, brushed DC motors, and a pinhole camera.
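To give a feel for what the simulator models, here is a hedged sketch of a differential-drive robot driven by motor voltages. All constants are illustrative, and the motor is reduced to a steady-state voltage-to-speed gain; the repo's simulator presumably models the DC motor dynamics (back-EMF, inertia) in more detail:

```python
import numpy as np

# Illustrative constants (not taken from the actual simulator)
WHEEL_RADIUS = 0.03   # m
WHEEL_BASE = 0.15     # m, distance between the wheels
MOTOR_K = 20.0        # rad/s per volt, crude steady-state motor gain

def step(state, v_left, v_right, dt=0.05):
    """One Euler step of differential-drive kinematics from motor voltages.

    state = (x, y, theta). Each wheel's angular velocity is approximated
    as voltage * gain; a full DC motor model would add electrical and
    mechanical dynamics.
    """
    x, y, theta = state
    wl = MOTOR_K * v_left                           # left wheel speed
    wr = MOTOR_K * v_right                          # right wheel speed
    v = WHEEL_RADIUS * (wl + wr) / 2.0              # forward speed
    omega = WHEEL_RADIUS * (wr - wl) / WHEEL_BASE   # turn rate
    return (x + v * np.cos(theta) * dt,
            y + v * np.sin(theta) * dt,
            theta + omega * dt)
```

Equal voltages drive the robot straight, while opposite voltages spin it in place, which is the nonlinearity the policy has to master at the voltage level.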
First, install the dependencies from requirements.txt. This can be done easily with pip:
pip install -r requirements.txt
To train the model from scratch, simply run the train.py file. Results can optionally be logged to Weights & Biases with the --wandb flag.
python train.py
To run a pretrained model, first download the checkpoints from here and extract them into the project directory. Then run all cells in the evaluate.ipynb notebook. At the bottom, a video showing the model controlling the robot should appear.
DDPG is a rather unstable algorithm. This can result in the policy repeatedly converging and then collapsing, as shown in the reward graph below. Interestingly, because the task of facing a target is rather open-ended, the algorithm converges on several distinct policies over the course of training.
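One of DDPG's main defenses against this instability is Polyak averaging of the target networks, which slows down the critic's bootstrap targets. A minimal sketch (the `tau` value and network are placeholders, not the settings used in this repo):

```python
import copy
import torch
import torch.nn as nn

def soft_update(target, source, tau=0.005):
    """Polyak-average source parameters into the target network.

    A small tau makes the critic's bootstrap targets drift slowly,
    damping the divergence/collapse cycles seen in the reward graph.
    """
    with torch.no_grad():
        for t, s in zip(target.parameters(), source.parameters()):
            t.mul_(1.0 - tau).add_(tau * s)

# Toy usage: the target critic starts as a frozen copy of the critic
critic = nn.Linear(4, 1)
target_critic = copy.deepcopy(critic)
soft_update(target_critic, critic, tau=0.005)
```

Even with this mechanism, the actor can still over-exploit a transiently overestimated Q-function, which is consistent with the repeated collapses observed here.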
Initially, the algorithm performs very poorly, as shown in the following video: the vehicle turns quickly away from the target it should face, accumulating a large negative reward. Note that the following videos are best viewed in full screen.
epoch_100.mp4
Around epoch 1800, the first useful policy of moving forward and facing the target emerges. Note that this corresponds to a spike in the above reward graph as well:
epoch_1800.mp4
At epoch 2000, this policy changes slightly, resulting in underdamped control of the system:
epoch_2000.mp4
Near epoch 2400, a new policy emerges, with the robot facing the target and driving backwards:
epoch_2400.mp4
At epoch 3000, underdamped behavior of this backwards policy emerges as well:
