
DeepUnity

In development - does not currently accept Pull Requests, though feel free to Fork and expand upon it


DeepUnity is an add-on framework that provides tensor computation [with GPU acceleration support] and deep neural networks, along with reinforcement learning tools for training intelligent agents within Unity environments using Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3).

Run your first DeepUnity script

using UnityEngine;
using DeepUnity;
using DeepUnity.Optimizers;
using DeepUnity.Activations;
using DeepUnity.Modules;
using DeepUnity.Models;

public class Tutorial : MonoBehaviour
{
    [SerializeField] private Sequential network;
    private Optimizer optim;
    private Tensor x;
    private Tensor y;

    public void Start()
    {
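        // Build a small MLP, save it as a project asset, and set up the optimizer and random training data.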
        network = new Sequential(
            new Dense(512, 256),
            new ReLU(),
            new Dropout(0.1f),
            new Dense(256, 64, device: Device.GPU),
            new LayerNorm(),
            new ReLU(),
            new Dense(64, 32)).CreateAsset("TutorialModel");
        
        optim = new Adam(network.Parameters());
        x = Tensor.RandomNormal(64, 512);
        y = Tensor.RandomNormal(64, 32);
    }

    public void Update()
    {
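        // One training iteration per frame: forward pass, MSE loss, backpropagation, parameter update.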
        Tensor yHat = network.Forward(x);
        Loss loss = Loss.MSE(yHat, y);

        optim.ZeroGrad();
        network.Backward(loss.Gradient);
        optim.Step();

        print($"Epoch: {Time.frameCount} - Train Loss: {loss.Item}");
        network.Save();
    }
}
Digits generated by a Generative Adversarial Network (GAN) trained on the MNIST dataset.

digits

Image reconstruction by a Variational Auto-Encoder (VAE) trained on the MNIST dataset (originals on the first row, reconstructions on the second).

digits

Reinforcement Learning

To work with the reinforcement learning tools, you must create a 2D or 3D agent using Unity-provided GameObjects and Components. The setup flow works similarly to ML-Agents: create a new behaviour script (e.g. ReachGoal) that inherits the Agent class and attach it to the agent GameObject (a DecisionRequester script is attached automatically) [optionally, a TrainingStatistics script can be attached as well]. Choose the space size and the number of continuous/discrete actions, then override the following methods in the behaviour script:

  • CollectObservations()
  • OnActionReceived()
  • Heuristic() [Optional]
  • OnEpisodeBegin() [Optional]

To define the reward function and the episode's terminal state, use the following calls:

  • AddReward(reward)
  • SetReward(reward)
  • EndEpisode()

When the setup is ready, press the Bake button; a behaviour, along with all neural network and hyperparameter assets, is created inside a folder named after the behaviour in the Assets/ folder. From this point everything is ready to go.

To get into advanced training, check out the following assets created:

  • Behaviour lets you choose between a fixed and a trainable standard deviation for continuous actions. Inference and training devices can also be set (set both to CPU if your machine lacks a graphics card). TargetFPS modifies the physics update rate and is equal to 1 / Time.fixedDeltaTime (default: 50).
  • Config provides all hyperparameters necessary for a custom training session.

Behaviour script overriding example

using UnityEngine;
using DeepUnity.ReinforcementLearning;

public class MoveToGoal : Agent
{
    public Transform apple;

    public override void OnEpisodeBegin()
    {
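        // Randomize the apple's and the agent's positions at the start of each episode.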
        float xrand = Random.Range(-8, 8);
        float zrand = Random.Range(-8, 8);
        apple.localPosition = new Vector3(xrand, 2.25f, zrand);
        
        xrand = Random.Range(-8, 8);
        zrand = Random.Range(-8, 8);
        transform.localPosition = new Vector3(xrand, 2.25f, zrand);
    }

    public override void CollectObservations(StateVector sensorBuffer)
    {
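        // Four observations (space size 4): the agent's and the apple's x/z local positions.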
        sensorBuffer.AddObservation(transform.localPosition.x);
        sensorBuffer.AddObservation(transform.localPosition.z);
        sensorBuffer.AddObservation(apple.transform.localPosition.x);
        sensorBuffer.AddObservation(apple.transform.localPosition.z);
    }

    public override void OnActionReceived(ActionBuffer actionBuffer)
    {
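        // Two continuous actions: movement along the x and z axes.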
        float xmov = actionBuffer.ContinuousActions[0];
        float zmov = actionBuffer.ContinuousActions[1];

        transform.position += new Vector3(xmov, 0, zmov) * Time.fixedDeltaTime * 10f;
        AddReward(-0.0025f); // Step penalty
    } 

    private void OnTriggerEnter(Collider other)
    {
        if (other.CompareTag("Apple"))
        {
            SetReward(1f);
            EndEpisode();
        }
        if (other.CompareTag("Wall"))
        {
            SetReward(-1f);
            EndEpisode();
        }
    }
}
This example features an agent (with a space size of 4 and 2 continuous actions) positioned in the middle of an arena; it moves forward, backward, left or right and must reach a randomly positioned goal (see the GIF below). The agent is rewarded with 1 point if it touches the apple and penalized with 1 point if it falls off the floor; in both cases the episode ends.

reacher

Tips

  • Parallel training is one way to use your device at maximum efficiency. After placing your agent inside an Environment GameObject, you can duplicate that environment several times across the scene before starting the training session; this method is also necessary for multi-agent co-op or adversarial training. Note that DeepUnity dynamically adapts the simulation timescale to get the maximum efficiency out of your machine.

  • To make proper use of AddReward() and EndEpisode(), consult the diagram below. These methods work well when called inside OnTriggerXXX() or OnCollisionXXX(), as well as inside OnActionReceived() right after the actions are performed.

  • High Decision Period values increase the overall throughput of the training session but reduce the agent's inference accuracy. Typically, use a higher value with many parallel environments, then decrease it to 1 to fine-tune the agent.

  • Input normalization plays a huge role in policy convergence. Observations can be auto-normalized by checking the corresponding box inside the behaviour asset, but it is highly recommended to manually normalize all input values before adding them to the SensorBuffer (see the first sketch after these tips). Scalar values can be normalized to the [0, 1] or [-1, 1] range using the formula normalized_value = (value - min) / (max - min). Note that inputs are clipped for network stability (default [-5, 5]).

  • The following MonoBehaviour methods are virtual: Awake(), Start(), FixedUpdate(), Update() and LateUpdate(). If you need to override them, always call their base implementation (see the second sketch after these tips), respecting the logic of the diagram below.
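
A minimal sketch of manual observation normalization for the MoveToGoal agent above. The MoveToGoalNormalized class name, the Normalize helper and the [-8, 8] arena bounds are illustrative assumptions, not part of the framework API; only Agent, StateVector and AddObservation come from the example above.

using UnityEngine;
using DeepUnity.ReinforcementLearning;

public class MoveToGoalNormalized : Agent
{
    public Transform apple;

    // Arena bounds used for normalization (assumed to match the Random.Range(-8, 8) spawn area above).
    private const float MIN = -8f;
    private const float MAX = 8f;

    // Maps a value from [min, max] into [-1, 1] via (value - min) / (max - min), then rescaling.
    private static float Normalize(float value, float min, float max)
    {
        return 2f * (value - min) / (max - min) - 1f;
    }

    public override void CollectObservations(StateVector sensorBuffer)
    {
        // Same four observations as before, scaled into [-1, 1] before being added.
        sensorBuffer.AddObservation(Normalize(transform.localPosition.x, MIN, MAX));
        sensorBuffer.AddObservation(Normalize(transform.localPosition.z, MIN, MAX));
        sensorBuffer.AddObservation(Normalize(apple.localPosition.x, MIN, MAX));
        sensorBuffer.AddObservation(Normalize(apple.localPosition.z, MIN, MAX));
    }
}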

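And a minimal sketch of overriding one of the virtual MonoBehaviour methods while calling its base implementation, as described in the last tip. It assumes FixedUpdate() is exposed as a public virtual method on Agent, per the statement above; the body of the override is purely illustrative.

using UnityEngine;
using DeepUnity.ReinforcementLearning;

public class MyAgent : Agent
{
    public override void FixedUpdate()
    {
        // Run the base Agent logic first so decision requests and action execution keep working.
        base.FixedUpdate();

        // Custom per-physics-step logic goes here (illustrative).
        Debug.DrawRay(transform.position, transform.forward, Color.green);
    }
}
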
Training on built application for faster inference

  • Training inside the Editor is a bit more cumbersome compared to the built version. Building the application and launching it to run the training enables faster inference, and the framework was adapted for this.

  • Whenever you want to stop the training, close the .exe file. The trained behaviour is automatically saved and serialized in .json format on your desktop. Go back into Unity, check your behaviour asset, and press the new button that appears to overwrite the editor behaviour with the trained weights from the .json file.

  • The previous built app, along with the trained weights in .json format, are then disposable (remove them and replace the build with a new one).

Base Agent class - order of execution for event functions

agentclass

All tutorial scripts are included in the Assets/DeepUnity/Tutorials folder, covering all features provided by the framework as well as RL environments inspired by the ML-Agents examples (note that not all of them have trained models attached).

Sorter agent whose task is to visit the tiles in ascending order

sorter

These crawlers are training to scrape over the internet

crawlers

Walkers are joining all other training parties

Disney Robots are on the way

robot

A paper describing how to implement deep neural nets, PPO, SAC, DDPG and TD3 from scratch will be released this spring...