# [Unity ML-Agents — Quick Guide](https://docs.unity3d.com/Packages/com.unity.ml-agents%404.0/manual/ML-Agents-Overview.html)

> **Objective**: Give you a clear, actionable summary of what ML‑Agents is, how it works (essential theory), how to install it **today**, and how to run the **3D Balance Ball sample** end‑to‑end up to the ONNX model. This is a **first draft** to review together.

---

## 1) What is ML‑Agents, in two sentences

**Unity ML‑Agents Toolkit** is a bridge between **Unity** (where your simulation environment with Agents runs) and a **Python trainer** (PPO, SAC, Imitation Learning…) that trains neural networks based on **observations, actions, and rewards**. The training output is an **ONNX model** you can use directly in Unity (inference) without Python.

---

## 2) Key Concepts (Essential Theory)

### Observations

What the agent "sees". Can be **numeric** (vectors: distances, velocities, angles…), **raycasts**, or **visual** (images from cameras). Observations are **from the agent's perspective**, not the global scene state.

### Actions

What the agent can do. **Discrete** (a choice among N actions) or **continuous** (real values, e.g., acceleration/rotation).

### Reward (Reward Signals)

Scalars expressing how well the agent is performing.

* **Extrinsic**: Defined by the environment (reach goal → +1, fall → −1…).
* **Intrinsic** (optional): Generated by the trainer to guide exploration (e.g., **Curiosity**, **RND**) or for imitation (**GAIL**).
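
To make the two kinds of reward concrete, here is a minimal sketch in plain Python (not ML‑Agents code). The `strength` and `gamma` names mirror the reward-signal settings in the trainer YAML; the function names and default values are made up for illustration. Summing the streams per step is a simplification of what the trainer does internally.

```python
def combine_rewards(extrinsic, intrinsic, extrinsic_strength=1.0, intrinsic_strength=0.02):
    """Weighted per-step sum of an extrinsic and an intrinsic reward stream."""
    return [
        extrinsic_strength * e + intrinsic_strength * i
        for e, i in zip(extrinsic, intrinsic)
    ]


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

A lower `gamma` makes the agent care less about rewards far in the future; a lower `strength` shrinks one signal's share of the total.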

### Policy

The **neural network** that maps observations → actions.

* **Training**: The policy is optimized in Python while the simulation runs in Unity.
* **Inference**: The exported policy (ONNX) runs inside Unity at runtime.
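
As a rough picture of "observations in, actions out", the sketch below runs one forward pass through a tiny fully connected network in plain Python. The layer sizes (8 inputs, 2 outputs) match the 3D Balance Ball sample; the hidden width, the ReLU activation, the `tanh` squashing, and all function names are assumptions chosen for illustration, not the architecture ML‑Agents actually builds.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation):
    """Apply one layer: activation(W @ x + b), written with plain lists."""
    weights, biases = layer
    return [
        activation(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def policy(observation, hidden_layer, output_layer):
    """Map an observation vector to continuous actions in [-1, 1]."""
    hidden = dense(observation, hidden_layer, lambda z: max(0.0, z))  # ReLU
    return dense(hidden, output_layer, math.tanh)


rng = random.Random(0)
hidden_layer = make_layer(8, 16, rng)
output_layer = make_layer(16, 2, rng)
actions = policy([0.0] * 8, hidden_layer, output_layer)
```

Training adjusts the weights so that the actions produced for each observation earn more reward; inference only runs the forward pass.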

---

## 3) Ecosystem Components

* **Learning Environment (Unity)**: Your scene with **Agents**, physics, reward logic, and episode end rules.
* **Agent (C#)**: Script defining **CollectObservations**, applying **Actions** in `OnActionReceived`, assigning **Reward**, and handling resets/terminations.
* **Behavior**: Agent parameters set in **Behavior Parameters** (observation/action spaces, Behavior Name, Behavior Type: *Default* / *Heuristic Only* / *Inference Only*).
* **Python Low‑Level API (mlagents_envs)**: Communication channel with Unity.
* **External Communicator**: The "cable" connecting Unity ↔ Python.
* **Python Trainers (mlagents)**: RL/IL algorithms (CLI `mlagents-learn`).
* **Wrappers**: **Gym** and **PettingZoo** integrations for using Unity environments with other algorithms and libraries.
* **Side Channels**: Custom data exchange (e.g., environment parameters, curriculum, randomization).

---

## 4) Training/Inference Modes

* **Built‑in training**: Unity sends observations → Python calculates actions, optimizes policy; after training, export **ONNX** and use it in Unity.
* **Cross‑platform inference**: ONNX runs on all platforms supported by Unity.
* **Custom training**: You can control everything from Python using the low-level API or use Gym/PettingZoo wrappers.
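
For the custom-training route, the sketch below drives an environment with random actions through the low-level `mlagents_envs` API. It assumes `mlagents-envs` is installed and a Unity Editor is waiting in Play mode when `file_name` is `None`; the function name and the step count are illustrative, and details may differ between releases.

```python
def run_random_steps(num_steps=100, file_name=None):
    """Drive a Unity environment with random actions and return the summed reward.

    With file_name=None the call waits for a Unity Editor in Play mode.
    """
    # Imported here so this file can be loaded without mlagents-envs installed.
    from mlagents_envs.environment import UnityEnvironment

    env = UnityEnvironment(file_name=file_name)
    try:
        env.reset()
        behavior_name = list(env.behavior_specs)[0]
        spec = env.behavior_specs[behavior_name]
        total_reward = 0.0
        for _ in range(num_steps):
            decision_steps, terminal_steps = env.get_steps(behavior_name)
            total_reward += float(sum(decision_steps.reward))
            total_reward += float(sum(terminal_steps.reward))
            # One random action per agent that is requesting a decision.
            action = spec.action_spec.random_action(len(decision_steps))
            env.set_actions(behavior_name, action)
            env.step()
        return total_reward
    finally:
        env.close()
```

Replacing the `random_action` call with your own policy is where a custom algorithm would plug in.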

---

## 5) Training Scenarios

* Standard **Single‑Agent**.
* **Simultaneous Single‑Agent**: Multiple copies of the same agent in parallel (same Behavior) for stability and speed.
* **Adversarial Self‑Play**: Agents training against historical versions of each other (PPO recommended).
* **Cooperative Multi‑Agent**: Shared rewards; **MA‑POCA** support (centralized credit assignment) even with self‑play.
* **Competitive Multi‑Agent** or **Ecosystem**: Conflicting objectives, different species/roles.
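
The idea behind adversarial self‑play can be sketched as an opponent picker: sometimes play the newest snapshot, otherwise draw an older one from a recent window. The parameter names are loosely modeled on the self-play settings in the trainer configuration; the function itself is a hypothetical illustration, not the trainer's implementation.

```python
import random


def pick_opponent(snapshots, latest_ratio=0.5, window=10, rng=random):
    """Choose which saved policy snapshot the learning agent plays against."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    # With probability latest_ratio, face the most recent snapshot.
    if rng.random() < latest_ratio:
        return snapshots[-1]
    # Otherwise sample uniformly from the last `window` snapshots.
    return rng.choice(snapshots[-window:])
```

Mixing in older opponents keeps the agent from overfitting to its most recent self.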

---

## 6) Training Methods (Environment‑Agnostic)

* **RL**:

  * **PPO** (default): Robust, general‑purpose.
  * **SAC**: Off‑policy, very sample‑efficient, ideal for continuous actions and slow environments; uses a replay buffer.
* **Intrinsic rewards**: **Curiosity** and **RND** to guide exploration in sparse reward environments.
* **Imitation Learning**:

  * **BC** (Behavioral Cloning): Exact replication of demonstrations.
  * **GAIL**: Adversarial reward for "more similar to demos"; often combined with extrinsic + BC.

> You can combine RL + BC + GAIL (e.g., start from demos to unlock difficult environments, then fine‑tune with RL).

---

## 7) Tools for Realistic Environments

* **Curriculum Learning**: Gradually introduce difficulty (scene parameters that evolve with performance).
* **Environment Parameter Randomization** (Domain Randomization): Randomly sample environment parameters to make the agent more **robust** and generalize better.
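
Both techniques come down to choosing environment parameters before each episode. The sketch below shows the logic in plain Python under simplifying assumptions: lessons advance when mean reward reaches a threshold, and randomization draws from a uniform range. The class, the function, and the sample values are hypothetical; in ML‑Agents this is configured in the trainer YAML rather than hand-coded.

```python
import random


class Curriculum:
    """Advance to the next lesson once mean reward clears the current threshold."""

    def __init__(self, lessons):
        # lessons: list of (parameter_value, reward_threshold) pairs;
        # the final lesson's threshold is ignored and may be None.
        self.lessons = lessons
        self.index = 0

    @property
    def value(self):
        return self.lessons[self.index][0]

    def update(self, mean_reward):
        """Move on if the threshold is met, then return the active value."""
        threshold = self.lessons[self.index][1]
        is_last = self.index == len(self.lessons) - 1
        if not is_last and threshold is not None and mean_reward >= threshold:
            self.index += 1
        return self.value


def sample_uniform(low, high, rng=random):
    """Domain randomization: draw a fresh parameter value for each episode."""
    return rng.uniform(low, high)
```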

---

## 8) Supported Model Types

* **Vector / Raycast**: MLP fully‑connected (configurable: hidden units, layers).
* **Visual**: CNN (simple encoder, DQN‑style, **IMPALA ResNet**) with multiple cameras per agent.
* **Variable‑length obs**: **Attention** for dynamic lists of entities.
* **Memory**: **LSTM** for partial observability and context‑dependent decisions.

---

## 9) Installation (Release 23: Recommended Setup)

> Target: Unity **6000.0+** · Python **3.10.12** · Unity package `com.unity.ml-agents` **4.0.0** · Python packages `mlagents==1.1.0`, `mlagents-envs==1.1.0`.

### 9.1 Unity

1. Install **Unity 6000.x** via **Unity Hub**.
2. Create a 3D project (e.g., *MLA_TestProject*).
3. **Package Manager** → **Add package by name…** → `com.unity.ml-agents` (enable *Show preview packages* if needed).

   * **Or (Development)**: Clone the repo and **Add package from disk…** pointing to `com.unity.ml-agents/package.json`.

### 9.2 Python (Recommended with Conda)

```bash
# Create a dedicated environment
conda create -n mlagents_py310 python=3.10.12 -y
conda activate mlagents_py310

# Install versions compatible with Release 23
python -m pip install --upgrade pip
pip install "mlagents==1.1.0" "mlagents-envs==1.1.0"
# If grpcio fails on Windows:
# conda install "grpcio=1.48.2" -c conda-forge

# Verify
mlagents-learn --help
```

> **Windows/GPU Note (Optional)**: For CUDA 12.1, install `torch~=2.2.1` from the PyTorch CUDA 12.1 wheel index (`https://download.pytorch.org/whl/cu121`) before installing ML‑Agents. For CPU‑only use, `pip` pulls in a supported `torch` as a dependency of `mlagents`.
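
Beyond `mlagents-learn --help`, a short script can confirm which versions ended up in the active environment. This sketch uses only the standard library; the function name and the default package list are choices made for this example.

```python
from importlib import metadata


def installed_versions(packages=("mlagents", "mlagents-envs", "torch")):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Compare the printed versions with the targets listed at the top of this section.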

### 9.3 (Dev Option) Install from Source

```bash
git clone --branch release_23 https://github.com/Unity-Technologies/ml-agents.git
cd ml-agents
python -m pip install ./ml-agents-envs
python -m pip install ./ml-agents
mlagents-learn --help
```

For live edits to Python packages:

```bash
pip install -e ./ml-agents-envs
pip install -e ./ml-agents
```

---

## 10) Run the **3D Balance Ball** Sample (End‑to‑End)

### 10.1 Open the Sample Scene

* Open the `ml-agents/Project` project in Unity.
* Scene: `Assets/ML-Agents/Examples/3DBall/Scenes/3DBall.unity`.
* Each platform has an **Agent** with **Behavior Name = 3DBall**, **Obs size = 8**, **Continuous Actions = 2**.
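
As far as I can tell from the sample's agent script (check `Ball3DAgent.cs` to confirm), the 8 observation values are two platform rotation terms, the ball's position relative to the platform, and the ball's velocity. The sketch below assembles such a vector in plain Python; the function and argument names are hypothetical.

```python
def ball3d_observation(platform_rot_z, platform_rot_x, ball_pos, platform_pos, ball_velocity):
    """Assemble 8 values: 2 rotation terms + 3 relative position + 3 velocity."""
    relative = [b - p for b, p in zip(ball_pos, platform_pos)]
    return [platform_rot_z, platform_rot_x, *relative, *ball_velocity]


obs = ball3d_observation(
    0.1, -0.2,
    ball_pos=(1.0, 4.0, 1.0),
    platform_pos=(1.0, 0.0, 1.0),
    ball_velocity=(0.0, -1.0, 0.0),
)
```

The length of this list is what **Obs size** in Behavior Parameters has to match.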

### 10.2 Try a **Pre‑trained Model** (Inference)

1. In the **Project** window, open `Assets/ML-Agents/Examples/3DBall/TFModels/`.
2. Select the Agent → **Behavior Parameters**:

   * **Behavior Type**: *Inference Only*.
   * **Model**: Drag and drop the pre‑trained `3DBall.onnx`.
   * **Inference Device**: CPU.
3. **Play**: The platforms balance the ball.

### 10.3 **Training from Scratch** (PPO/SAC)

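1. In a terminal with the Conda environment active, from the root of the cloned `ml-agents` repository (the config paths below are relative to it):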

   ```bash
   # PPO recommended to converge quickly on 3DBall
   mlagents-learn config/ppo/3DBall.yaml --run-id=bb_ppo_01
   # Or SAC
   # mlagents-learn config/sac/3DBall.yaml --run-id=bb_sac_01
   ```
2. When "**Listening on port …**" appears, go back to Unity.
3. **Behavior Type**: *Default* (training). **Model**: Empty (no ONNX assigned).
4. Press **Play**. In the terminal, you'll see **Mean Reward** increasing.

### 10.4 Monitor with **TensorBoard** (Optional)

```bash
tensorboard --logdir results
```

Open `http://localhost:6006` → follow *Environment/Cumulative Reward*.

### 10.5 Export and Use the **ONNX Model**

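* Stop with **Ctrl+C** (or let the run reach `max_steps`): the trainer saves the final model as `results/<run-id>/<behavior name>.onnx`, for example `results/bb_ppo_01/3DBall.onnx`; intermediate checkpoints are kept under `results/<run-id>/<behavior name>/`.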
* Copy the file to `Assets/ML-Agents/Examples/3DBall/TFModels/`.
* In **Behavior Parameters** set:

  * **Behavior Type**: *Inference Only*.
  * **Model**: Your `.onnx`.
  * **Play** to test the behavior **trained by you**.
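
Since the exact location of the exported file can vary between releases, one option is to search the run folder for `.onnx` files and copy the one you want. This is a standard-library sketch; the function names are illustrative and the paths in the usage note are the ones used in this guide.

```python
import shutil
from pathlib import Path


def find_onnx_models(run_dir):
    """Return every .onnx file under a run directory, newest first."""
    return sorted(
        Path(run_dir).rglob("*.onnx"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )


def copy_model(model_path, dest_dir):
    """Copy one model file into a Unity asset folder and return the new path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(model_path, dest))
```

For example, `find_onnx_models("results/bb_ppo_01")` lists the candidates, and `copy_model(path, "Project/Assets/ML-Agents/Examples/3DBall/TFModels")` places one where Unity will import it.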

> **Resume**: To continue an interrupted run, rerun the same command with `--resume`. If the run‑id already exists and you want to start over, use `--force` or change `--run-id`.
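
If you launch runs from a script, the resume/force choice can be captured in a small helper that builds the argument list. The flags are the ones named above; the helper itself is hypothetical, and rejecting `resume` together with `force` is an assumption of this sketch, made because the two express opposite intents.

```python
def build_learn_command(config_path, run_id, resume=False, force=False):
    """Build the argument list for mlagents-learn."""
    if resume and force:
        raise ValueError("choose either resume or force, not both")
    command = ["mlagents-learn", config_path, f"--run-id={run_id}"]
    if resume:
        command.append("--resume")  # continue from the run's checkpoints
    if force:
        command.append("--force")  # overwrite the existing run
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root.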

---

## 11) What Defines the **Task** vs. the **Training**

* **Task** (what to learn): lives in the Agent's **C# code** and the scene (observations, actions, rewards, reset).
* **Training** (how to learn): lives in the **YAML file** (algorithm, hyperparameters, `max_steps`, `summary_freq`, reward signals, etc.).
* **Output**: The trained **policy** in **ONNX** format (inference only). To re‑train, you need the **checkpoints** (not the ONNX).
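
The link between the two sides is the Behavior Name: the YAML file keys its settings by that name. The sketch below represents a parsed configuration as a Python dict and checks that the scene's Behavior Names have entries. The values are illustrative rather than copied from the shipped `3DBall.yaml`, and the helper is hypothetical.

```python
# Shape of a trainer config after YAML parsing (illustrative values only).
config = {
    "behaviors": {
        "3DBall": {
            "trainer_type": "ppo",
            "max_steps": 500000,
            "summary_freq": 12000,
            "reward_signals": {"extrinsic": {"gamma": 0.99, "strength": 1.0}},
        }
    }
}


def missing_behaviors(config, scene_behavior_names):
    """Return scene Behavior Names with no entry under config['behaviors']."""
    configured = config.get("behaviors", {})
    return [name for name in scene_behavior_names if name not in configured]
```

An empty result means every Behavior Name in the scene will pick up the settings you wrote for it.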

---

## 12) Tips & Pitfalls

* **Behavior Name** in Unity must match the one in the YAML file.
* During **training**, leave **Model** empty (no ONNX assigned) and **Behavior Type = Default**.
* If `mlagents-learn` says the run‑id exists, use `--resume`, `--force`, or a new `--run-id`.
* On Windows, if `grpcio` fails to build: run `conda install "grpcio=1.48.2" -c conda-forge`, then reinstall the ML‑Agents packages.
* For complex environments (sparse rewards), consider **Curiosity**/**RND** and/or **BC/GAIL** with demos.

---

## 13) Quick Glossary

* **Agent**: GameObject with a script that observes, acts, and receives rewards.
* **Behavior**: Agent's decision parameters (spaces, name, type).
* **Policy**: Neural network mapping observations → actions.
* **Trainer**: Python process optimizing the policy (PPO, SAC, or MA‑POCA, optionally combined with GAIL/BC).
* **ONNX**: Exported model for inference in Unity.
* **Checkpoint**: Trainable state of the trainer for `--resume`.
* **Curriculum / Randomization**: Techniques for progressive difficulty and robustness.