<h1 style="text-align: center; vertical-align: middle;">
    <span style="color: #B74F3D;"> 3rd Reinforcement Learning for Autonomous Accelerators workshop tutorial</span>
    <span style="color: #666666;">: Beam Transverse Steering at ARES Linear Accelerator</span>
</h1>

<div style="text-align: center;">
    <img src="../img/rl4aa_logo.png" alt="RL4AA Logo" style="max-width: 100%; height: auto;">
</div>

<h2 style="color: #b51f2a">Getting started</h2>

- You will need **Python 3.12 or higher** to run this code &#x2757;
- You will require about **1 GB of free disk space** &#x2757;
- Make sure you have Git installed in your terminal &#x2757;


Start by cloning the tutorial repository locally by running this command in your terminal:

```bash
git clone https://github.com/RL4AA/rl4aa25-tutorial.git
```

<h2 style="color: #b51f2a">Installing virtual environment</h2>

### Using Conda

- If you don't have conda installed already, you can install `miniconda` as [described here](https://docs.conda.io/projects/miniconda/en/latest/miniconda-install.html).
- We recommend installing `miniconda` the day beforehand to avoid network overload during the tutorial &#x2757; &#x2757;

Once `miniconda` is installed, run this command in your terminal:

```bash
conda env create -f environment.yml
```

This should create a virtual environment named `rl25-tutorial` and install the necessary packages inside.

Afterwards, activate the environment using

```bash
conda activate rl25-tutorial
```


<h2 style="color: #b51f2a">Installing virtual environment</h2>

### Using venv

_If you don't have conda installed:_

Alternatively, you can create the virtual env with

```bash
python3 -m venv rl-tutorial
```

and activate the env with `source <venv>/bin/activate` (bash) or `<venv>\Scripts\activate.bat` (Windows), where `<venv>` is the environment directory (`rl-tutorial` in the command above).

Then, install the packages with `pip` within the activated environment

```bash
python -m pip install -r requirements.txt
```

Afterwards, you should be able to run the provided scripts.

<h2 style="color: #b51f2a">Check your installation</h2>
If your virtual environment is set up correctly and activated, you should be able to run the next cell without any errors:

In [None]:
import sys

sys.path.append("..")

import yaml
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
from gymnasium.wrappers import RescaleAction
from IPython.display import clear_output, display, Latex, HTML, Markdown

from src.environments import ea
from src.environments.ea_auxiliary import make_eval_env
from src.utils import evaluate_mae

<h2 style="color: #b51f2a"> ARES (Accelerator Research Experiment at SINBAD)</h2>

ARES is an S-band radio frequency linac at the DESY Hamburg site equipped with a photoinjector and two independently driven traveling wave accelerating structures. The main research focus is the generation and characterization of sub-femtosecond electron bunches at relativistic particle energy. The generation of short electron bunches is of high interest for radiation generation, e.g. by free electron lasers.

<img src="../img/ARES_layout.png" style="width:100%; margin:auto;"/>

- **Final energy**: 100-155 MeV
- **Bunch charge**: 0.01-200 pC
- **Bunch length**: 30 fs - 1 ps
- **Pulse repetition rate**: 1-50 Hz


<h2 style="color: #b51f2a">The accelerator problem we want to solve</h2>

We would like to focus and center the electron beam on a diagnostic screen using corrector and quadrupole magnets.

<img src="../img/ares_magnets.png" style="width:70%; margin:auto;"/>

<h2 style="color: #b51f2a">Formulating the RL problem</h2>
<h3>Overview of our study case</h3>
<img src="../img/ares_rl_problem.png" style="width:70%; margin:auto;"/>

<h3 style="color:#038aa1;">Discussion</h3>
<p style="color:#038aa1;"> $\implies$  Is the action space continuous or discrete? </p>
<p style="color:#038aa1;"> $\implies$  Is the problem fully observable or partially observable?</p>

<h2 style="color: #b51f2a">Formulating the RL problem</h2>
<h3>Actions</h3>

<div class="row">
    <div class="column" style="width:60%;float:left">
        <p>In the ARES transverse tuning task we have 3 quadrupoles and 2 corrector magnets</p>
        <p>The actions are:
            <ul>
            <li><b>Quadrupole magnet strength</b> $k_{1,2,3}$ $[1/m^2]$</li>
            <li><b>Corrector deflection angle</b> $\theta_\mathrm{v, h}$ $[mrad]$ (vertical and horizontal)</li>
            </ul>
        </p>
        <p>In our control system we can set these derived values directly according to the beam energy</p>
        <p>$\implies$ <code>actions</code> $=[k_{\mathrm{Q1}},k_{\mathrm{Q2}},\theta_\mathrm{CV},k_{\mathrm{Q3}},\theta_\mathrm{CH}]$</p>
            <p>is a 5-dimensional array</p>
    </div>
    <div class="column" style="width:40%;float:right">
        <img src="../img/dipole.png" style="width:50%; margin:auto;"/>
        <img src="../img/quads.png" style="width:35%; margin:auto;"/>
    </div>
</div>


<h2 style="color: #b51f2a">Formulating the RL problem</h2>
<h3>Observation / state</h3>

<div class="row">
    <div class="column" style="width:50%;float:left">
        <p>The observation is the information an agent receives about the current state of the environment.</p>
        <p>It should provide enough information so that the agent can solve this problem.</p>
        <p>The observation does not necessarily cover the entire (internal) state of the environment.</p>
        <h3 style="color:#038aa1;">Discussion</h3>
        <p style="color:#038aa1;"> $\implies$ What should be included in the <code>observation</code>?  </p>
        <p style="color:#038aa1;"> $\implies$ What can be observed in the simulation? </p>
        <p style="color:#038aa1;"> $\implies$ What cannot be observed in the real world? </p>
        <p style="color:#038aa1;"> $\implies$ How does this relate to the environment? </p>
    </div>
    <div class="column" style="width:50%;float:right">
      <img src="../img/screen_2.png" style="width:40%; margin:auto;"/>
      <p style="clear:both; font-size: small; text-align: center; margin-top:1em;">
          The screen is made from scintillating material and glows when hit by electrons</p>
      <img src="../img/screen_1.png" style="width:40%; margin:auto;"/>
      <p style="clear:both; font-size: small; text-align: center; margin-top:1em;">The camera films the screen</p>
    </div>
</div>

<h2 style="color: #b51f2a">Formulating the RL problem</h2>
<h3> The environment's state</h3>

The `state` can be fully described by four components:

- The **target beam**: the beam we want to achieve, our goal
  - as a 4-dimensional array $b^\mathrm{(t)}=[\mu_x^{(\mathrm{t})},\sigma_x^{(\mathrm{t})},\mu_y^{(\mathrm{t})},\sigma_y^{(\mathrm{t})}]$, where $\mu$ denotes the position on the screen, $\sigma$ denotes the beam size, and $t$ stands for "target".
- The **incoming beam**: the beam that enters the EA upstream
  - $I = [\mu_x^{(\mathrm{i})},\sigma_x^{(\mathrm{i})},\mu_y^{(\mathrm{i})},\sigma_y^{(\mathrm{i})},\mu_{xp}^{(\mathrm{i})},\sigma_{xp}^{(\mathrm{i})},\mu_{yp}^{(\mathrm{i})},\sigma_{yp}^{(\mathrm{i})},\mu_s^{(\mathrm{i})},\sigma_s^{(\mathrm{i})}]$, where $i$ stands for "incoming"
- The **magnet strengths** and **deflection angles**
  - $[k_{\mathrm{Q1}},k_{\mathrm{Q2}},\theta_\mathrm{CV},k_{\mathrm{Q3}},\theta_\mathrm{CH}]$
- The **transverse misalignments** of **quadrupoles** and the **diagnostic screen**
  - $[m_{\mathrm{Q1}}^{(\mathrm{x})},m_{\mathrm{Q1}}^{(\mathrm{y})},m_{\mathrm{Q2}}^{(\mathrm{x})},m_{\mathrm{Q2}}^{(\mathrm{y})},m_{\mathrm{Q3}}^{(\mathrm{x})},m_{\mathrm{Q3}}^{(\mathrm{y})},m_{\mathrm{S}}^{(\mathrm{x})},m_{\mathrm{S}}^{(\mathrm{y})}]$

<h3 style="color:#038aa1;">Discussion</h3>
<p style="color:#038aa1;"> $\implies$ Do we (fully) know or can we observe the state of the environment?</p>


<h2 style="color: #b51f2a">Formulating the RL problem</h2>
<h3> Our definition of observation</h3>

The `observation` for this task consists of three components:

- The **target beam**:  The desired beam, or the goal we aim to achieve.
  - as a 4-dimensional array $b^\mathrm{(t)}=[\mu_x^{(\mathrm{t})},\sigma_x^{(\mathrm{t})},\mu_y^{(\mathrm{t})},\sigma_y^{(\mathrm{t})}]$, where $\mu$ represents the position on the screen, $\sigma$ denotes the beam size, and $t$ refers to the "target".
- The **current beam**: The beam currently observed on the screen.
  - $b^\mathrm{(c)}=[\mu_x^{(\mathrm{c})},\sigma_x^{(\mathrm{c})},\mu_y^{(\mathrm{c})},\sigma_y^{(\mathrm{c})}]$, where $c$ represents "current".
- Magnet settings: The **magnet strengths** and **deflection angles**
  - $[k_{\mathrm{Q1}},k_{\mathrm{Q2}},\theta_\mathrm{CV},k_{\mathrm{Q3}},\theta_\mathrm{CH}]$
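
As a rough illustration of how these three components might be packed into a single array, the sketch below concatenates them into a flat 13-dimensional vector. The ordering (current beam, magnets, target beam) is an assumption inferred from the `obs[4:9]` slice used later in this notebook, and all numbers are made-up placeholders rather than machine values.

```python
import numpy as np

# Placeholder values for illustration only (not real machine settings)
current_beam = np.array([0.0, 4e-4, 0.0, 4e-4])     # [mu_x, sigma_x, mu_y, sigma_y]
magnets = np.array([10.0, -10.0, 0.0, 10.0, 0.0])   # [k_Q1, k_Q2, theta_CV, k_Q3, theta_CH]
target_beam = np.array([1e-3, 2e-4, 1e-3, 2e-4])    # [mu_x, sigma_x, mu_y, sigma_y]

# Assumed layout: current beam (0-3), magnets (4-8), target beam (9-12)
obs = np.concatenate([current_beam, magnets, target_beam])

print(obs.shape)   # 4 + 5 + 4 entries
print(obs[4:9])    # the magnet settings under the assumed layout
```

If the actual environment orders its observation differently, only the slice indices change; the three components stay the same.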

<h3 style="color:#038aa1;">Discussion</h3>
<p style="color:#038aa1;"> $\implies$ Does this observation definition satisfy the Markov property? That is, does the probability distribution for the next beam depend only on the current observation, or is it influenced by other state information?</p>

<h2 style="color: #b51f2a">Formulating the RL problem</h2>
<h3>Goal and reward</h3>

Our goal is divided into two tasks:

1. **Steering** the beam to the desired position.
2. **Focusing** the beam to the desired size.

<h2 style="color: #b51f2a">About libraries for RL</h2>

There are several libraries that provide pre-implemented RL algorithms and frameworks for creating environments. In this notebook, we use:

- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/) for the RL algorithms
- [Gymnasium](https://gymnasium.farama.org/) for the environment
<img src="../img/rl_libraries.png"  style="width:60%; margin:auto;"/>
<p style="clear:both; font-size: small; text-align: center; margin-top:1em;">More info <a href="https://neptune.ai/blog/the-best-tools-for-reinforcement-learning-in-python">here</a></p>


**Note**:

- Gymnasium is the successor of the [OpenAI Gym](https://www.gymlibrary.dev/).
- Stable-Baselines3 now has an early-stage JAX implementation, [sbx](https://github.com/araffin/sbx).

<h2 style="color: #b51f2a">Environment</h2>

We take all the elements of the RL problem defined above and represent the tuning task as a `gym`-style environment built on Gymnasium, a standard library for RL tasks.

A custom `gym.Env` consists of the following components:

- **Initialization**: Sets up the environment and defines the `observation_space` and `action_space`
- `reset` **method**: Resets the environment for a new episode and returns a 2-tuple `(observation, info)`
- `step` **method**: Contains the core logic. It accepts an action, updates the environment state, generates a new observation, computes the reward, and returns a 5-tuple `(observation, reward, terminated, truncated, info)`.
  - `terminated`: determines whether the episode should end based on the underlying MDP (e.g., goal reached, threshold exceeded).
  - `truncated`: checks whether the episode should be cut short due to conditions outside the MDP (e.g., time limits).
- `render` **method**: Provides a visual representation of the environment (e.g., video or plots).
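
To make the `reset` / `step` contract above concrete, here is a minimal sketch of a toy one-dimensional steering task. `ToySteeringEnv` is a hypothetical class written for this illustration and is not part of the tutorial repository. To keep the sketch dependent on NumPy only, it mirrors the method signatures without subclassing `gymnasium.Env`; a real environment would subclass it and declare `observation_space` and `action_space` during initialization.

```python
import numpy as np


class ToySteeringEnv:
    """Hypothetical 1-D steering task that mirrors the reset/step signatures."""

    def __init__(self, target=0.5, max_steps=20, tolerance=1e-3):
        self.target = float(target)
        self.max_steps = max_steps
        self.tolerance = tolerance
        self.position = 0.0
        self.steps = 0

    def _observation(self):
        # Observation: current position and target position
        return np.array([self.position, self.target], dtype=np.float32)

    def reset(self, seed=None, options=None):
        # Start each episode from a random position; return (observation, info)
        rng = np.random.default_rng(seed)
        self.position = float(rng.uniform(-1.0, 1.0))
        self.steps = 0
        return self._observation(), {}

    def step(self, action):
        # The action directly sets the new position (clipped to the valid range)
        self.position = float(np.clip(action, -1.0, 1.0))
        self.steps += 1
        error = abs(self.position - self.target)
        reward = -error                            # smaller error, higher reward
        terminated = error < self.tolerance        # goal reached (part of the MDP)
        truncated = self.steps >= self.max_steps   # time limit (outside the MDP)
        return self._observation(), reward, terminated, truncated, {"error": error}


toy_env = ToySteeringEnv(target=0.5)
obs, info = toy_env.reset(seed=0)
obs, reward, terminated, truncated, info = toy_env.step(0.5)
print(obs, reward, terminated, truncated)
```

The single `step` call lands exactly on the target, so the episode ends through `terminated` rather than `truncated`.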

<h2 style="color: #b51f2a">An overview of this RL project</h2>
<img src="../img/ares_ea_rl_technical_setup.png"  style="width:100%; margin:auto;"/>

<h2 style="color: #b51f2a">Code Directory Structure in RL4AA-2025 Repository</h2>

<p> The RL4AA-2025 Git repository contains all the necessary code and configurations for running reinforcement learning (RL) and Gaussian Process Model Predictive Control (GP-MPC) experiments for the ARES-EA transverse tuning task. Below is an overview of the key directories and their contents:</p>

- `src` Contains the source code for the RL environment and the GP-MPC controller
  - `src/environments/ea` contains the gymnasium environment for the ARES-EA transverse tuning task
  - `src/reward` contains files for the reward engineering (combination of rewards, transformation, ...)
  - `src/wrappers` contains custom wrappers for the EA environment
    - `src/wrappers/ea_mpc_episode_with_plotting` contains the wrapper for running GP-MPC (mainly it creates the visualization)
  - `src/train` contains scripts to train a default PPO agent to solve the task (can be used as a benchmark for evaluating the MPC controller)
  - `src/gpmpc` contains the GP-MPC controller
    - `src/gpmpc/control_object` implements the controller
      - `gp_models` implements the GP model for modeling the transition of the environment
      - `gp_mpc_controller` implements the controller
    - `src/gpmpc/utils` contains utility functions for the GP-MPC controller
- `data/trail.yaml` contains the pre-selected task configurations for evaluation
- `config/` config files for running GP-MPC control

<h2 style="color: #b51f2a">The ARES Experimental Area (ARES-EA) Environment</h2>

- We formulated the ARES-EA task as a `gym` environment, allowing our algorithm to easily interface with both the simulation and real machine backends as shown before.
  
- In this section, you will become familiar with the environment for beam focusing and positioning at the ARES accelerator.

Important APIs:

- `reset`: Resets the magnets to their initial values in both real and simulation cases. In the simulation, it also regenerates the incoming beam and (optionally) resets magnet misalignments.
- `step`: Adjusts the magnets to new settings and observes the beam (either by running a simulation or observing the screen image in the real world).

<div style="text-align: center; width:100%;">
    <h1>The Power of Standard Optimizers </h1>
</div>

The Nelder-Mead optimizer is a widely used, derivative-free optimization algorithm, well-suited for solving complex problems where gradients are difficult or expensive to compute. It relies on a simplex of $n + 1$ points in an $n$-dimensional search space and iteratively converges towards an optimal solution. In the context of the ARES-EA beam control problem, the Nelder-Mead optimizer is applied to adjust the magnet settings, effectively steering the beam to desired positions and focusing it to the required size. Its robustness in handling non-linear, noisy objective functions makes it an effective method for this type of application, where precise control over the beam's trajectory is crucial.
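
To show what "a simplex of n + 1 points" means in practice, the following is a simplified sketch of the Nelder-Mead iteration in plain NumPy. It uses the standard reflection, expansion, contraction and shrink steps, but only the inside contraction and a basic stopping rule, so it is an illustration and not a substitute for `scipy.optimize.minimize`. The function name `nelder_mead`, its parameters, and the quadratic test objective are all assumptions made for this example.

```python
import numpy as np


def nelder_mead(f, x0, step=0.1, max_evals=400, tol=1e-10):
    """Simplified Nelder-Mead sketch (inside contraction only)."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    # Initial simplex: x0 plus one point offset along each axis (n + 1 points)
    simplex = np.vstack([x0] + [x0 + step * np.eye(n)[i] for i in range(n)])
    values = np.array([f(x) for x in simplex])
    evals = n + 1

    while evals < max_evals:
        # Sort vertices from best (lowest value) to worst
        order = np.argsort(values)
        simplex, values = simplex[order], values[order]
        if values[-1] - values[0] < tol:
            break
        centroid = simplex[:-1].mean(axis=0)  # centroid of all but the worst vertex

        # Reflect the worst vertex through the centroid
        reflected = centroid + (centroid - simplex[-1])
        f_r = f(reflected)
        evals += 1
        if f_r < values[0]:
            # Reflection is the new best: try to expand further
            expanded = centroid + 2.0 * (centroid - simplex[-1])
            f_e = f(expanded)
            evals += 1
            if f_e < f_r:
                simplex[-1], values[-1] = expanded, f_e
            else:
                simplex[-1], values[-1] = reflected, f_r
        elif f_r < values[-2]:
            # Better than the second-worst vertex: accept the reflection
            simplex[-1], values[-1] = reflected, f_r
        else:
            # Contract the worst vertex towards the centroid
            contracted = centroid + 0.5 * (simplex[-1] - centroid)
            f_c = f(contracted)
            evals += 1
            if f_c < values[-1]:
                simplex[-1], values[-1] = contracted, f_c
            else:
                # Shrink all vertices towards the best one
                simplex[1:] = simplex[0] + 0.5 * (simplex[1:] - simplex[0])
                values[1:] = [f(x) for x in simplex[1:]]
                evals += n

    best = int(np.argmin(values))
    return simplex[best], values[best], evals


# Toy objective: squared distance to a hypothetical optimum
target = np.array([0.3, -0.2])
x_best, f_best, evals = nelder_mead(lambda x: float(np.sum((x - target) ** 2)), [0.0, 0.0])
print(x_best, f_best, evals)
```

Each iteration only compares objective values at the simplex vertices, which is why the optimizer needs no gradients and can treat a beam measurement as a black-box objective.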

However, to leverage the power of this optimizer, we first need to establish a well-defined environment that accurately models the problem at hand. The environment plays a crucial role in providing the feedback necessary for optimization, and in our case, it serves as the foundation for beam focusing and positioning at the ARES accelerator.

Before diving into control algorithms, it's important to set up the simulation, define the state and observation space, and ensure the environment provides meaningful feedback to an agent or controller. In our setup, the environment models the behavior of the electron beam as it passes through magnets and interacts with the diagnostic screen. By initializing the environment, we ensure that:

- We define a clear optimization goal – aligning the observed beam parameters with the target beam.

- We establish a reproducible testbed – allowing us to evaluate different control methods systematically.


<h3 style="color:#038aa1;">Set a target beam you want to achieve</h3>
<p style="color:#038aa1;"> $\implies$ Let's define the desired position $(\mu_x, \mu_y)$ and size $(\sigma_x, \sigma_y)$ of the beam on the screen</p>
<p style="color:#038aa1;"> $\implies$ Modify the <code>target_beam</code> parameters list below, where the order of the arguments is $[\mu_x,\sigma_x,\mu_y,\sigma_y]$</p>
<p style="color:#038aa1;"> $\implies$  Consider the screen dimensions ($\pm$ 2e-3 m) when setting the target values</p>
<p style="color:#038aa1;"> $\implies$ The target beam will be visually represented as a blue circle on the screen</p>

In [None]:
# Set EA configuration parameters path
config_path = Path("../config/ea_eval_config.yaml")

# Load EA configuration parameters
with config_path.open("r", encoding="utf-8") as file:
    config = yaml.safe_load(file)

# Create an EA environment
env = make_eval_env(config, ea)

In [None]:
# Specify target beam parameters, adjust as desired
target_beam = np.array([1e-3, 2e-4, 1e-3, 2e-4])

env.target_beam_values = target_beam
env.reset()  # Render one simulation frame

# Visually inspect beam position
plt.figure(figsize=(7, 4))
plt.imshow(env.render())  # Plot the screen image
# Let's improve the rendering to be the same as for RL4AA23

In [None]:
# Reset environment
env.reset()

# Get beam difference based on mean absolute error (MAE)
env.unwrapped.get_beam_difference(metric="mae")

# Visually inspect beam difference
plt.imshow(env.unwrapped.render())

## Tackling the ARES Experimental Area Beam Transverse Tuning Problem

The objective is to implement a controller or optimization method capable of steering the observed electron beam $\mathbf{b}$ towards the target beam parameters $\mathbf{b}'$.

The difference between the observed and target beam parameters can be characterized by the mean absolute error (MAE):
$$
d_\text{MAE} (\mathbf{b}, \mathbf{b}') = \frac{1}{4} \sum_{i=1}^{4} |b_i - b'_i|
$$

The algorithm is allowed to interact with the environment for a total of $T$ steps ($T=200$ in this example).

The performance of the method will be evaluated using the following metrics:

- Best MAE achieved by the method: $\min_{t=1,\dots,T} d_t$, where $d_t$ is the MAE at step $t$
- Cumulative MAE over the episode: $\sum_{t=1}^{T} d_t$
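
As a small worked example of the MAE definition and the two metrics above, the sketch below evaluates three hypothetical beam readings against a target. The helper `beam_mae` and the numbers are illustrative assumptions; the notebook itself uses the provided `evaluate_mae` utility.

```python
import numpy as np


def beam_mae(observed, target):
    """Mean absolute error between two beams [mu_x, sigma_x, mu_y, sigma_y]."""
    observed = np.asarray(observed, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(observed - target)))


# Hypothetical beam readings over three steps (in metres), not real data
target = [1e-3, 2e-4, 1e-3, 2e-4]
beams = [
    [0.0, 4e-4, 0.0, 4e-4],
    [5e-4, 3e-4, 5e-4, 3e-4],
    [1e-3, 2e-4, 1e-3, 2e-4],
]

maes = np.array([beam_mae(b, target) for b in beams])
best_mae = maes.min()         # best MAE reached during the episode
cumulative_mae = maes.sum()   # cumulative MAE over all steps

print(maes, best_mae, cumulative_mae)
```

The best MAE rewards reaching the target at any point, while the cumulative MAE also penalizes slow convergence, so a method that gets close quickly scores better on the second metric.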

### Baseline Method using Standard Optimizer

To help you get started, below we provide a simple example using the Nelder-Mead optimizer to demonstrate how to interact with the environment.

In [None]:
# Here we provide the baseline Nelder-Mead simplex Method
from scipy.optimize import minimize

In [None]:
# Wrap the env interaction in an objective function for optimization
def objective(x):
    env.step(x)
    return env.unwrapped.get_beam_difference(metric="mae")

In [None]:
# Reset the Environment
obs, _ = env.reset()

# Use the current magnet settings (observation entries 4 to 8) as the initial guess
x0 = obs[4:9]

In [None]:
# Specify minimization procedure
res = minimize(
    objective,
    x0,
    method="nelder-mead",
    options={
        "xatol": 1e-8,
        "disp": True,
        "maxfev": config["env_wrapper"][
            "max_episode_steps"
        ],  # Maximum number of function evaluations
    },
)

## Look into the result

In [None]:
# Extracting observations
observations = env.get_wrapper_attr("observations")

# Run MAE evaluation
evaluate_mae(observations)

### Now it is time to develop your own controller!



In [None]:
# Now let's create a controller

## Development Ideas:

In the challenge, 

- Have a utility function to prepare the basic env setup (should not be modified) (cheetah backend, magnet range, ...)
  - The users can change the 
  - 
- Create a new env wrapper to log the necessary statistics for the evaluation, e.g.
  - MAEs over steps
  - Wall-time used for each step
- Evaluation script that runs the controller on several tasks and saves the results
  - Use argparse to decide which tasks to load, we provide the `train_tasks` in the beginning
  - In the end, we provide the `test_tasks` for the final evaluation
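
The logging wrapper idea from the list above could look roughly like the following sketch. `StepLogger` and `_DummyEnv` are hypothetical names introduced here; the wrapper is written in plain Python around any object with Gymnasium-style `reset` and `step` methods, whereas an actual implementation would more likely subclass `gymnasium.Wrapper`. The `metric` callable stands in for whatever MAE computation the evaluation uses.

```python
import time


class StepLogger:
    """Records a per-step metric and wall-time for a Gymnasium-style env."""

    def __init__(self, env, metric):
        self.env = env
        self.metric = metric      # callable: observation -> float (e.g. an MAE)
        self.maes = []
        self.step_times = []

    def reset(self, **kwargs):
        # Clear the logs at the start of every episode
        self.maes.clear()
        self.step_times.clear()
        return self.env.reset(**kwargs)

    def step(self, action):
        start = time.perf_counter()
        result = self.env.step(action)
        self.step_times.append(time.perf_counter() - start)  # wall-time of this step
        self.maes.append(self.metric(result[0]))              # metric on the new observation
        return result


class _DummyEnv:
    """Stand-in environment: the observation is simply the last action."""

    def reset(self, **kwargs):
        return 0.0, {}

    def step(self, action):
        return action, 0.0, False, False, {}


logged = StepLogger(_DummyEnv(), metric=lambda obs: abs(obs - 1.0))
logged.reset()
for action in (0.0, 0.5, 1.0):
    logged.step(action)

print(logged.maes, logged.step_times)
```

Keeping the logging in a wrapper means the base environment setup stays untouched, and the same statistics are collected for every controller being compared.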
