# High-level Outline of Wrapping an RL Algorithm around a PID Tuning Task

1. **Define the Control Problem and Environment.**
   * System Dynamics: for example, a second-order mass-spring-damper system
   * Controller: A PID controller parameterized by
     $$C(s) = K_p + \frac{K_i}{s} + K_d s$$
   * Reference Signal: a set point to track, or a trajectory we want the system to follow
   * State Representations: IMPORTANT!
     In an RL setting, we need to decide what the state vector (input to the RL agent) will be. Possible choices:
     * The current tracking error $e(t) = r(t) - y(t)$
     * The derivative of the error $\dot{e}(t)$
     * Possibly the integral of error, or the system's internal state if accessible
   * Actions: In RL-for-control problems, the agent's action is to update the controller parameters. It can therefore output the increments $\Delta K_p, \Delta K_i, \Delta K_d$ directly, or pick a value for the closed-loop bandwidth $\omega_{\text{CL}}$.
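
As a rough illustration of item 1, the sketch below simulates a mass-spring-damper plant under a discrete PID law and exposes the state vector $[e, \dot{e}, \int e]$. The class name `MassSpringDamperPID`, the plant parameters `m`, `c`, `k`, the time step `dt`, and the semi-implicit Euler integration are all assumptions made for this example; none of them comes from the outline itself.

```python
import numpy as np


class MassSpringDamperPID:
    """Hypothetical plant m*y'' + c*y' + k*y = u in closed loop with a PID law."""

    def __init__(self, m=1.0, c=0.5, k=2.0, dt=0.01, setpoint=1.0):
        self.m, self.c, self.k = m, c, k
        self.dt, self.setpoint = dt, setpoint
        self.gains = np.zeros(3)  # [Kp, Ki, Kd], to be set by the tuner
        self.reset()

    def reset(self):
        self.y, self.v = 0.0, 0.0        # position and velocity
        self.e = self.setpoint - self.y  # tracking error e = r - y
        self.e_dot, self.e_int = 0.0, 0.0
        return self.state()

    def state(self):
        # State vector handed to the RL agent: [e, de/dt, integral of e]
        return np.array([self.e, self.e_dot, self.e_int])

    def step(self):
        kp, ki, kd = self.gains
        u = kp * self.e + ki * self.e_int + kd * self.e_dot
        # Semi-implicit Euler integration of the plant
        acc = (u - self.c * self.v - self.k * self.y) / self.m
        self.v += acc * self.dt
        self.y += self.v * self.dt
        # Refresh the error signals from the new output
        e_new = self.setpoint - self.y
        self.e_dot = (e_new - self.e) / self.dt
        self.e_int += e_new * self.dt
        self.e = e_new
        return self.state(), u
```

Setting `env.gains = np.array([Kp, Ki, Kd])` and calling `env.step()` repeatedly yields the state sequence an agent would observe. Whether to include the integral term in the state is a design choice; it is included here only because the list above mentions it as an option.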

2. **Choose the Reward Function.**
   * A typical reward function for controller tuning penalizes tracking error and actuator effort.
     $$r = -\left( \alpha |e(t)| + \beta u(t)^2 \right)$$
     where $u(t)$ is the control signal (to limit large control efforts), and $\alpha, \beta$ are weighting factors.
   * Alternatively, the reward can be based on IAE (integral of absolute error), ISE (integral of squared error), settling time, overshoot, etc., or on a weighted combination of them. The key is to design the reward function so that it reflects what "good control performance" means for the application.
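
The reward options in item 2 could be computed along the following lines. This is a minimal sketch: the function names, the default weights `alpha` and `beta`, and the rectangle-rule approximation of the integrals are assumptions chosen for illustration.

```python
import numpy as np


def step_reward(e, u, alpha=1.0, beta=0.01):
    """Instantaneous reward: penalize absolute tracking error and control effort."""
    return -(alpha * abs(e) + beta * u ** 2)


def iae(errors, dt):
    """Integral of absolute error, approximated by a rectangle rule."""
    return float(np.sum(np.abs(errors)) * dt)


def ise(errors, dt):
    """Integral of squared error, approximated by a rectangle rule."""
    return float(np.sum(np.square(errors)) * dt)


def overshoot(y, setpoint):
    """Fractional overshoot above a nonzero setpoint (0.0 if the peak stays below it)."""
    return max(0.0, (float(np.max(y)) - setpoint) / abs(setpoint))


def episode_reward(errors, y, setpoint, dt, w_iae=1.0, w_os=1.0):
    """Weighted combination of IAE and overshoot, negated so that higher is better."""
    return -(w_iae * iae(errors, dt) + w_os * overshoot(y, setpoint))
```

The weights set the trade-off between metrics and would need to be chosen per application.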

3. **Implementation Flow (Online).**
   * Based on Nathan's work, RL should be implemented online $\rightarrow$ the RL agent learns in real time on a physical or simulated system. Here's the flow:
     1. Initialize PID gains (+ safety considerations/constraints); initialize the RL agent's internal parameters (e.g., NN weights in a policy gradient method, a Q-table, etc.)
     2. Observe the state: at each control cycle, measure or estimate the state $s_t$ $\rightarrow$ this is difficult because full observation of the environment is not always available. 
     3. Action selection:
        RL agent chooses an action $a_t$:
        $$a_t = \left[ \Delta K_p, \Delta K_i, \Delta K_d \right] \quad \text{or} \quad a_t = \Delta \omega_{\text{CL}}$$
        Then update the current gains accordingly.
     4. Apply controller: use the updated PID gains to compute $u(t)$ for the system; let the system run for one control interval (time horizon) under the NEW PID gains
     5. Reward Computation: After this step/horizon, compute the immediate reward $r_t$ based on the observed tracking performance (reward function)
     6. Agent Update: Use the transition $(s_t, a_t, r_t, s_{t+1})$ to update the RL agent in real time: (a) the policy in policy-gradient methods; (b) the value function in Q-learning, or the actor and critic networks in actor-critic methods.
     7. Repeat: Move to the next time step; RL should discover how to adjust PID gains (or bandwidth) to maximize reward (gradually) $\rightarrow$ HOW TO ENSURE FAST CONVERGENCE?
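
The outline does not say how a bandwidth action $\Delta \omega_{\text{CL}}$ in step 3 maps to PID gains. One possible mapping, assumed here purely for illustration, is pole placement on a known mass-spring-damper plant $1/(m s^2 + c s + k)$: the closed-loop characteristic polynomial is $m s^3 + (c + K_d) s^2 + (k + K_p) s + K_i$, and matching it to $m (s + \omega_{\text{CL}})^3$ gives the gains below. The function names and the bounds `omega_min`, `omega_max` are hypothetical.

```python
import numpy as np


def gains_from_bandwidth(omega_cl, m, c, k):
    """Place all three closed-loop poles at -omega_cl (assumes m, c, k are known)."""
    kp = 3.0 * m * omega_cl ** 2 - k
    ki = m * omega_cl ** 3
    kd = 3.0 * m * omega_cl - c
    return kp, ki, kd


def apply_bandwidth_action(omega_cl, d_omega, omega_min, omega_max):
    """Apply the agent's increment and keep the bandwidth inside a safe range."""
    return float(np.clip(omega_cl + d_omega, omega_min, omega_max))
```

With this parameterization the agent searches over a single scalar instead of three gains, at the cost of requiring an approximate plant model.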

4. **Choice of Algorithms.**
   * MODEL-FREE METHODS (Q‐Learning, DDPG, PPO, SAC):
     Typically represent the policy by an actor network that takes the current system state as input and outputs the increments in PID gains. Advantage: no explicit model of your system is needed. Disadvantage: potentially slower learning, and a need for careful hyperparameter tuning, exploration strategies, etc.
   * MODEL-BASED RL: not common for PID tuning tasks -- HOW ABOUT A PARTIAL MODEL? CAN RL LEARN A MODEL PURELY FROM DATA (CAUSAL INFERENCE)?
   * Adaptive/Hybrid approaches: Tune a standard PID structure using approximate methods?
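
To make the "value function in Q-learning" update concrete, here is a minimal tabular sketch. It assumes the tracking error has been discretized into bins and that the action set is a small list of gain increments; `ACTIONS`, the bin edges, and the hyperparameters `lr`, `gamma`, `epsilon` are hypothetical. The continuous-action methods listed above (DDPG, PPO, SAC) replace the table with neural networks.

```python
import numpy as np

ACTIONS = np.array([-0.1, 0.0, 0.1])  # hypothetical discrete increments of Kp


def discretize_error(e, edges):
    """Map a continuous tracking error to a bin index usable as a table row."""
    return int(np.digitize(e, edges))


def select_action(q_table, state_idx, epsilon, rng):
    """Epsilon-greedy selection over the discrete gain increments."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state_idx]))


def q_update(q_table, s, a, r, s_next, done, lr=0.1, gamma=0.99):
    """One-step Q-learning update on the transition (s, a, r, s_next)."""
    target = r if done else r + gamma * np.max(q_table[s_next])
    q_table[s, a] += lr * (target - q_table[s, a])
    return q_table
```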

5. **Practical Implementation Considerations.**
   * Safety: how to constrain RL exploration (boundary settings)
   * Rate of Updates: In real settings (especially process industries), we can't (and don't need to) adjust PID gains on every single time step. Solution: run the system for a short period (enough time to observe some transients) $\rightarrow$ compute a performance metric $\rightarrow$ only then update the gains
   * Reward Shaping: Don't rely on a single end-of-episode reward, because it slows learning. Design an instantaneous reward to help the RL agent converge faster
   * Discrete vs. Continuous action space: for small steps in $\omega_{\text{CL}}$ or in $\Delta K_p$, a continuous-action RL algorithm is preferred.
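
The safety and update-rate points above might be handled as in the following sketch. The helper names, the per-update limit `max_step`, and the bound arrays are assumptions; `step_fn` stands for any user-supplied callable that advances the closed loop by one sample and returns the tracking error.

```python
import numpy as np


def apply_bounded_update(gains, delta, lower, upper, max_step):
    """Rate-limit the agent's increment, then clip the gains to a safe box."""
    delta = np.clip(delta, -max_step, max_step)
    return np.clip(gains + delta, lower, upper)


def run_interval(step_fn, n_steps, dt):
    """Hold the gains fixed for n_steps samples and return the IAE over that window."""
    total = 0.0
    for _ in range(n_steps):
        total += abs(step_fn()) * dt
    return total
```

The agent would then act once per window, using the negated IAE from `run_interval` as its reward, instead of acting on every sample.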

### Pseudocode

```python
# Model-free RL with a continuous action space.
# FirstOrderEnv, RLAgent, policy_network, MAX_EPISODES and the initial gains
# Kp0, Ki0, Kd0 are placeholders to be supplied by the user.

# 1) Initialize the environment
env = FirstOrderEnv()

# 2) Initialize the RL agent
agent = RLAgent(policy_network, ...)

# 3) PID parameters (or omega_CL) start from a safe initial guess
Kp, Ki, Kd = Kp0, Ki0, Kd0
env.set_pid_gains(Kp, Ki, Kd)

# Note: the gains persist across episodes; only the environment is reset.
for episode in range(MAX_EPISODES):
    state = env.reset()  # e.g. reset error, integrators, etc.
    done = False
    episode_reward = 0.0

    while not done:
        # Agent selects an action based on the current state,
        # e.g. action = [dKp, dKi, dKd]
        action = agent.select_action(state)

        # Update the PID gains
        Kp += action[0]
        Ki += action[1]
        Kd += action[2]
        env.set_pid_gains(Kp, Ki, Kd)

        # Step the environment for one (or several) time steps
        next_state, reward, done, info = env.step()

        # Agent learns from this transition
        agent.update(state, action, reward, next_state, done)

        state = next_state
        episode_reward += reward

    print(f"Episode {episode}, total_reward: {episode_reward}")
```


# Potential Modifications and Extensions on the Existing Work

1. Multi-Loop and MIMO Extensions
   * Multi-agent RL: Assign one agent per PID loop, with either local or shared rewards (depending on coupling)
   * Centralized Critic, Decentralized Actors: A central critic can observe the global state for better training feedback, while each loop's actor uses only local information at runtime.

2. Safe or Constrained RL
   * Action Bounding: Keep PID parameters within known stability bounds
   * Lyapunov-based Critics or Barrier Functions: incorporate formal constraints so that actions violating stability or safety constraints are automatically corrected.

3. Robust or Adaptive RL for Nonstationary Environments: because industrial processes drift over time
   * Online Fine-Tuning: Keep updating the actor/critic with a small learning rate as the environment changes
   * Domain Randomization in simulation: produce better generalization abilities by training across a range of process parameters or disturbances.

4. Hybrid Approaches: Model-based (or partially model-based) + RL
   * Use an MPC-based safe supervisor to override dangerous actions
   * Model-Reference RL: leverage a known or approximated plant model to shape the reward or guide exploration

5. Sophisticated Reward Function (unique design for different needs)
   * Multi-objective Reward: Weighted sum approach for different KPIs
   * Sparse or Event-Driven Rewards: If real-time feedback is not always available or if certain events are very costly, modify the reward to reflect these priorities.

6. Extended Testing and Benchmarking
   * Benchmark Processes: test on well-established control problems
   * Compare with Adaptive Control

7. Detailed Stability & Performance Analysis
   * Sensitivity Analysis
   * Convergence Guarantees (math proofs)

8. Other thoughts:
   * Train offline first, then gradually adapt online
   * Let the RL controller run in parallel with the existing PID, compare performance, and only switch over when proven superior
   * Transfer learning (could be from one PID controller to another PID controller, or could be from a plant to another plant)
   * Model interpretability: sequential decision making?
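
The domain randomization idea from item 3 can be sketched as follows. The uniform sampling, the `rel_spread` parameter, and the parameter names in `nominal` are assumptions; in practice the ranges would come from knowledge of how much the process drifts.

```python
import numpy as np


def sample_plant_params(rng, nominal, rel_spread=0.2):
    """Draw each plant parameter uniformly within +/- rel_spread of its nominal value."""
    return {
        name: value * rng.uniform(1.0 - rel_spread, 1.0 + rel_spread)
        for name, value in nominal.items()
    }


# Example: a fresh plant for each training episode (nominal values are placeholders)
rng = np.random.default_rng(seed=0)
nominal = {"m": 1.0, "c": 0.5, "k": 2.0}
episode_params = sample_plant_params(rng, nominal)
```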

# Wild Idea

Research Topic: Multi-Agent Reinforcement Learning for Coordinated Tuning of Multiple PID Controllers in Large-Scale or MIMO Systems

Context and motivation:

* In many industrial processes, you often have multiple control loops running in parallel. Each loop is managed by a PID controller.
* These loops can be highly coupled, meaning the tuning of one PID loop may affect the performance of others due to process interactions, shared resources, or physical couplings.
* Traditional tuning methods (like Ziegler-Nichols) do not account for these interactions, which can lead to suboptimal or unstable performance at the plant level. 

Core Idea:

* Use a multi-agent reinforcement learning (MARL) approach in which each PID loop is tuned by a separate RL agent in real time (or near-real time).
* The agents must coordinate with each other through a shared reward or communication mechanism (this is reminiscent of social RL), so that overall plant performance is optimized rather than each loop optimizing its own performance in isolation.
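
One simple way to realize a shared reward, offered here only as an assumption and not as a settled design, is to blend each agent's local reward with the plant-wide average. The function name `mixed_rewards` and the `coupling` weight are hypothetical.

```python
import numpy as np


def mixed_rewards(local_rewards, coupling=0.5):
    """Blend each loop's local reward with the mean reward across all loops.

    coupling = 0 gives fully independent agents; coupling = 1 gives every
    agent the same plant-wide reward.
    """
    local = np.asarray(local_rewards, dtype=float)
    return (1.0 - coupling) * local + coupling * local.mean()
```

A larger `coupling` pushes the agents toward plant-level performance, which seems most relevant when the loops interact strongly.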