# V3 Notebook 3: Advanced Policy Learning with Reinforcement Learning

**Project:** `AutoPharm` (V3)
**Goal:** To explore a fundamentally different approach to control by learning a policy directly from environmental interaction. This notebook implements a Reinforcement Learning (RL) agent using the Stable-Baselines3 library to control our simulated plant, moving beyond model-based control to direct policy optimization.

### Table of Contents
1. [Theory: Beyond Prediction to Direct Action](#1.-Theory:-Beyond-Prediction-to-Direct-Action)
2. [Framing the Control Problem for RL](#2.-Framing-the-Control-Problem-for-RL)
3. [Building a Custom Gym Environment](#3.-Building-a-Custom-Gym-Environment)
4. [Training the RL Agent](#4.-Training-the-RL-Agent)
5. [Evaluating the Learned Policy](#5.-Evaluating-the-Learned-Policy)

--- 
## 1. Theory: Beyond Prediction to Direct Action

Our MPC approach (V1 and V2) is **model-based**. Its effectiveness is fundamentally limited by the accuracy of its predictive model. If the model is wrong, the control decisions will be suboptimal. Furthermore, the multi-step process of predicting, evaluating, and optimizing can be computationally intensive.

**Reinforcement Learning (RL)** offers a **model-free** alternative. Instead of learning *what will happen*, the RL agent learns *what to do*. It directly learns a **policy**: a mapping from each observed state to an action, which training improves toward the actions that maximize cumulative reward.

The agent learns through a process of trial and error:
1.  It **observes** the state of the environment.
2.  It takes an **action** based on its current policy.
3.  The environment transitions to a new state and gives the agent a **reward** (or penalty).
4.  The agent updates its policy to favor actions that lead to higher cumulative rewards over time.
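
The four steps above can be sketched on a toy problem. The example below is only an illustration: `ToyPlant` is a hypothetical one-dimensional plant, and the policy is a single proportional gain improved by random-search hill climbing rather than by PPO. It is not the `AutoPharm` simulator or the algorithm used later in this notebook.

```python
import random


class ToyPlant:
    """Hypothetical 1-D plant: each action nudges the state x."""

    def __init__(self, target=1.0):
        self.target = target
        self.x = 0.0

    def reset(self):
        self.x = 0.0
        return self.x

    def step(self, action):
        self.x += 0.5 * action                  # simple first-order response
        reward = -abs(self.x - self.target)     # closer to target => higher reward
        return self.x, reward


def run_episode(env, gain, steps=20):
    """Roll out the policy action = gain * (target - state) and sum the rewards."""
    state = env.reset()                         # 1. observe
    total = 0.0
    for _ in range(steps):
        action = gain * (env.target - state)    # 2. act according to the policy
        state, reward = env.step(action)        # 3. new state and reward
        total += reward
    return total


rng = random.Random(0)
env = ToyPlant()
gain = 0.1                                      # deliberately poor initial policy
initial_return = run_episode(env, gain)
best_return = initial_return

for _ in range(200):
    candidate = gain + rng.gauss(0.0, 0.1)      # try a slightly different policy
    candidate_return = run_episode(env, candidate)
    if candidate_return > best_return:          # 4. keep changes that raise the return
        gain, best_return = candidate, candidate_return

print(f"gain={gain:.2f}, return improved from {initial_return:.2f} to {best_return:.2f}")
```

PPO replaces the random perturbation with gradient-based updates of a neural-network policy, but the observe, act, reward, update cycle is the same.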

For complex, nonlinear systems, an RL agent can potentially discover highly effective, non-intuitive control policies that are difficult to formulate with a model-based approach.

--- 
## 2. Framing the Control Problem for RL

To apply RL, we must define our problem in the standard RL terminology:

*   **Environment:** Our `AdvancedPlantSimulator`.
*   **Agent:** The RL algorithm we choose (e.g., PPO, SAC from Stable-Baselines3).
*   **State (Observation):** A vector representing the current state of the plant. This must include the current CMAs (`d50`, `LOD`), the current CPPs, and the target setpoints. The agent needs to know both where it is and where it's supposed to go. A history of past states could also be included.
*   **Action Space:** A continuous range of values for each of our controllable CPPs (`spray_rate`, `air_flow`, `carousel_speed`). The agent will output an action within this space.
*   **Reward Function:** This is the most critical part of the design. The reward function guides the entire learning process. A good reward function for our problem would:
    *   Give a large positive reward for being close to the CMA setpoints.
    *   Give a small negative reward for large changes in CPPs (to encourage smooth control).
    *   Give a large negative penalty for violating process constraints.
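
A minimal sketch of a reward with these three terms is shown below. The function name, the weights, the error scales, and the assumption that CPP changes are already normalized to roughly `[-1, 1]` are all illustrative choices, not the values used inside `GranulationEnv`.

```python
def compute_reward(d50, lod, target_d50, target_lod, cpp_deltas,
                   constraint_violated,
                   d50_scale=100.0, lod_scale=1.0,
                   smoothness_weight=0.01, violation_penalty=10.0):
    """Hypothetical reward: tracking bonus minus smoothness cost minus violation penalty."""
    # Scale the errors so that d50 (micrometres) and LOD (percent) are comparable.
    d50_err = abs(d50 - target_d50) / d50_scale
    lod_err = abs(lod - target_lod) / lod_scale

    # Tracking term lies in (0, 2] and reaches 2.0 exactly at both setpoints.
    tracking = 1.0 / (1.0 + d50_err) + 1.0 / (1.0 + lod_err)

    # Small cost on CPP changes (assumed normalized) to encourage smooth control.
    smoothness = smoothness_weight * sum(delta * delta for delta in cpp_deltas)

    # Large fixed penalty whenever a process constraint is violated.
    penalty = violation_penalty if constraint_violated else 0.0

    return tracking - smoothness - penalty


on_target = compute_reward(400.0, 2.0, 400.0, 2.0, [0.0, 0.0, 0.0], False)
off_target = compute_reward(500.0, 3.0, 400.0, 2.0, [0.0, 0.0, 0.0], False)
violating = compute_reward(400.0, 2.0, 400.0, 2.0, [0.0, 0.0, 0.0], True)
print(on_target, off_target, violating)
```

The relative size of the three weights matters more than their absolute values: if the violation penalty is too small compared with the tracking bonus, the agent may learn to accept violations in exchange for tighter tracking.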

--- 
## 3. Building a Custom Gym Environment

RL libraries like Stable-Baselines3 expect the environment to follow the `gymnasium` (formerly OpenAI Gym) API. We need to create a wrapper around our `AdvancedPlantSimulator` that implements this standard interface (`step`, `reset`, etc.).
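
To make the interface concrete, here is a dependency-free stand-in that mimics the shape of the Gymnasium API: `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. `ToyGranulationEnv` and its dynamics are hypothetical. A real environment would subclass `gymnasium.Env` and declare `observation_space` and `action_space` (for example with `gymnasium.spaces.Box`), which this sketch omits.

```python
class ToyGranulationEnv:
    """Stand-in with Gymnasium-style method signatures (not the real GranulationEnv)."""

    def __init__(self, target=400.0, episode_length=50):
        self.target = target
        self.episode_length = episode_length
        self.value = 0.0
        self.t = 0

    def _obs(self):
        # Observation contains both where we are and where we want to be.
        return [self.value, self.target]

    def reset(self, *, seed=None, options=None):
        self.value = 300.0
        self.t = 0
        return self._obs(), {}                      # (observation, info)

    def step(self, action):
        self.value += 10.0 * float(action)          # made-up plant response
        self.t += 1
        reward = -abs(self.value - self.target) / 100.0
        terminated = False                          # no failure state in this toy
        truncated = self.t >= self.episode_length   # time limit reached
        return self._obs(), reward, terminated, truncated, {}


env = ToyGranulationEnv()
obs, info = env.reset()
done = False
total_reward = 0.0
while not done:
    action = 1.0 if obs[0] < obs[1] else -1.0       # crude bang-bang policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"steps={env.t}, total_reward={total_reward:.2f}")
```

The distinction between `terminated` (the task itself ended) and `truncated` (a time limit cut the episode short) is part of the Gymnasium API, and the evaluation loop later in this notebook checks both flags.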

In [None]:
import sys
import os
sys.path.append('../src')

from autopharm_core.rl.environment import GranulationEnv
from autopharm_core.common.types import StateVector, ControlAction
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Let's first test our custom Gymnasium environment to ensure it works correctly:

In [None]:
# --- Configuration ---
ENV_CONFIG = {
    'initial_cpps': {'spray_rate': 120.0, 'air_flow': 500.0, 'carousel_speed': 30.0},
    'target_d50': 400.0,
    'target_lod': 2.0,
    'episode_length': 500
}

# Create and test the environment
env = GranulationEnv(config=ENV_CONFIG)

print("Environment created successfully!")
print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

# Test environment step
obs, _ = env.reset()
print(f"\nInitial observation: {obs}")

# Take a random action
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
print(f"After random action {action}: obs={obs[:2]}, reward={reward:.3f}")

--- 
## 4. Training the RL Agent

With our custom environment in place, we can now use Stable-Baselines3 to train an RL agent. We will use the **Proximal Policy Optimization (PPO)** algorithm, a widely used on-policy method that supports continuous action spaces and tends to train stably because it limits how far each update can move the policy.
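
PPO's core idea is a clipped surrogate objective: for each sample it compares the probability ratio between the new and old policy against a band of width `clip_range` around 1, and takes the more pessimistic of the clipped and unclipped terms. The function below is a per-sample sketch of that formula for illustration only; it is not how Stable-Baselines3 implements it internally, which operates on batched tensors and adds value and entropy terms.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0) made much more likely: the gain is capped at 1.2 * A.
print(clipped_surrogate(ratio=1.5, advantage=1.0))

# Bad action (A < 0) made much less likely: no extra credit beyond 0.8 * A.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))

# Good action made less likely: the unclipped (worse) value is kept.
print(clipped_surrogate(ratio=0.5, advantage=1.0))
```

Because the objective stops improving once the ratio leaves the band in the favorable direction, a single batch of data cannot push the policy arbitrarily far from the one that collected it.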

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor
import os

# Create directories for model and logs
os.makedirs('../data/models', exist_ok=True)
os.makedirs('../data/tensorboard', exist_ok=True)

MODEL_SAVE_PATH = "../data/models/ppo_granulation_policy"

# --- 1. Create and check the environment ---
train_env = GranulationEnv(config=ENV_CONFIG)
train_env = Monitor(train_env)  # Monitor for logging

eval_env = GranulationEnv(config=ENV_CONFIG)
eval_env = Monitor(eval_env)

check_env(train_env, warn=True)
print("Environment check passed!")

# --- 2. Configure PPO (SB3 default hyperparameters, plus a small entropy bonus via ent_coef) ---
model = PPO(
    "MlpPolicy", 
    train_env, 
    verbose=1,
    tensorboard_log="../data/tensorboard/",
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01
)

# --- 3. Set up evaluation callback ---
eval_callback = EvalCallback(
    eval_env, 
    best_model_save_path=MODEL_SAVE_PATH,
    log_path="../data/models/", 
    eval_freq=10000,
    deterministic=True, 
    render=False
)

print("\nStarting RL training... (This will take some time)")
print("Training with 100,000 timesteps for demonstration")
print("For production use, consider 1,000,000+ timesteps")

# Train the model
model.learn(total_timesteps=100000, callback=eval_callback)

# --- 4. Save the final policy ---
model.save(MODEL_SAVE_PATH + "_final")
print(f"\nTraining complete. Policy saved to {MODEL_SAVE_PATH}_final.zip")
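
It helps to translate the settings above into how much learning actually happens. The arithmetic below assumes a single training environment, that PPO always collects complete rollouts of `n_steps` before updating, and that episodes run the full 500 steps without early termination; treat the numbers as estimates under those assumptions.

```python
import math

# Values copied from the training cell above.
total_timesteps = 100_000
n_steps = 2048
batch_size = 64
n_epochs = 10
eval_freq = 10_000
episode_length = 500
n_envs = 1  # assumption: one (non-vectorized) training environment

rollout_size = n_steps * n_envs
n_rollouts = math.ceil(total_timesteps / rollout_size)   # policy update cycles
steps_collected = n_rollouts * rollout_size              # may exceed the budget slightly
minibatches_per_epoch = rollout_size // batch_size
grad_steps_per_update = n_epochs * minibatches_per_epoch
total_grad_steps = n_rollouts * grad_steps_per_update
n_evaluations = steps_collected // eval_freq
approx_episodes = steps_collected // episode_length

print(f"Rollouts / policy updates : {n_rollouts}")
print(f"Environment steps         : {steps_collected}")
print(f"Gradient steps per update : {grad_steps_per_update}")
print(f"Total gradient steps      : {total_grad_steps}")
print(f"Evaluation callbacks      : {n_evaluations}")
print(f"Approx. training episodes : {approx_episodes}")
```

Under these assumptions the agent sees only about 200 episodes and receives fewer than 50 policy updates, which is why the cell above describes this run as a demonstration rather than a production training budget.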

--- 
## 5. Evaluating the Learned Policy

The final step is to evaluate how well our trained agent can control the plant. We will load the saved policy and run it on our environment for a fixed number of steps, logging the results to see if it successfully drives the state to the target setpoint.

In [None]:
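# --- Load the trained model (final policy first, then the best checkpoint) ---
final_path = MODEL_SAVE_PATH + "_final.zip"
best_path = os.path.join(MODEL_SAVE_PATH, "best_model.zip")

if os.path.exists(final_path):
    trained_model = PPO.load(final_path)
    print("Loaded final model")
elif os.path.exists(best_path):
    trained_model = PPO.load(best_path)
    print("Loaded best model from evaluation callback")
else:
    # Fall back to the in-memory model from the training cell above
    print("No saved model found. Using the current in-memory model.")
    trained_model = model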

In [None]:
# --- Run evaluation loop ---
eval_env_test = GranulationEnv(config=ENV_CONFIG)
obs, _ = eval_env_test.reset()

log = []
cumulative_reward = 0

print("Running policy evaluation for 500 timesteps...")

for i in range(500):
    action, _states = trained_model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = eval_env_test.step(action)
    cumulative_reward += reward
    
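    # Extract the logged values from the observation. The indices below assume
    # the layout [d50, lod, d50_target, lod_target, spray_rate, air_flow,
    # carousel_speed]; adjust them if GranulationEnv orders or normalizes
    # its observation differently.
    current_state = obs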
    log_entry = {
        'time': i,
        'd50': current_state[0],
        'lod': current_state[1],
        'd50_target': current_state[2],
        'lod_target': current_state[3],
        'spray_rate': current_state[4],
        'air_flow': current_state[5],
        'carousel_speed': current_state[6],
        'reward': reward,
        'cumulative_reward': cumulative_reward,
        'action_spray_rate': action[0],
        'action_air_flow': action[1],
        'action_carousel_speed': action[2]
    }
    log.append(log_entry)
    
    if terminated or truncated:
        obs, _ = eval_env_test.reset()

df_eval = pd.DataFrame(log)
print(f"Evaluation complete. Final cumulative reward: {cumulative_reward:.2f}")

In [None]:
# --- Comprehensive Visualization ---
fig, axes = plt.subplots(2, 2, figsize=(20, 14))
fig.suptitle('RL Policy Performance Analysis', fontsize=16, fontweight='bold')

# 1. CMA Tracking Performance
axes[0,0].plot(df_eval['time'], df_eval['d50'], label='d50 (Actual)', color='blue', linewidth=2)
axes[0,0].axhline(y=df_eval['d50_target'].iloc[0], color='blue', linestyle='--', 
                  label=f'd50 Target ({df_eval["d50_target"].iloc[0]:.0f})', alpha=0.7)

ax1b = axes[0,0].twinx()
ax1b.plot(df_eval['time'], df_eval['lod'], label='LOD (Actual)', color='red', linewidth=2)
ax1b.axhline(y=df_eval['lod_target'].iloc[0], color='red', linestyle='--', 
             label=f'LOD Target ({df_eval["lod_target"].iloc[0]:.1f})', alpha=0.7)

axes[0,0].set_title('CMA Tracking Performance', fontweight='bold')
axes[0,0].set_xlabel('Time Steps')
axes[0,0].set_ylabel('d50 (μm)', color='blue')
ax1b.set_ylabel('LOD (%)', color='red')
axes[0,0].legend(loc='upper left')
ax1b.legend(loc='upper right')
axes[0,0].grid(True, alpha=0.3)

# 2. Control Actions
axes[0,1].plot(df_eval['time'], df_eval['spray_rate'], label='Spray Rate', linewidth=2)
axes[0,1].plot(df_eval['time'], df_eval['air_flow'], label='Air Flow', linewidth=2)
axes[0,1].plot(df_eval['time'], df_eval['carousel_speed'], label='Carousel Speed', linewidth=2)
axes[0,1].set_title('Control Actions (CPPs)', fontweight='bold')
axes[0,1].set_xlabel('Time Steps')
axes[0,1].set_ylabel('CPP Values')
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3)

# 3. Reward Evolution
axes[1,0].plot(df_eval['time'], df_eval['reward'], label='Instantaneous Reward', alpha=0.6)
# Rolling average for smoother visualization
rolling_reward = df_eval['reward'].rolling(window=20, center=True).mean()
axes[1,0].plot(df_eval['time'], rolling_reward, label='Reward (20-step avg)', linewidth=2, color='red')
axes[1,0].set_title('Reward Signal Evolution', fontweight='bold')
axes[1,0].set_xlabel('Time Steps')
axes[1,0].set_ylabel('Reward')
axes[1,0].legend()
axes[1,0].grid(True, alpha=0.3)

# 4. Raw agent actions (these are CPP changes only if GranulationEnv treats actions as increments)
axes[1,1].plot(df_eval['time'], df_eval['action_spray_rate'], label='Δ Spray Rate', alpha=0.8)
axes[1,1].plot(df_eval['time'], df_eval['action_air_flow'], label='Δ Air Flow', alpha=0.8)
axes[1,1].plot(df_eval['time'], df_eval['action_carousel_speed'], label='Δ Carousel Speed', alpha=0.8)
axes[1,1].set_title('Action Changes (Policy Smoothness)', fontweight='bold')
axes[1,1].set_xlabel('Time Steps')
axes[1,1].set_ylabel('Action Delta')
axes[1,1].legend()
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# --- Performance Analysis ---
print("=== RL Policy Performance Analysis ===")
print()

# Calculate final tracking errors
final_d50_error = abs(df_eval['d50'].iloc[-1] - df_eval['d50_target'].iloc[-1])
final_lod_error = abs(df_eval['lod'].iloc[-1] - df_eval['lod_target'].iloc[-1])

# Calculate average tracking errors
avg_d50_error = abs(df_eval['d50'] - df_eval['d50_target']).mean()
avg_lod_error = abs(df_eval['lod'] - df_eval['lod_target']).mean()

print(f"Final Tracking Performance:")
print(f"  d50 Error: {final_d50_error:.2f} μm (Target: {df_eval['d50_target'].iloc[-1]:.0f}μm)")
print(f"  LOD Error: {final_lod_error:.3f} % (Target: {df_eval['lod_target'].iloc[-1]:.1f}%)")
print()

print(f"Average Tracking Performance:")
print(f"  d50 Error: {avg_d50_error:.2f} μm")
print(f"  LOD Error: {avg_lod_error:.3f} %")
print()

# Control smoothness analysis
action_variability = {
    'spray_rate': df_eval['action_spray_rate'].std(),
    'air_flow': df_eval['action_air_flow'].std(),
    'carousel_speed': df_eval['action_carousel_speed'].std()
}

print(f"Control Action Variability (lower is smoother):")
for param, std_val in action_variability.items():
    print(f"  {param}: {std_val:.3f}")
print()

print(f"Learning Metrics:")
print(f"  Total Cumulative Reward: {cumulative_reward:.2f}")
print(f"  Average Reward per Step: {cumulative_reward/len(df_eval):.3f}")
print(f"  Final Reward: {df_eval['reward'].iloc[-1]:.3f}")
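
Final and average errors do not show how quickly the controller converges. Two common complements are the settling index (the first step after which the signal stays inside a tolerance band around the target) and the integral of absolute error. The helpers below are a sketch with hypothetical names and a made-up trajectory; they take plain lists, so a column such as `df_eval['d50'].tolist()` could be passed in.

```python
def settling_index(values, target, tolerance):
    """First index from which |value - target| stays within tolerance, else None."""
    settled_from = None
    for i, value in enumerate(values):
        if abs(value - target) <= tolerance:
            if settled_from is None:
                settled_from = i        # candidate start of the settled region
        else:
            settled_from = None         # left the band again, so start over
    return settled_from


def integral_abs_error(values, target):
    """Sum of absolute deviations from the target over the trajectory."""
    return sum(abs(value - target) for value in values)


# Made-up d50 trajectory approaching a 400 um target with a +/- 8 um band.
trajectory = [300.0, 350.0, 390.0, 405.0, 398.0, 401.0]
print(settling_index(trajectory, target=400.0, tolerance=8.0))
print(integral_abs_error(trajectory, target=400.0))
```

A trajectory that enters the band and then leaves it again resets the settling index, so oscillating policies are penalized even if their final error is small.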

### Final Analysis and Conclusions

The evaluation plots above show how the RL policy, learned entirely through trial-and-error interaction with the plant simulator, tracks the setpoints. Key observations:

**Strengths of the RL Approach:**
- **Model-Free Learning**: The agent learns effective control strategies without requiring an explicit process model
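- **Direct Policy Optimization**: At run time each action comes from one forward pass through the policy network, which is typically cheaper than solving an optimization problem at every control step as MPC does
- **Adaptive Behavior**: The policy can handle different setpoints and process conditions only if training covers them; `ENV_CONFIG` here specifies a single setpoint pair (`target_d50` of 400 and `target_lod` of 2.0)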
- **Non-Linear Control**: Can discover complex, non-obvious control strategies that linear controllers might miss

**Implementation Insights:**
- **Reward Engineering**: The reward function design is critical - it must balance tracking performance with control smoothness
- **Environment Design**: Proper state representation (including targets) and action space bounds are essential
- **Training Time**: Training is computationally expensive, but the resulting policy is cheap to evaluate at run time
- **Exploration vs. Exploitation**: PPO balances trying new actions with exploiting known good strategies
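
On the point about action space bounds: a common pattern is to let the agent act in a normalized range such as `[-1, 1]` and map that to physical CPP values inside the environment. The sketch below illustrates the mapping; the bounds in `CPP_BOUNDS` are hypothetical placeholders, not the limits used by `GranulationEnv`. If the environment instead interprets actions as increments, the same mapping would apply to a per-step change limit.

```python
# Hypothetical physical limits for each CPP (illustrative values only).
CPP_BOUNDS = {
    'spray_rate': (80.0, 180.0),
    'air_flow': (400.0, 700.0),
    'carousel_speed': (20.0, 40.0),
}


def scale_action(normalized_action, bounds=CPP_BOUNDS):
    """Map each component from [-1, 1] to its physical range, clipping first."""
    scaled = {}
    for value, (name, (low, high)) in zip(normalized_action, bounds.items()):
        value = max(-1.0, min(1.0, float(value)))   # guard against out-of-range outputs
        scaled[name] = low + 0.5 * (value + 1.0) * (high - low)
    return scaled


print(scale_action([-1.0, 0.0, 1.0]))
print(scale_action([2.5, -3.0, 0.5]))   # out-of-range inputs are clipped
```

Clipping inside the environment means the plant never receives a physically invalid command, whatever the policy network outputs during early exploration.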

**Comparison with MPC (V1/V2):**
- **MPC Advantages**: Explicit constraints, predictable behavior, interpretable optimization
- **RL Advantages**: No model required, potentially superior non-linear control, fast execution
- **Hybrid Potential**: RL could learn high-level strategies while MPC handles low-level optimization

This notebook completes the third and final pillar of the **AutoPharm V3** framework. By implementing Reinforcement Learning, we have opened the door to a fundamentally different class of control strategies that can potentially surpass traditional model-based approaches for complex, non-linear pharmaceutical processes.