# Context: Delays and Overshoot in Real-World CSTR Operations

* Actuator Deadtime:
    Opening or closing a valve (e.g., adjusting $F_{in}$) might not be instantaneous. The valve might take a few seconds or minutes to fully respond to the control signal.

* Process Deadtime:
    Chemical reactions, especially in large reactors, often have transport and reaction delays. The concentration or temperature measurement might only reflect a control action taken some time ago.

* Consequence:
    Without accounting for these delays, a typical RL agent (or a standard PID controller) might keep increasing the jacket temperature ($T_c$) because it does not immediately see the effect of its action. By the time the effect becomes apparent, the process may have overshot the setpoint significantly; the controller then reacts aggressively in the other direction, causing oscillations.
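
The effect described above can be reproduced with a toy loop. The sketch below is not a CSTR model: it assumes a first-order lag with a hypothetical step gain `a`, a proportional-only controller with hypothetical gain `kp`, and an input deadtime of `delay_steps` samples. All names and numbers are illustrative.

```python
from collections import deque


def simulate_p_control(delay_steps, kp=2.0, setpoint=1.0, a=0.1, n_steps=200):
    """Return the trajectory of x under proportional control with input deadtime."""
    x = 0.0
    # Actions that have been issued but have not reached the process yet.
    pending = deque([0.0] * delay_steps)
    trajectory = []
    for _ in range(n_steps):
        u = kp * (setpoint - x)          # controller reacts to the current measurement
        pending.append(u)
        u_applied = pending.popleft()    # action issued delay_steps samples ago
        x = x + a * (-x + u_applied)     # first-order lag driven by the delayed action
        trajectory.append(x)
    return trajectory


peak_no_delay = max(simulate_p_control(delay_steps=0))
peak_delayed = max(simulate_p_control(delay_steps=5))
print(f"peak without deadtime: {peak_no_delay:.3f}")
print(f"peak with deadtime:    {peak_delayed:.3f}")
```

With these hypothetical numbers the undelayed loop approaches its steady value (2/3, because proportional-only control leaves an offset) without exceeding it, while the loop with five samples of deadtime climbs past the setpoint of 1.0 before the controller sees the consequences of its earlier actions.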


# 1. Incorporating Time-Delay Awareness into RL

## 1.1 LSTM in Actor/Critic Networks

1. Motivation:
    * An LSTM (Long Short-Term Memory) layer captures temporal dependencies and delayed cause-and-effect relationships in the system.
    * By storing recent history of states (and potentially actions), the RL agent can infer when the system is still responding to a prior adjustment.

2. Mechanics:
    * Actor Network:
        * Instead of Actor($s_t$) being a simple feedforward network, make it ActorLSTM($\{s_t, s_{t-1}, \ldots\}$).
        * The LSTM's hidden state effectively tracks delayed system responses, enabling the policy to anticipate overshoot. 
    * Critic Network:
        * Similarly, incorporate an LSTM in Critic($s_t, a_t$).
        * The critic learns that an action taken at time $t - \Delta$ might only show up in the measured concentration at time $t$.

3. Reward shaping:
    * Provide negative rewards for large overshoot or high control action rates
    * The LSTM helps the agent realize it should wait for the reaction to "catch up" rather than continuing to push the same control input.


## 1.2 Prediction-Based or Model-Based Approaches:



# 2. Practical Solutions for Deadtime and Overshoot

1. Deadtime-Aware States:
    The RL state could include delayed actions or a time-shifted buffer of past actions. For example, if we know the effect of a valve opening will appear in the measurement after 3 time units, the agent's state keeps the action issued at time $t$ available until its actual effect shows up at $t + 3$.

2. Delayed observations:
    If the environment does not show changes in temperature or concentration until some steps later, the LSTM can store the prior actions and states so the agent can infer, "We are still waiting on the effect of the last big temperature increase."

3. Overshoot Penalization:
    * Enhance the reward function:
    $$
    r_t = -\alpha (\text{tracking\_error}_t)^2 - \beta (\text{control\_effort}_t)^2 - \gamma \max(0, \text{overshoot}_t)
    $$
    * Where the overshoot$_t$ is the positive difference if the measurement goes above the setpoint. This explicitly teaches the agent to avoid large overshoot. 

4. Adaptive Critic Frequency + Delay Buffer:
    * Combine the idea of adaptive critic frequency with a delay buffer inside the environment
    * Each time the agent acts, the environment internally queues the effect of that action to apply it $\tau$ steps later, simulating deadtime
    * The agent's LSTM then accounts for how older actions might be just now taking effect

5. Decision Tree Branches for Delay:
    * In the decision-tree approach, certain branches might specifically handle large known delays, for instance
        * Branch condition: "If current time < (last_setpoint_change_time + deadtime_constant), expect delayed effect".
        * This branch might instruct the RL agent (or the hierarchical sub-policy) to moderate changes
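
Item 1 above (deadtime-aware states) can be sketched as a small wrapper that appends the most recent actions to each measurement. The class name `ActionHistoryState`, its parameters, and the example values are hypothetical; the only assumption about the plant is that `history_len` is chosen at least as long as the known deadtime.

```python
from collections import deque


class ActionHistoryState:
    """Appends the last `history_len` actions to each measurement vector."""

    def __init__(self, action_dim, history_len):
        self.action_dim = action_dim
        self.history_len = history_len
        self.reset()

    def reset(self):
        # Start each episode with an all-zero action history.
        self.history = deque(
            ([0.0] * self.action_dim for _ in range(self.history_len)),
            maxlen=self.history_len,
        )

    def observe(self, measurement, last_action=None):
        # Record the action that was just sent to the plant, if any.
        if last_action is not None:
            self.history.append(list(last_action))
        past_actions = [value for action in self.history for value in action]
        # Measurement first, then past actions ordered oldest to newest.
        return list(measurement) + past_actions


augmenter = ActionHistoryState(action_dim=1, history_len=3)
s0 = augmenter.observe([0.5, 350.0])                       # before any action
s1 = augmenter.observe([0.48, 351.0], last_action=[0.8])   # after one action
print(s0)
print(s1)
```

Because the buffer has a fixed length, the augmented state keeps a constant dimension, so it can be fed to either a feedforward or an LSTM policy.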


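The overshoot-penalized reward from item 3 translates directly into a function. This is a minimal sketch: the weights `alpha`, `beta`, and `gamma` are hypothetical defaults, and overshoot is taken as the amount by which the measurement exceeds the setpoint, which assumes the setpoint is approached from below.

```python
def shaped_reward(measurement, setpoint, control_effort,
                  alpha=1.0, beta=0.1, gamma=5.0):
    """r_t = -alpha * error^2 - beta * effort^2 - gamma * max(0, overshoot)."""
    tracking_error = setpoint - measurement
    # Positive only when the measurement is above the setpoint.
    overshoot = max(0.0, measurement - setpoint)
    return (-alpha * tracking_error ** 2
            - beta * control_effort ** 2
            - gamma * overshoot)


r_below = shaped_reward(measurement=0.9, setpoint=1.0, control_effort=0.0)
r_above = shaped_reward(measurement=1.1, setpoint=1.0, control_effort=0.0)
print(r_below, r_above)
```

The two calls have the same squared tracking error, but the one above the setpoint receives the extra $\gamma$ term, so the agent is pushed to approach the setpoint cautiously rather than symmetrically.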


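The delay branch in item 5 could be implemented as a supervisory rate limit that is active only inside the deadtime window. The function name `moderate_action` and the limit `max_step_in_deadtime` are hypothetical, and a scalar action is assumed for brevity.

```python
def moderate_action(proposed, previous, now, last_setpoint_change_time,
                    deadtime_constant, max_step_in_deadtime=0.05):
    """Limit the change in a scalar action while a delayed effect is still expected."""
    in_deadtime = now < last_setpoint_change_time + deadtime_constant
    if not in_deadtime:
        return proposed  # outside the window the RL action passes through unchanged
    step = proposed - previous
    # Clip the step to the allowed range during the deadtime window.
    step = max(-max_step_in_deadtime, min(max_step_in_deadtime, step))
    return previous + step


inside = moderate_action(proposed=1.0, previous=0.0, now=1.0,
                         last_setpoint_change_time=0.0, deadtime_constant=3.0)
outside = moderate_action(proposed=1.0, previous=0.0, now=5.0,
                          last_setpoint_change_time=0.0, deadtime_constant=3.0)
print(inside, outside)
```

Keeping this rule outside the learned policy means the moderation stays readable and can be tuned independently of the network weights.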


# 3. Illustrative Extended Pseudocode

```python
import collections

import numpy as np
import torch
import torch.nn as nn


# LSTM-Enhanced Actor for Deadtime Handling
class LSTMActorNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, action_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, action_dim)

    def forward(self, seq_inputs):
        # seq_inputs: shape (batch, time_seq, input_dim)
        output, (h_n, c_n) = self.lstm(seq_inputs)
        # Take the LSTM output at the last time step
        last_output = output[:, -1, :]
        action = torch.tanh(self.fc(last_output))  # ensure [-1, 1] for PID gains
        return action


# Environment with Delayed Action Effect
class DelayedCSTREnvironment:
    def __init__(self, deadtime_steps=2, **kwargs):  # remaining settings elided
        self.deadtime_steps = deadtime_steps
        self.action_queue = collections.deque(maxlen=deadtime_steps + 1)
        ...

    def step(self, action):
        # Store the new action in the queue
        self.action_queue.append(action)

        # Once the queue is full, its oldest entry is exactly deadtime_steps old.
        # (Using >= here would apply the first action one step too early.)
        if len(self.action_queue) > self.deadtime_steps:
            actual_action = self.action_queue[0]
        else:
            actual_action = np.zeros_like(action)  # or default action
        # Next, simulate environment with 'actual_action'
        # (cstr_simulation is a placeholder for the CSTR model)
        next_state, reward, done, info = cstr_simulation(actual_action)
        ...
        return next_state, reward, done, info


# Adaptive RL Training with LSTM and Delayed Environment
# (num_episodes, seq_length, action_dim, actor, delayed_env, format_batch and
#  compute_overshoot_penalty are placeholders defined elsewhere)
for episode in range(num_episodes):
    seq_state = []
    seq_action = []
    state = delayed_env.reset()  # reset should also clear the action queue
    done = False
    # No hidden state is carried between steps: the actor re-reads the
    # whole window of recent states on every call.

    while not done:
        # Collect temporal inputs for LSTM
        seq_state.append(state)

        # If we have enough history, feed to LSTM
        if len(seq_state) >= seq_length:
            # Expected to return a tensor of shape (1, seq_length, input_dim)
            batch_seq = format_batch(seq_state, ...)
            with torch.no_grad():
                action = actor(batch_seq).squeeze(0).numpy()

            # Evaluate environment
            next_state, reward, done, info = delayed_env.step(action)

            # Manage overshoot penalty, error derivative
            # (the penalty is assumed to be returned as a non-positive value)
            reward += compute_overshoot_penalty(next_state)

            # Critic update possibly with multi-timescale logic
            # ...

            seq_state.pop(0)  # remove oldest
        else:
            # If not enough history, pick an initial action
            action = np.zeros(action_dim)
            next_state, reward, done, info = delayed_env.step(action)

        state = next_state
```
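
The timing of a delay queue like the one in `DelayedCSTREnvironment.step` is easy to get wrong by one step, so it is worth checking in isolation. The sketch below is a standalone, hypothetical helper (`ActionDelayLine` is not part of the code above) that returns each action exactly `deadtime_steps` calls after it was pushed and a default action before that.

```python
from collections import deque


class ActionDelayLine:
    """Returns each pushed action exactly `deadtime_steps` pushes later."""

    def __init__(self, deadtime_steps, default_action=0.0):
        self.deadtime_steps = deadtime_steps
        self.default_action = default_action
        self.queue = deque()

    def reset(self):
        # Call at the start of every episode so stale actions do not leak over.
        self.queue.clear()

    def push(self, action):
        self.queue.append(action)
        # The oldest entry is due once more than deadtime_steps actions are queued.
        if len(self.queue) > self.deadtime_steps:
            return self.queue.popleft()
        return self.default_action


line = ActionDelayLine(deadtime_steps=2)
applied = [line.push(a) for a in [1.0, 2.0, 3.0, 4.0, 5.0]]
print(applied)  # [0.0, 0.0, 1.0, 2.0, 3.0]
```

Each action reaches the process two pushes after it was issued, and setting `deadtime_steps=0` makes the helper pass actions through immediately.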


# 4. Benefits of This Enhanced Design

1. Overshoot Reduction:
    Because the LSTM "remembers" previous big changes in temperature/flow, the agent learns to wait for delayed effects before continuing to push the same control variable. 
2. More Realistic Control:
    The environment's step function includes a queue for delayed actions, mimicking real-world slow valve (or other action) responses. 
3. Interpretability Remains Feasible:
    The decision tree or "supervisory structure" can still provide rule-based logic about operating regimes, while the LSTM handles the more subtle aspects of delayed cause-and-effect.
4. Ease of Transfer to Real Plant:
    Modeling the deadtime in simulation means the policy is more likely to handle real delays.

# Another wild idea: LSTM + Kalman Filter enhanced RL

## 1. Overall Architecture

This system combines three key components:

* **Kalman Filter**: A model-based estimation that refines noisy measurements from the nonlinear CSTR system
* **LSTM network**: A recurrent neural network that takes a sequence of filtered state estimates and extracts temporal features, providing an enhanced state representation
* **Reinforcement Learning**: Using a Soft Actor-Critic framework, the agent learns to output PID controller tuning parameters (six gains) based on the enhanced state

Together, these components work in tandem to adaptively tune the PID controller for a nonlinear system with noise and deadtime.
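
Since the actor's `tanh` output lies in $[-1, 1]$, it has to be rescaled before it can be used as PID gains. The sketch below shows one possible linear mapping. The bounds in `GAIN_BOUNDS`, and the reading of "six gains" as $(K_p, K_i, K_d)$ for two loops, are assumptions made for illustration.

```python
def scale_to_gains(raw_action, bounds):
    """Map actor outputs in [-1, 1] linearly onto per-gain [low, high] ranges."""
    if len(raw_action) != len(bounds):
        raise ValueError("one (low, high) pair is required per action component")
    gains = []
    for a, (low, high) in zip(raw_action, bounds):
        a = max(-1.0, min(1.0, a))  # guard against values slightly out of range
        gains.append(low + 0.5 * (a + 1.0) * (high - low))
    return gains


# Hypothetical ranges: (Kp, Ki, Kd) repeated for two control loops.
GAIN_BOUNDS = [(0.0, 10.0), (0.0, 1.0), (0.0, 0.5)] * 2

mid_gains = scale_to_gains([0.0] * 6, GAIN_BOUNDS)
print(mid_gains)  # [5.0, 0.5, 0.25, 5.0, 0.5, 0.25]
```

Bounding the gains this way keeps the agent from requesting values outside a range that is known to be safe for the loop.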

## 2. Detailed Implementation Walkthrough

a. Kalman filter:
* Purpose:
    The Kalman filter estimates the system state from noisy sensor measurements. The standard Kalman filter assumes linear dynamics, so for a nonlinear system like the CSTR it is only an approximation, but it can still provide useful noise reduction, particularly near a steady operating point where the dynamics are approximately linear.

* Implementation Details:
    * Initialization: The filter starts with an initial state vector (x_hat) and covariance (P)
    * Predict Step: Uses the state transition matrix (F) and process noise covariance (Q) to predict the next state
    * Update Step: When a new measurement arrives, it computes the innovation (residual between measurement and prediction), calculates the Kalman gain, and updates both the state estimate and covariance
    * Customization: Optionally, extract a subset of state variables (e.g., concentration, temperature, volume) from the full state vector to focus the estimation on the most relevant aspects.
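
The predict and update steps above can be sketched for a single state variable. This scalar version is a simplification of the matrix form: `x_hat`, `P`, `F`, and `Q` follow the names used above, while the measurement gain `H`, the measurement noise variance `R`, and all default values are assumptions.

```python
class ScalarKalmanFilter:
    """One-dimensional Kalman filter with the standard predict/update cycle."""

    def __init__(self, x_hat, P, F=1.0, Q=1e-4, H=1.0, R=1e-2):
        self.x_hat = x_hat  # state estimate
        self.P = P          # estimate variance
        self.F = F          # state transition
        self.Q = Q          # process noise variance
        self.H = H          # measurement gain
        self.R = R          # measurement noise variance

    def predict(self):
        self.x_hat = self.F * self.x_hat
        self.P = self.F * self.P * self.F + self.Q
        return self.x_hat

    def update(self, z):
        innovation = z - self.H * self.x_hat   # measurement minus prediction
        S = self.H * self.P * self.H + self.R  # innovation variance
        K = self.P * self.H / S                # Kalman gain
        self.x_hat = self.x_hat + K * innovation
        self.P = (1.0 - K * self.H) * self.P
        return self.x_hat


kf = ScalarKalmanFilter(x_hat=0.0, P=1.0, F=1.0, Q=0.0, H=1.0, R=1.0)
kf.predict()
estimate = kf.update(2.0)
print(estimate, kf.P)  # 1.0 0.5
```

With equal prior and measurement variances the gain is 0.5, so the estimate lands halfway between the prediction and the measurement and the variance is halved.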

b. LSTM State Enhancer
* Purpose:
    The LSTM network processes a sliding window of state estimates (produced by the Kalman filter) to capture time-dependent patterns or trends. This is particularly important in systems with delays or deadtime, as the current state might depend on past states.
* Implementation Details: 
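
A minimal sketch of the sliding window that could feed the LSTM, assuming each filtered estimate is a flat list of floats. The class name `EstimateWindow` and the choice to pad the window with the first estimate at episode start are hypothetical.

```python
from collections import deque


class EstimateWindow:
    """Keeps the last `window_len` filtered state estimates as an LSTM input sequence."""

    def __init__(self, window_len):
        self.window = deque(maxlen=window_len)

    def reset(self):
        self.window.clear()

    def push(self, estimate):
        estimate = list(estimate)
        if not self.window:
            # Pad with the first estimate so the sequence always has full length.
            self.window.extend([estimate] * self.window.maxlen)
        else:
            self.window.append(estimate)  # the oldest estimate drops out automatically
        # Return copies, ordered oldest to newest: shape (window_len, state_dim).
        return [list(e) for e in self.window]


window = EstimateWindow(window_len=3)
seq1 = window.push([1.0, 2.0])
seq2 = window.push([3.0, 4.0])
print(seq1)
print(seq2)
```

The returned nested list has shape (time, features); adding a leading batch dimension, for example with `torch.tensor(seq2).unsqueeze(0)`, gives the (batch, time, features) layout expected by an LSTM created with `batch_first=True`.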
    
    