# Reinforcement Learning Integration with NeqSim

This notebook demonstrates how **NeqSim** can be embedded into reinforcement learning (RL) workflows for process control and optimization.

- [Reinforcement learning (Wikipedia)](https://en.wikipedia.org/wiki/Reinforcement_learning)
- [OpenAI Gym](https://www.gymlibrary.dev/)
- [Introductory RL video](https://www.youtube.com/watch?v=2pWv7GOvuf0)


## Birth of Reinforcement Learning

The origins of RL trace back to early work on trial-and-error learning in the 1950s and the formalization of temporal-difference methods by Sutton and Barto in the 1980s.
The goal is to learn a policy $\pi$ that maximizes the expected discounted return, i.e. the discounted sum of rewards collected while following $\pi$:

$$ J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] $$

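where $\gamma \in [0, 1)$ is the discount factor and $r_t$ is the reward received at time step $t$. Values of $\gamma$ close to 1 weight long-term rewards almost as heavily as immediate ones.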
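
For a single finite episode, the sum inside the expectation can be evaluated directly. The sketch below does this for one reward sequence; the function name `discounted_return` and the sample rewards are illustrative assumptions, not part of NeqSim or Gym.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma**t * rewards[t] for one finite episode."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Made-up rewards from a three-step episode.
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

Averaging this quantity over many episodes generated by the same policy gives a Monte Carlo estimate of $J(\pi)$.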


## Embedding NeqSim in an RL Environment

By wrapping NeqSim simulations inside an [OpenAI Gym](https://www.gymlibrary.dev/) interface, agents can interact with a simulated process to learn control policies.


In [None]:
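import gym
import numpy as np
from gym import spaces
# from neqsim import thermodynamics  # hypothetical import

class NeqSimEnv(gym.Env):
    """Minimal example of a NeqSim-powered environment.

    Uses the classic Gym API (gym < 0.26): reset() returns an observation
    and step() returns (observation, reward, done, info).
    """
    metadata = {'render.modes': []}

    def __init__(self):
        super().__init__()
        # Placeholder spaces: a 3-element observation and a 1-element action, both in [-1, 1].
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

    def step(self, action):
        # Run NeqSim simulation step here
        # state = thermodynamics.run_process(action)
        state = self.observation_space.sample()  # placeholder: random observation
        reward = -0.1  # placeholder for thermodynamic reward
        done = False  # placeholder: the episode never terminates on its own
        return state, reward, done, {}

    def reset(self):
        # Placeholder: start from a random observation instead of a simulated state.
        return self.observation_space.sample()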
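
An agent interacts with such an environment through a `reset`/`step` loop. The following is a minimal sketch of that loop, assuming the classic four-value `step` return used above. To stay runnable without Gym or NeqSim installed, it uses a hypothetical stand-in class, `ToyProcessEnv`, whose one-dimensional "process" is invented for illustration and does not call NeqSim. The two policies are likewise illustrative.

```python
import random


class ToyProcessEnv:
    """Hypothetical stand-in with the same reset/step interface as NeqSimEnv.

    It does not call NeqSim: the "process" is one scalar state that the
    action nudges toward or away from a setpoint of 0.
    """

    def __init__(self, max_steps=20, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.state = 0.0
        self.steps = 0

    def reset(self):
        self.state = self.rng.uniform(-1.0, 1.0)
        self.steps = 0
        return self.state

    def step(self, action):
        # Clip the action to the same [-1, 1] range used by NeqSimEnv.
        action = max(-1.0, min(1.0, action))
        self.state = max(-1.0, min(1.0, self.state + 0.1 * action))
        self.steps += 1
        reward = -abs(self.state)  # penalize distance from the setpoint
        done = self.steps >= self.max_steps  # fixed-length episodes
        return self.state, reward, done, {}


def run_episode(env, policy):
    """Roll out one episode and return the undiscounted sum of rewards."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        obs, reward, done, _info = env.step(policy(obs))
        total_reward += reward
    return total_reward


env = ToyProcessEnv()
random_policy = lambda obs: random.uniform(-1.0, 1.0)
proportional_policy = lambda obs: -5.0 * obs  # push the state toward 0

print("random policy:      ", run_episode(env, random_policy))
print("proportional policy:", run_episode(env, proportional_policy))
```

Because `NeqSimEnv.step` always returns `done = False`, a loop like `run_episode` would need an explicit step limit (as `ToyProcessEnv` has) before it could be run against that class.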


## Reward Design with Thermodynamic Insights

NeqSim's detailed thermodynamic calculations enable reward functions that capture operational objectives such as energy efficiency, cost, and safety. A generic reward can be expressed as

$$ r = -\alpha E - \beta C + \lambda S $$

where $E$ is energy usage, $C$ is operating cost, and $S$ is a safety metric computed from NeqSim outputs, defined so that larger values indicate safer operation. The non-negative weights $\alpha$, $\beta$, and $\lambda$ set the trade-off between the three objectives and thereby steer the RL agent toward efficient and safe operation.
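
The weighted sum above translates directly into code. The sketch below is one possible way to organize it; `RewardWeights`, `compute_reward`, and the numeric inputs are hypothetical names and made-up values, and in practice the three inputs would be derived from simulator outputs and scaled to comparable magnitudes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical container for the energy, cost, and safety weights."""
    energy: float = 1.0
    cost: float = 1.0
    safety: float = 1.0


def compute_reward(energy_use, operating_cost, safety_metric, weights=RewardWeights()):
    """Weighted reward: penalize energy and cost, reward the safety metric."""
    return (-weights.energy * energy_use
            - weights.cost * operating_cost
            + weights.safety * safety_metric)


# Made-up, already normalized inputs for illustration only.
w = RewardWeights(energy=0.5, cost=0.2, safety=2.0)
print(compute_reward(energy_use=0.8, operating_cost=0.5, safety_metric=0.9, weights=w))
```

Keeping the weights in a single immutable object makes it easy to log them alongside training runs and to compare agents trained under different trade-offs.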
