# Chapter 54: Reinforcement Learning for Time-Series

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand the fundamentals of reinforcement learning (RL) and its applicability to time‑series prediction and trading
- Formulate a time‑series problem as a Markov Decision Process (MDP) with states, actions, rewards, and transitions
- Implement tabular Q‑learning for discrete state‑action spaces and understand its limitations
- Apply Deep Q‑Networks (DQN) to handle continuous state spaces using neural networks
- Understand policy gradient methods and their advantages for continuous action spaces
- Implement actor‑critic methods (A2C, PPO) for stable and efficient learning
- Design custom environments for financial time‑series using libraries like Gymnasium
- Address exploration‑exploitation trade‑offs with epsilon‑greedy, Boltzmann exploration, and noise injection
- Construct reward functions that align with trading objectives (e.g., profit, risk‑adjusted returns)
- Recognize practical challenges when applying RL to financial markets: non‑stationarity, overfitting, transaction costs, and market impact

---

## Introduction

In previous chapters, we treated stock prediction as a supervised learning problem: given historical features, we predict the next day's direction or price. But what if we want an agent that not only predicts but also **decides** when to buy, sell, or hold? This is where **reinforcement learning** (RL) shines. RL is a paradigm where an agent learns to make sequential decisions by interacting with an environment, receiving rewards or penalties, and aiming to maximise cumulative reward.

For the NEPSE system, an RL agent could be trained to trade a portfolio of stocks. It would observe the market state (prices, volumes, technical indicators), take actions (buy, sell, hold), and receive rewards based on profit or loss. Over time, it learns a policy that maximises long‑term returns, potentially adapting to changing market conditions.

This chapter introduces reinforcement learning concepts and algorithms, with a focus on time‑series applications. We will build a simple trading environment for NEPSE data and implement several RL algorithms using Python and libraries such as Gymnasium, Stable‑Baselines3, and PyTorch.

---

## 54.1 Reinforcement Learning Fundamentals

Reinforcement learning is characterised by an **agent** interacting with an **environment**. At each time step `t`, the agent observes a **state** `s_t`, selects an **action** `a_t`, and receives a **reward** `r_t`. The environment then transitions to a new state `s_{t+1}`. The goal is to learn a **policy** `π(a|s)` that maximises the expected cumulative discounted reward, often called the **return**:

`G_t = r_t + γ r_{t+1} + γ² r_{t+2} + …`

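where `γ ∈ [0,1]` is the discount factor that balances immediate and future rewards: values near 0 make the agent short‑sighted, while values near 1 weight future rewards almost as heavily as immediate ones.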
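As a quick numerical check of the return formula, the following minimal sketch computes the discounted return for a short, finite reward sequence. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of the NEPSE system.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    g = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical rewards for three trading days
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

Working backwards through the rewards avoids computing powers of `γ` explicitly; the REINFORCE example later in the chapter uses the same trick.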

Key components:

- **Policy**: The agent's behaviour, mapping states to actions. It can be deterministic (`a = μ(s)`) or stochastic (`π(a|s)`).
- **Value function**: The expected return starting from state `s` and following policy `π`: `V^π(s) = E[G_t | s_t = s]`.
- **Action‑value function** (Q‑function): The expected return starting from state `s`, taking action `a`, then following policy `π`: `Q^π(s,a) = E[G_t | s_t = s, a_t = a]`.
- **Model**: The agent's representation of the environment dynamics (transition probabilities and reward function). RL methods can be **model‑based** (learn the model) or **model‑free** (learn directly from experience).

For financial trading, the environment is the market, the state includes market data and possibly the agent's holdings, actions are trading decisions, and rewards are typically profit (or risk‑adjusted profit).

---

## 54.2 Formulating the Trading Problem as an MDP

To apply RL, we must define the state space, action space, reward function, and transition dynamics (though transitions are often learned from data).

### State Space

The state should capture all relevant information for decision making. For NEPSE trading, the state at time `t` could include:

- Current price and volume (possibly normalized)
- Technical indicators (SMA, RSI, MACD) over recent windows
- Current portfolio holdings (cash, shares of each stock)
- Time features (day of week, month)

We can represent the state as a vector of numerical values. To respect the temporal nature, the state must be constructed from information available up to time `t` only (no look‑ahead).
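The no‑look‑ahead requirement can be made concrete with a small sketch. It builds a two‑element state from a price list using only values at or before time `t`, then checks that altering a future price leaves the state unchanged. The function `build_state`, the chosen features, and the sample prices are assumptions for illustration only.

```python
def build_state(prices, t, window=3):
    """Build a feature vector for time t using prices[t-window+1 .. t] only."""
    if t < max(window - 1, 1):
        raise ValueError("not enough history for this window")
    recent = prices[t - window + 1 : t + 1]        # slice ends at t, inclusive
    sma = sum(recent) / window                      # simple moving average
    last_return = prices[t] / prices[t - 1] - 1.0   # most recent one-step return
    return [prices[t] / sma, last_return]           # normalised price, return

prices = [100.0, 102.0, 101.0, 105.0, 110.0]
state = build_state(prices, t=3)
print(state)

# Changing a price after t=3 must not change the state at t=3
prices_future_changed = prices[:4] + [999.0]
assert build_state(prices_future_changed, t=3) == state
```

A check like the final assertion is a cheap way to catch accidental leakage when more indicators are added.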

### Action Space

Actions can be discrete or continuous:

- **Discrete**: For each stock, actions could be {Buy, Sell, Hold}. If multiple stocks are traded simultaneously, the action space becomes combinatorial.
- **Continuous**: The action could be the fraction of portfolio to allocate to each stock (e.g., portfolio weights). This is more realistic but also more challenging.

For simplicity, we start with a single stock and three discrete actions.
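To see how quickly the discrete action space grows when several stocks are traded at once, the sketch below enumerates every joint action. The helper name `joint_actions` is hypothetical.

```python
from itertools import product

ACTIONS = ("buy", "sell", "hold")

def joint_actions(n_stocks):
    """Enumerate every combination of per-stock actions."""
    return list(product(ACTIONS, repeat=n_stocks))

for n in (1, 2, 5):
    print(n, len(joint_actions(n)))  # 3, 9 and 243 joint actions
```

With three actions per stock the count is `3^n`, which is why the single‑stock case is a sensible starting point.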

### Reward Function

The reward should reflect the trading objective. Common choices:

- **Profit**: `r_t = (price_{t+1} - price_t) * shares_held` (unrealised profit) or realised profit after a sell.
- **Risk‑adjusted reward**: Sharpe ratio or profit minus penalty for volatility.
- **Including transaction costs**: `r_t = profit - cost * |trade_size|`.

We must be careful: rewards based on unrealised profits can encourage excessive risk‑taking. A common approach is to reward only realised profits when a trade is closed.
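The difference between the two reward styles can be sketched as follows. The cost model (a proportional fee charged on both the entry and the exit notional) and the function names are assumptions made for this example.

```python
def mark_to_market_reward(price_now, price_next, shares_held):
    """Unrealised profit over one step."""
    return (price_next - price_now) * shares_held

def realised_reward(entry_price, exit_price, shares, cost_rate):
    """Profit booked when a long trade is closed, net of proportional costs
    charged on both the entry and the exit notional (an assumed cost model)."""
    gross = (exit_price - entry_price) * shares
    costs = cost_rate * shares * (entry_price + exit_price)
    return gross - costs

# Hypothetical trade: buy 10 shares at 500, sell at 520, 0.1% cost per side
print(mark_to_market_reward(500.0, 505.0, 10))    # 50.0
print(realised_reward(500.0, 520.0, 10, 0.001))   # about 189.8 (200 gross - 10.2 costs)
```

A realised‑only reward is sparse, since it is zero on every step where no trade closes, which tends to make learning slower than with per‑step mark‑to‑market rewards.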

### Transitions

The environment's next state depends on the market's evolution, which is not known to the agent. In a model‑free RL setting, the agent learns directly from historical data by replaying sequences of states, actions, and rewards.

---

## 54.3 Tabular Q‑Learning

Q‑learning is a model‑free algorithm that learns the optimal action‑value function `Q*(s,a)`. The update rule:

`Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]`
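A single application of this update can be worked through numerically. The sketch below stores Q‑values in a dictionary keyed by `(state, action)`; the state labels, the starting values, and the helper `q_update` are illustrative assumptions.

```python
from collections import defaultdict

def q_update(Q, state, action, reward, next_state, alpha, gamma, n_actions=3):
    """Apply one Q-learning update in place and return the TD error."""
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return td_error

Q = defaultdict(float)          # unseen (state, action) pairs start at 0
Q[("s1", 1)] = 2.0              # pretend action 1 in state s1 already looks good
td = q_update(Q, state="s0", action=0, reward=1.0,
              next_state="s1", alpha=0.1, gamma=0.9)
print(td, Q[("s0", 0)])         # TD error about 2.8, new Q-value about 0.28
```

The target `1.0 + 0.9 * 2.0 = 2.8` pulls the old estimate of 0 a fraction `α = 0.1` of the way towards it.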

For tabular Q‑learning, we maintain a table of Q‑values for every state‑action pair. This is only feasible when the state space is small and discrete. Our trading state, however, is continuous. We could discretise the state variables (e.g., bin prices into low/medium/high), but this loses information and suffers from the curse of dimensionality.

Nonetheless, for illustration, we can build a simple discretised trading environment.

**Example: Discretised Trading with Q‑Learning**

```python
import numpy as np
import pandas as pd

class DiscretizedTradingEnv:
    def __init__(self, prices, n_bins=10):
        self.prices = np.asarray(prices, dtype=float)  # accept list, array, or Series
        self.n_bins = n_bins
        self.current_step = 0
        self.position = 0  # -1: short, 0: neutral, 1: long
        self.cash = 0
        self.max_steps = len(self.prices) - 1

        # Discretize price into bins; clip so the maximum price lands in the last bin
        price_min, price_max = self.prices.min(), self.prices.max()
        self.bins = np.linspace(price_min, price_max, n_bins+1)
        self.bin_indices = np.clip(np.digitize(self.prices, self.bins) - 1, 0, n_bins - 1)

    def reset(self):
        self.current_step = 0
        self.position = 0
        self.cash = 0
        return self._get_state()

    def _get_state(self):
        # State: (price bin, current position)
        price_bin = self.bin_indices[self.current_step]
        return (price_bin, self.position + 1)  # position shifted to 0,1,2

    def step(self, action):
        # Actions: 0=buy, 1=sell, 2=hold
        price = self.prices[self.current_step]
        reward = 0
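        if action == 0:  # buy
            if self.position == 0:
                self.position = 1
                self.cash -= price  # cash now holds minus the entry price
            elif self.position == -1:
                # Cover the short: profit = entry price - current price
                self.position = 0
                reward = self.cash - price
                self.cash = 0
            # else already long, do nothing (or could be invalid)
        elif action == 1:  # sell
            if self.position == 1:
                # Close the long: profit = current price - entry price
                self.position = 0
                reward = self.cash + price
                self.cash = 0
            elif self.position == 0:
                self.position = -1
                self.cash += price  # cash now holds the short entry price
            # else already short, do nothing
        # else hold: do nothing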

        self.current_step += 1
        done = self.current_step >= self.max_steps
        next_state = self._get_state() if not done else None
        return next_state, reward, done, {}
```

Now we can run Q‑learning on this environment.

```python
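# `prices` is assumed to be a 1-D array or Series of closing prices loaded earlier
# Initialize Q-table: states = (price_bin, position) -> 3 actions
n_price_bins = 10
n_positions = 3
Q = np.zeros((n_price_bins, n_positions, 3))

env = DiscretizedTradingEnv(prices, n_bins=n_price_bins)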
alpha = 0.1
gamma = 0.95
epsilon = 0.1
episodes = 1000

for ep in range(episodes):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy
        if np.random.rand() < epsilon:
            action = np.random.randint(3)
        else:
            action = np.argmax(Q[state[0], state[1], :])

        next_state, reward, done, _ = env.step(action)
        if next_state is not None:
            # Q-learning update
            best_next = np.max(Q[next_state[0], next_state[1], :])
            td_target = reward + gamma * best_next
        else:
            td_target = reward

        Q[state[0], state[1], action] += alpha * (td_target - Q[state[0], state[1], action])
        state = next_state
```

**Explanation:**  
We discretise the price into bins and use the position as part of the state. Each Q‑table entry is nudged towards the one‑step target `r + γ max_a Q(s', a)`, a sampled form of the Bellman optimality equation. This simple agent can learn a policy, but its performance is limited by the coarse discretisation.

---

## 54.4 Deep Q‑Networks (DQN)

To handle continuous state spaces, we can approximate the Q‑function with a neural network: `Q(s,a; θ)`. The **Deep Q‑Network (DQN)** algorithm uses a replay buffer to store experiences and a target network to stabilise training.

**Algorithm outline:**
- Maintain a replay buffer of tuples `(s, a, r, s', done)`.
- Sample a mini‑batch uniformly from the buffer.
- Compute target: `y = r + γ max_a' Q_target(s', a')` if the episode continues, or `y = r` if `s'` is terminal.
- Update the main network by minimising `(y - Q_main(s,a))²`.
- Periodically copy main network weights to target network.
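The first three steps of the outline can be sketched without any deep‑learning library. Here a plain function stands in for the target network; the class `ReplayBuffer`, the helper `td_targets`, and the fixed Q‑values are assumptions for illustration, not the Stable‑Baselines3 internals.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement
        return rng.sample(list(self.buffer), batch_size)

def td_targets(batch, q_target, gamma):
    """y = r if done, else r + gamma * max_a' Q_target(s', a')."""
    return [r if done else r + gamma * max(q_target(s_next))
            for (_, _, r, s_next, done) in batch]

def toy_q_target(state):
    """Stand-in target network: fixed Q-values for three actions."""
    return [0.0, 1.0, 0.5]

buf = ReplayBuffer(capacity=100)
buf.push(s=0, a=1, r=1.0, s_next=1, done=False)
buf.push(s=1, a=2, r=-0.5, s_next=2, done=True)
targets = td_targets(list(buf.buffer), toy_q_target, gamma=0.9)
print(targets)  # about [1.9, -0.5]
```

In a real DQN the main network would then be regressed towards these targets, and `toy_q_target` would be a periodically synchronised copy of that network.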

**Implementing DQN for NEPSE Trading with Stable‑Baselines3**

Stable‑Baselines3 provides robust implementations of DQN and other RL algorithms. First, we need to create a custom Gymnasium environment.

```python
import gymnasium as gym
from gymnasium import spaces
import numpy as np
import pandas as pd

class NEPSEtradingEnv(gym.Env):
    """
    Custom Environment for NEPSE trading.
    State: vector of features (price, volume, technical indicators, position)
    Actions: 0=hold, 1=buy, 2=sell
    Reward: change in portfolio value after transaction costs.
    """
    def __init__(self, df, window_size=10, transaction_cost=0.001):
        super().__init__()
        self.df = df.reset_index(drop=True)
        self.window_size = window_size
        self.transaction_cost = transaction_cost
        self.n_features = len(df.columns)  # assume df holds only numeric feature columns (no 'Date')

        # Action space: discrete 0,1,2
        self.action_space = spaces.Discrete(3)

        # Observation space: window of past features + position indicator
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(window_size, self.n_features + 1), dtype=np.float32
        )

    def _get_obs(self):
        # Get last `window_size` rows of features
        end = self.current_step
        start = end - self.window_size
        obs = self.df.iloc[start:end].values
        # Add position as an extra feature at each time step (repeat)
        position_feature = np.full((self.window_size, 1), self.position)
        obs = np.concatenate([obs, position_feature], axis=1)
        return obs.astype(np.float32)

    def reset(self, seed=None, options=None):
        # Gymnasium's reset signature takes both `seed` and `options`
        super().reset(seed=seed)
        self.current_step = self.window_size
        self.position = 0  # 0: no position, 1: long, -1: short
        self.cash = 1.0  # initial capital normalized
        self.shares = 0
        return self._get_obs(), {}

    def _portfolio_value(self, price):
        # Mark-to-market value; `shares` is negative while short
        return self.cash + self.shares * price

    def step(self, action):
        price = float(self.df.iloc[self.current_step]['Close'])  # assume 'Close' column
        value_before = self._portfolio_value(price)  # measured before trading
        cost = self.transaction_cost

        # Execute action at the current close
        if action == 1:  # buy
            if self.position == -1:
                # Cover the short: buy back the borrowed shares, paying costs
                self.cash += self.shares * price * (1 + cost)
                self.shares = 0
                self.position = 0
            elif self.position == 0:
                # Go long with all cash
                self.shares = self.cash / price * (1 - cost)
                self.cash = 0
                self.position = 1
            # else already long, do nothing (or could allow increasing position)
        elif action == 2:  # sell
            if self.position == 1:
                # Sell all shares
                self.cash = self.shares * price * (1 - cost)
                self.shares = 0
                self.position = 0
            elif self.position == 0:
                # Short (simplified): sell borrowed shares worth the current cash
                n_short = self.cash / price
                self.cash += n_short * price * (1 - cost)
                self.shares = -n_short
                self.position = -1
        # else hold: do nothing

        # Move to next step; the episode ends at the last row of data
        self.current_step += 1
        terminated = self.current_step >= len(self.df) - 1

        # Reward: change in total portfolio value, including transaction costs
        new_price = float(self.df.iloc[self.current_step]['Close'])
        reward = float(self._portfolio_value(new_price) - value_before)

        # Always return a valid observation, even on the final step
        return self._get_obs(), reward, terminated, False, {}
```

Now we can use Stable‑Baselines3 to train a DQN agent.

```python
from stable_baselines3 import DQN
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.callbacks import EvalCallback

# Create environment
df = pd.read_csv('nepse_features.csv')  # assume preprocessed
env = NEPSEtradingEnv(df)
check_env(env)  # verify environment

# Instantiate DQN agent
model = DQN('MlpPolicy', env, verbose=1,
            learning_rate=1e-3,
            buffer_size=50000,
            learning_starts=1000,
            batch_size=32,
            tau=0.1,
            gamma=0.99,
            exploration_fraction=0.1,
            exploration_final_eps=0.02)

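# Optional: evaluation callback.
# NOTE: this evaluates on the same data used for training, so it only tracks
# in-sample performance; pass a held-out DataFrame for a meaningful check.
eval_env = NEPSEtradingEnv(df)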
eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/',
                             log_path='./logs/', eval_freq=1000,
                             deterministic=True, render=False)

# Train
model.learn(total_timesteps=50000, callback=eval_callback)

# Save
model.save("dqn_nepse_trader")
```

**Explanation:**  
The environment provides a state consisting of a window of features plus the current position. DQN uses a neural network to approximate Q‑values for each action. The agent learns by interacting with historical data, but training on a single historical path can cause overfitting. A common mitigation is to run many episodes with random start points or over a sliding window.
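One way to implement random start points is to draw an episode window at each reset. The sketch below only picks the row bounds; the function `sample_episode_bounds` and the episode length of 250 rows are assumptions, and wiring the result into an environment's `reset` is left out.

```python
import random

def sample_episode_bounds(n_rows, window_size, episode_length, rng):
    """Pick a random [start, end) slice that leaves room for the first window."""
    min_start = window_size
    max_start = n_rows - episode_length
    if max_start < min_start:
        raise ValueError("data too short for this episode length")
    start = rng.randint(min_start, max_start)   # inclusive on both ends
    return start, start + episode_length

rng = random.Random(0)                           # seeded for reproducibility
for _ in range(3):
    start, end = sample_episode_bounds(n_rows=1000, window_size=10,
                                       episode_length=250, rng=rng)
    print(start, end)
```

Each episode then covers a different stretch of history, so the agent cannot simply memorise one fixed sequence of prices.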

---

## 54.5 Policy Gradient Methods

Instead of learning a value function and deriving a policy, policy gradient methods directly optimise the policy `π(a|s; θ)` using gradient ascent on expected return. The REINFORCE algorithm is a simple policy gradient method:

`∇θ J(θ) ≈ E[ ∑_t ∇θ log π(a_t|s_t; θ) G_t ]`

where `G_t` is the return from time `t`. This is an unbiased estimate but has high variance.

**Advantages:**
- Naturally handles continuous action spaces.
- Can learn stochastic policies (useful for exploration).
- Change the policy smoothly as the parameters change, which can make learning more stable than greedy value‑based updates on some problems.

**Example: REINFORCE for NEPSE trading**

We'll use PyTorch to implement a simple policy network.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_actions):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, n_actions)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.softmax(self.fc3(x), dim=-1)

def reinforce(env, policy, optimizer, episodes=1000, gamma=0.99):
    for episode in range(episodes):
        log_probs = []
        rewards = []
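        state, _ = env.reset()  # Gymnasium returns (obs, info)
        done = False
        while not done:
            # Flatten the (window, features) observation into a single vector
            state_t = torch.as_tensor(state, dtype=torch.float32).flatten().unsqueeze(0)
            probs = policy(state_t)
            m = Categorical(probs)
            action = m.sample()
            log_prob = m.log_prob(action)
            # Gymnasium's step returns (obs, reward, terminated, truncated, info)
            next_state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state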

        # Compute discounted returns
        returns = []
        R = 0
        for r in reversed(rewards):
            R = r + gamma * R
            returns.insert(0, R)
        returns = torch.tensor(returns, dtype=torch.float32)  # match the policy's dtype
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize

        # Compute loss
        loss = []
        for log_prob, R in zip(log_probs, returns):
            loss.append(-log_prob * R)
        loss = torch.cat(loss).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Usage
env = NEPSEtradingEnv(df)
policy = PolicyNetwork(input_dim=env.observation_space.shape[0]*env.observation_space.shape[1], 
                       hidden_dim=64, n_actions=3)
optimizer = optim.Adam(policy.parameters(), lr=1e-3)
reinforce(env, policy, optimizer)
```

**Explanation:**  
The policy network outputs probabilities over actions. The loss is the negative log‑probability of the taken action weighted by the return. Over episodes, the policy shifts towards actions that lead to higher returns.

---

## 54.6 Actor‑Critic Methods

Actor‑critic methods combine the benefits of value‑based and policy‑based methods. They have two components:

- **Actor**: the policy network that selects actions.
- **Critic**: a value network (usually state‑value `V(s)` or advantage) that evaluates the actor's choices, reducing variance.

**Advantage Actor‑Critic (A2C)** uses the advantage function `A(s,a) = Q(s,a) - V(s)` to weight the policy gradient. The critic learns to estimate `V(s)`, and the actor is updated with:

`∇θ J(θ) ≈ E[ ∇θ log π(a|s) A(s,a) ]`
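Since the critic only estimates `V(s)`, the advantage is commonly approximated by the one‑step temporal‑difference error `r + γ V(s') - V(s)` rather than by learning `Q(s,a)` separately. A minimal sketch, with a hypothetical helper name and made‑up critic outputs:

```python
def one_step_advantage(reward, value_s, value_next, gamma, done):
    """TD-error estimate of the advantage: r + gamma*V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * value_next  # no future value after the end
    return reward + bootstrap - value_s

# Hypothetical critic outputs for one transition
print(one_step_advantage(reward=0.5, value_s=1.0, value_next=2.0,
                         gamma=0.5, done=False))   # 0.5
print(one_step_advantage(reward=0.5, value_s=1.0, value_next=2.0,
                         gamma=0.5, done=True))    # -0.5
```

A positive value means the action turned out better than the critic expected, so its probability is increased; a negative value decreases it.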

**Proximal Policy Optimization (PPO)** is a more advanced actor‑critic method that clips the policy update to prevent too large changes, ensuring stable learning. Stable‑Baselines3 provides PPO and A2C implementations.
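The clipping idea can be illustrated for a single sample. Here `ratio` is the probability of the taken action under the new policy divided by its probability under the old one; the function name and the sample numbers are assumptions for this sketch.

```python
def ppo_clip_objective(ratio, advantage, clip_range=0.2):
    """Per-sample clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clip_objective(ratio=1.5, advantage=1.0))    # about 1.2: gain capped at 1+eps
print(ppo_clip_objective(ratio=0.5, advantage=-1.0))   # about -0.8: clipped, pessimistic
print(ppo_clip_objective(ratio=1.1, advantage=1.0))    # 1.1: inside the range, unchanged
```

Once the ratio leaves the interval `[1 - ε, 1 + ε]` in the direction that would improve the objective, the gradient with respect to the policy vanishes, which removes the incentive for very large updates.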

**Example: Training a PPO agent for NEPSE trading**

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Wrap in a vectorised env (DummyVecEnv steps its envs sequentially in one process;
# SubprocVecEnv is the variant that runs them in parallel)
env = DummyVecEnv([lambda: NEPSEtradingEnv(df)])

# PPO agent
model = PPO('MlpPolicy', env, verbose=1,
            learning_rate=3e-4,
            n_steps=2048,
            batch_size=64,
            n_epochs=10,
            gamma=0.99,
            gae_lambda=0.95,
            clip_range=0.2)

model.learn(total_timesteps=100000)
model.save("ppo_nepse_trader")
```

**Explanation:**  
PPO is often more robust than DQN and can handle continuous action spaces (if we extend to portfolio allocation). It uses a clipped surrogate objective to avoid destructive large policy updates.

---

## 54.7 Time‑Series RL Applications

Beyond trading, RL can be applied to other time‑series tasks:

- **Dynamic asset allocation**: Allocate portfolio weights across multiple assets.
- **Order execution**: Learn to split a large order to minimise market impact.
- **Market making**: Provide liquidity by placing limit orders.
- **Parameter adaptation**: Adjust model hyperparameters (e.g., of a prediction model) in response to market changes.

For the NEPSE system, a multi‑asset trading agent could manage a portfolio of top stocks, rebalancing periodically. The state would include features for each stock and current holdings.
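For a multi‑asset agent with continuous actions, one possible design (an assumption here, not something prescribed above) is to let the policy output one unconstrained score per stock and map the scores to long‑only portfolio weights with a softmax:

```python
import math

def softmax_weights(scores):
    """Map unconstrained scores to long-only weights that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw actions for three stocks
print(softmax_weights([0.0, 0.0, 0.0]))   # equal scores give equal weights of 1/3
print(softmax_weights([2.0, 0.0, -1.0]))  # a higher score gets a larger allocation
```

This mapping cannot express short positions or a cash holding; adding an extra score for cash is a simple extension.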

---

## 54.8 Exploration Strategies

In RL, the agent must explore to discover rewarding actions. Common strategies:

- **ε‑greedy**: With probability ε, take a random action; otherwise, take the greedy action. Used in DQN.
- **Boltzmann (softmax) exploration**: Sample actions with probability proportional to `exp(Q(s,a)/τ)`, where the temperature `τ` controls randomness (high `τ` is close to uniform, low `τ` is close to greedy).
- **Noise injection**: Add noise to the policy (e.g., Ornstein‑Uhlenbeck process for continuous actions).
- **Upper Confidence Bound (UCB)**: Select actions based on both estimated value and uncertainty.

For financial applications, exploration must be cautious to avoid large losses. Often, we start with high exploration and gradually decay (e.g., ε decay in DQN). In policy gradient methods, the stochastic policy itself provides exploration.
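The first two strategies and a decaying ε can be sketched in a few lines. The linear schedule below follows the same idea as the `exploration_fraction` and `exploration_final_eps` arguments used in the DQN example, but it is an illustrative reimplementation with hypothetical function names, not the library's code.

```python
import math
import random

def linear_epsilon(step, total_steps, eps_start=1.0, eps_end=0.02, fraction=0.1):
    """Decay epsilon linearly over the first `fraction` of training, then hold."""
    progress = min(1.0, step / (fraction * total_steps))
    return eps_start + progress * (eps_end - eps_start)

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; lower temperature means greedier choices."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

rng = random.Random(0)
q = [0.1, 0.5, 0.2]
print(linear_epsilon(0, 50_000), linear_epsilon(50_000, 50_000))  # 1.0, then about 0.02
print(boltzmann_probs(q, temperature=0.1))   # most mass on action 1
print(epsilon_greedy(q, epsilon=0.0, rng=rng))  # 1 (the greedy action)
```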

---

## 54.9 Reward Design

Reward design is critical in RL. A poorly designed reward can lead to unintended behaviour. For trading:

- **Simple profit**: `reward = profit` after each step (or at the end of an episode). This may encourage the agent to hold positions through adverse moves.
- **Risk‑adjusted profit**: Penalise volatility or drawdowns. For example, `reward = profit - λ * (return_std)`.
- **Sharpe ratio**: Reward per unit of risk. Can be computed over a window.
- **Realised P&L only**: Reward only when a trade is closed (sell). This avoids rewarding unrealised gains.

Also, include **transaction costs** to discourage excessive trading. A typical reward step:

```python
# After executing action, compute portfolio value change
new_value = cash + shares * current_price
reward = new_value - previous_value - transaction_cost * abs(trade_volume)
```
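The risk‑adjusted variant can be sketched in the same spirit. The version below subtracts a penalty proportional to the standard deviation of recent step returns; the window contents, the penalty weight, and the function name are assumptions chosen for illustration.

```python
import statistics

def risk_adjusted_reward(step_returns, risk_penalty):
    """Latest return minus a penalty on the volatility of recent returns."""
    latest = step_returns[-1]
    vol = statistics.pstdev(step_returns) if len(step_returns) > 1 else 0.0
    return latest - risk_penalty * vol

calm = [0.01, 0.01, 0.01]
choppy = [-0.05, 0.07, 0.01]
print(risk_adjusted_reward(calm, risk_penalty=0.5))    # no volatility, so no penalty
print(risk_adjusted_reward(choppy, risk_penalty=0.5))  # same latest return, lower reward
```

Both windows end on the same 1% return, but the choppy path is rewarded less, which pushes the agent towards smoother equity curves.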

---

## 54.10 Practical Considerations

Applying RL to financial time‑series is challenging due to:

- **Non‑stationarity**: Market dynamics change over time. An agent trained on past data may fail in the future. Solutions: retrain periodically, use adaptive algorithms, or include recent data in training.
- **Overfitting**: With a single historical path, the agent can memorise patterns. Use multiple episodes with different starting points, and validate on out‑of‑sample periods.
- **Transaction costs and liquidity**: Realistic simulations must include costs, slippage, and limited liquidity (orders affect prices).
- **Evaluation**: Use walk‑forward testing: train on period 1, test on period 2; retrain on periods 1+2, test on period 3, and so on. Report performance averaged across the test periods.
- **Risk management**: An RL agent might take excessive risks. Incorporate risk constraints (e.g., maximum position size, stop‑loss) either in the environment or in the reward.
- **Computational resources**: RL training can be slow. Use vectorised environments and GPUs for neural networks.

**Example: Walk‑forward validation**

```python
import pandas as pd
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Assume df has a 'Date' column
dates = pd.to_datetime(df['Date'])
train_start, train_end = '2020-01-01', '2021-12-31'
test_start, test_end = '2022-01-01', '2022-12-31'

# Split data; drop 'Date' so the environment sees numeric features only
features = df.drop(columns=['Date'])
train_df = features[(dates >= train_start) & (dates <= train_end)]
test_df = features[(dates >= test_start) & (dates <= test_end)]

# Create envs
train_env = DummyVecEnv([lambda: NEPSEtradingEnv(train_df)])
test_env = NEPSEtradingEnv(test_df)

# Train
model = PPO('MlpPolicy', train_env, verbose=0)
model.learn(total_timesteps=50000)

# Evaluate on test (Gymnasium API: reset -> (obs, info); step -> 5 values)
obs, _ = test_env.reset()
total_reward = 0.0
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = test_env.step(int(action))
    done = terminated or truncated
    total_reward += reward
print(f"Test total reward: {total_reward}")
```
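The example above covers a single train/test split. To generate the full sequence of expanding‑window folds described in the evaluation bullet, a small generator is enough. The name `walk_forward_folds` and the integer period labels are assumptions; in practice each label would map to a date range.

```python
def walk_forward_folds(n_periods, min_train=1):
    """Yield (train_periods, test_period) pairs with an expanding training set."""
    for test in range(min_train, n_periods):
        yield list(range(test)), test

# Hypothetical: four yearly periods labelled 0..3
for train, test in walk_forward_folds(4):
    print(f"train on {train}, test on {test}")
# train on [0], test on 1
# train on [0, 1], test on 2
# train on [0, 1, 2], test on 3
```

Each test period lies strictly after every period used for training, so no fold leaks future information into the model.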

---

## Chapter Summary

In this chapter, we introduced reinforcement learning and its application to time‑series prediction and trading, using the NEPSE stock market as a running example. We covered:

- The fundamental concepts of RL: agent, environment, state, action, reward, policy, value functions.
- How to formulate a trading problem as a Markov Decision Process (MDP).
- Tabular Q‑learning and its limitations for continuous states.
- Deep Q‑Networks (DQN) with experience replay and target networks, implemented using Stable‑Baselines3.
- Policy gradient methods (REINFORCE) and their advantages.
- Actor‑critic methods (A2C, PPO) for stable and efficient learning.
- Designing custom Gymnasium environments for trading.
- Exploration strategies and reward design tailored to financial objectives.
- Practical challenges like non‑stationarity, overfitting, transaction costs, and validation.

Reinforcement learning offers a powerful framework for developing adaptive trading strategies that go beyond simple prediction. However, it requires careful design and rigorous validation to succeed in the real world. For the NEPSE system, RL could be used to create a trading agent that learns to exploit patterns in Nepalese stocks, but must be constantly monitored and retrained.

In the next chapter, we will discuss **Probabilistic Forecasting**, which provides uncertainty estimates alongside predictions, a crucial aspect for risk management.

---

**End of Chapter 54**