## Reinforcement Learning in Option Hedging and Market Making

---

#### **0. Motivation and Big Picture**

You are given a **simplified BTC perpetual market-making strategy** that earns small spreads but can accumulate **large directional inventory** during trending moves.

This assignment focuses on designing a **risk-management layer** using **options**, trained and evaluated inside a **simulated environment**.

The high-level tasks:

1. **Build a BTC price simulator**
   combining
   **QED nonlinear diffusion** (continuous part) and
   **Hawkes jumps** (event part),
   calibrated using real BTC data.

2. **Construct a simple option market layer**
   (ATM / ±10% moneyness, multiple maturities, BS pricing).

3. **Train RL hedging strategies**
   that trade options to hedge the market maker's inventory
   under the same simulated environment.

**Simplifying assumptions**

BTC perpetual
- No funding-rate effects.
- No maker fees.
- No slippage, partial fills or execution delays.

Option hedging
- Each option trade costs 0.05% of notional.
- No market impact, full liquidity.

---

#### **1. Data & Stylized Facts (BTC 5-Minute OHLCV)**

You are provided with BTC **5-minute OHLCV** data.

Your objective is not precise statistical modeling but extracting **stylized facts** that guide model calibration.

You may choose any reasonable analyses, but typical stylized facts include the return distribution (heavy tails), volatility behaviour (clustering), jump-like behaviour, and autocorrelations of returns and absolute returns.

#### **1.1 Tasks**

- Load data, plot BTC price.
- Compute log-returns.
- Show stylized facts you consider relevant.
- Identify empirical features that should influence:
  - QED diffusion calibration (coarse scale)
  - jump detection threshold
  - Hawkes excitation parameters
- Briefly explain your choices.
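
As a starting point for these tasks, the following is a minimal sketch using only the Python standard library. The function names (`log_returns`, `excess_kurtosis`, `autocorr`) are illustrative, and the price series is synthetic Gaussian data standing in for the provided close column, so it will not show the heavy tails or clustering you should expect from real BTC data.

```python
import math
import random
import statistics

def log_returns(prices):
    """Log-returns r_n = log(S_{n+1} / S_n) from a list of prices."""
    return [math.log(b / a) for a, b in zip(prices[:-1], prices[1:])]

def excess_kurtosis(xs):
    """Sample excess kurtosis (0 for a Gaussian); positive values indicate heavy tails."""
    m = statistics.fmean(xs)
    var = statistics.fmean([(x - m) ** 2 for x in xs])
    m4 = statistics.fmean([(x - m) ** 4 for x in xs])
    return m4 / var ** 2 - 3.0

def autocorr(xs, lag):
    """Sample autocorrelation of xs at a positive lag."""
    m = statistics.fmean(xs)
    num = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(len(xs) - lag))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

# Synthetic stand-in for the 5-minute close column (assumption: real data is loaded from file).
random.seed(0)
closes = [100.0]
for _ in range(2000):
    closes.append(closes[-1] * math.exp(random.gauss(0.0, 0.002)))

r = log_returns(closes)
abs_r = [abs(x) for x in r]
print("excess kurtosis:", excess_kurtosis(r))
print("ACF(1) of returns:", autocorr(r, 1))
print("ACF(1) of |returns| (volatility clustering proxy):", autocorr(abs_r, 1))
```

On real data, a clearly positive excess kurtosis and a slowly decaying autocorrelation of absolute returns are the features that motivate the jump threshold and the Hawkes excitation parameters.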


---

#### **2. BTC Price Simulator (QED Diffusion + Hawkes Jumps)**

We construct a BTC price simulator combining:

1. a **QED-style nonlinear diffusion**, calibrated by **MLE** on **1h prices**, and
2. **self-exciting Hawkes jumps**, calibrated from **5-minute data** using
   volatility-based jump detection.


We model BTC on two time scales:

1. Slow nonlinear reversion + growth behavior
(macro/meso-scale → 1h grid)
captured by the QED diffusion.

2. Fast, bursty jump activity
(micro/meso-scale → 5-min grid)
captured by Hawkes-driven jump arrivals.

| Time scale | Purpose                              | Model             |
| ---------- | ------------------------------------ | ----------------- |
| 1h         | capture macro drift + mean reversion | QED diffusion MLE |
| 5-min      | detect / model jumps                 | Hawkes process    |
| 5-min      | combine drift + diffusion + jumps    | full simulator    |

We calibrate the QED parameters on 1-hour BTC log-prices rather than 5-minute data, because 5-minute returns are dominated by microstructure noise, which would make the nonlinear QED drift parameters unstable and hard to identify.

---

### 2.1 QED diffusion calibration on 1-hour BTC prices

We calibrate the QED diffusion on 1-hour BTC log-prices. 

---

#### 2.1.1 QED dynamics in log-price

Let the log-price be

$$
y_t = \log X_t.
$$

In log-space, the QED model satisfies the Langevin equation

$$
dy_t
=
- \frac{\partial V(y_t)}{\partial y}\, dt
+
\sigma\, dW_t,
$$

with potential

$$
V(y)
=
-
\left(
\theta - \frac{\sigma^2}{2}
\right) y
+
\kappa e^{y}
+
\frac{1}{2} g e^{2y}.
$$

The drift is therefore

$$
a(y)
=
\left(
\theta - \frac{\sigma^2}{2}
\right)
-
\kappa e^{y}
-
g e^{2y}.
$$

Thus,

$$
dy_t = a(y_t)\, dt + \sigma\, dW_t.
$$

---

#### 2.1.2 Discretization (1-hour grid)

Using Euler discretization:

$$
y_{t+\Delta t}
=
y_t
+
a(y_t)\,\Delta t
+
\sigma \sqrt{\Delta t}\,\varepsilon_t,
\qquad
\varepsilon_t \sim N(0,1).
$$

So,

$$
y_{t+\Delta t} \mid y_t
\sim
\mathcal{N}(m_t, v_t),
$$

with

$$
m_t = y_t + a(y_t)\,\Delta t,
\qquad
v_t = \sigma^2 \Delta t.
$$

---

#### 2.1.3 Negative log-likelihood

Given observations

$$
y_0,\; y_1,\;\dots,\; y_T,
$$

the negative log-likelihood (up to a constant) is

$$
\mathrm{NLL}(\Theta)
=
\frac{1}{2}
\sum_{t=0}^{T-1}
\left[
\log v_t(\Theta)
+
\frac{
(y_{t+1} - m_t(\Theta))^2
}{
v_t(\Theta)
}
\right],
$$

where

$$
\Theta = (\theta, \kappa, g, \sigma).
$$

The calibrated parameters satisfy

$$
\hat{\Theta}
=
\arg\min_{\Theta}
\mathrm{NLL}(\Theta).
$$
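
The Euler likelihood above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference calibration: the helper names (`qed_drift`, `qed_nll`, `simulate_qed`) and all parameter values are made up, time is measured in days, and prices are assumed to be normalised so that $y$ stays near 0 (with raw BTC prices, $e^{2y}$ is of order $10^9$ and $\kappa$, $g$ become badly scaled).

```python
import math
import random

def qed_drift(y, theta, kappa, g, sigma):
    """Log-price drift a(y) = (theta - sigma^2/2) - kappa*e^y - g*e^{2y}."""
    return (theta - 0.5 * sigma ** 2) - kappa * math.exp(y) - g * math.exp(2.0 * y)

def qed_nll(params, ys, dt):
    """Euler (Gaussian) negative log-likelihood, up to an additive constant."""
    theta, kappa, g, sigma = params
    v = sigma ** 2 * dt                       # conditional variance v_t
    nll = 0.0
    for y_now, y_next in zip(ys[:-1], ys[1:]):
        m = y_now + qed_drift(y_now, theta, kappa, g, sigma) * dt   # conditional mean m_t
        nll += 0.5 * (math.log(v) + (y_next - m) ** 2 / v)
    return nll

def simulate_qed(params, y0, dt, n_steps, rng):
    """Euler path of the log-price, used here only to generate test data."""
    theta, kappa, g, sigma = params
    ys = [y0]
    for _ in range(n_steps):
        y = ys[-1]
        ys.append(y + qed_drift(y, theta, kappa, g, sigma) * dt
                  + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return ys

# Illustrative values only (assumption), with dt equal to one hour expressed in days.
true_params = (0.5, 0.3, 0.1, 0.2)
dt = 1.0 / 24.0
ys = simulate_qed(true_params, y0=0.0, dt=dt, n_steps=2000, rng=random.Random(1))

print("NLL at true parameters:", qed_nll(true_params, ys, dt))
print("NLL with sigma doubled:", qed_nll((0.5, 0.3, 0.1, 0.4), ys, dt))
```

In practice `qed_nll` would be handed to a numerical optimiser (for example `scipy.optimize.minimize`), typically optimising $\log\sigma$ instead of $\sigma$ to keep the volatility positive.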

---

#### **2.2 Jump Detection on 5-Minute Data**

We now move to the finer grid:

$$
t_n = n \Delta t,\qquad
\Delta t = 5\text{ minutes}.
$$

Let:

$$
y_n = \log S_{t_n}, \qquad
r_n = y_{n+1} - y_n.
$$

We classify large moves as jumps using a volatility threshold.

Let $\hat{\sigma}_{5m}$ be a robust estimate of the 5-minute return volatility (e.g. MAD-based or a rolling standard deviation).
Choose a threshold multiplier $\alpha_{\text{jump}}$ (e.g. 3–5).

Define:

* **Up-jump** if
  $$ r_n > \alpha_{\text{jump}}\,\hat{\sigma}_{5m}. $$
* **Down-jump** if
  $$ r_n < -\alpha_{\text{jump}}\,\hat{\sigma}_{5m}. $$
* Otherwise no jump.

Define indicators:
$$
N_n^+ = \begin{cases}
1, & r_n > \alpha_{\text{jump}}\,\hat{\sigma}_{5m}, \\
0, & \text{otherwise},
\end{cases}
\qquad
N_n^- = \begin{cases}
1, & r_n < -\alpha_{\text{jump}}\,\hat{\sigma}_{5m}, \\
0, & \text{otherwise}.
\end{cases}
$$


Jump sizes:

* Up-jump size:
  $$ J_n^+ = r_n \quad \text{if } N_n^+=1. $$

* Down-jump size:
  $$ J_n^- = -r_n \quad \text{if } N_n^-=1. $$

We fit distributions $F^+$ and $F^-$ to these empirical jump sizes
(parametric or empirical resampling).
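
A minimal sketch of this detection rule, assuming a single global MAD-based volatility estimate (a rolling estimate would follow the same pattern). The names `mad_sigma` and `detect_jumps` and the toy returns are illustrative only.

```python
import statistics

def mad_sigma(returns):
    """Robust scale: 1.4826 * median absolute deviation (matches the std for Gaussian data)."""
    med = statistics.median(returns)
    return 1.4826 * statistics.median([abs(r - med) for r in returns])

def detect_jumps(returns, alpha_jump=4.0):
    """Return indicator lists (N_plus, N_minus) and jump-size lists (J_plus, J_minus)."""
    thr = alpha_jump * mad_sigma(returns)
    n_plus = [1 if r > thr else 0 for r in returns]
    n_minus = [1 if r < -thr else 0 for r in returns]
    j_plus = [r for r in returns if r > thr]        # up-jump sizes J^+ = r_n
    j_minus = [-r for r in returns if r < -thr]     # down-jump sizes J^- = -r_n (positive)
    return n_plus, n_minus, j_plus, j_minus

# Toy returns (made up): mostly small moves plus one large move in each direction.
r = [0.001, -0.001, 0.002, -0.002, 0.0005, 0.05, -0.001, -0.04, 0.001, 0.0]
n_plus, n_minus, j_plus, j_minus = detect_jumps(r, alpha_jump=4.0)
print(n_plus, n_minus, j_plus, j_minus)
```

The lists `j_plus` and `j_minus` are the samples to which $F^+$ and $F^-$ would be fitted, or from which jump sizes would be resampled.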

---

#### **2.3 Hawkes Jump Intensity Model (MLE on 5-min Jumps)**

We model **self-exciting jump arrival rates**:

### **Intensity processes**

For up-jumps:

$$
\lambda_n^+
= \lambda_0^+
+ \alpha^+ \sum_{m<n} e^{-\beta (t_n - t_m)} N_m^+.
$$

For down-jumps:

$$
\lambda_n^-
= \lambda_0^-
+ \alpha^- \sum_{m<n} e^{-\beta (t_n - t_m)} N_m^-.
$$

Stability condition:

$$
\nu^\pm = \frac{\alpha^\pm}{\beta} < 1.
$$

### **Discrete-time Hawkes likelihood**

We approximate 5-minute arrivals as Bernoulli with probability
$p_n^\pm = \lambda_n^\pm \Delta t$.

For small $p_n^\pm$, using $\log(1 - p) \approx -p$ in the Bernoulli log-likelihood gives, to first order, the log-likelihood for up-jumps:

$$
\ell^+
=
\sum_n
\Big[
N_n^+ \log(\lambda_n^+ \Delta t)
-
\lambda_n^+ \Delta t
\Big].
$$

Similarly for down-jumps, $\ell^-$.

Total Hawkes likelihood:

$$
\ell_{\text{Hawkes}}
=
\ell^+ + \ell^-.
$$

We maximize over:

$$
(\lambda_0^+,\lambda_0^-,\alpha^+,\alpha^-,\beta).
$$

This yields:

* baselines $\hat{\lambda}_0^\pm$
* self-excitation strengths $\hat{\alpha}^\pm$
* decay rate $\hat{\beta}$
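
Because the kernel is exponential, the sum over past events can be updated recursively instead of being recomputed at every step. The sketch below illustrates this for one jump direction; `hawkes_intensities` and `hawkes_loglik` are hypothetical helper names and the parameter values are arbitrary.

```python
import math

def hawkes_intensities(n_events, lam0, alpha, beta, dt):
    """lambda_n = lam0 + alpha * sum_{m<n} exp(-beta (t_n - t_m)) N_m, computed recursively."""
    decay = math.exp(-beta * dt)
    lams, excite = [], 0.0            # excite holds the exponentially weighted event sum
    for n_prev in [0] + list(n_events[:-1]):
        excite = decay * (excite + n_prev)
        lams.append(lam0 + alpha * excite)
    return lams

def hawkes_loglik(n_events, lam0, alpha, beta, dt):
    """Discrete-time log-likelihood sum_n [N_n log(lambda_n dt) - lambda_n dt]."""
    lams = hawkes_intensities(n_events, lam0, alpha, beta, dt)
    return sum(n * math.log(lam * dt) - lam * dt for n, lam in zip(n_events, lams))

events = [1, 0, 0, 1, 1, 0, 0, 0]     # toy indicator series N_n
print(hawkes_intensities([1, 0, 0], lam0=0.1, alpha=1.0, beta=1.0, dt=1.0))
print(hawkes_loglik(events, lam0=0.05, alpha=0.3, beta=0.5, dt=1.0))
```

The total likelihood is the sum of `hawkes_loglik` over the up and down series with a shared `beta`; maximising it under the constraints $\lambda_0^\pm > 0$, $\alpha^\pm \ge 0$, $\beta > 0$ and $\alpha^\pm/\beta < 1$ gives the calibrated parameters.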

---

#### **2.4 Full BTC Simulator (QED + Hawkes Jumps)**

We now combine:

* **QED diffusion** (coarse-scale calibrated)
* **Hawkes jumps** (fine-scale calibrated)

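The simulator runs on the **5-minute grid** with log-price $y_n$. Here $\Delta t$ is the 5-minute step expressed in the same time unit that was used when calibrating $(\theta, \kappa, g, \sigma)$ on the 1-hour grid (e.g. $\Delta t = 1/12$ if time is measured in hours):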

#### **Full dynamics**

$$
y_{n+1}
= y_n
+ \Big( \frac{\mu(e^{y_n})}{e^{y_n}} - \tfrac{1}{2}\sigma^2 \Big)\,\Delta t
+ \sigma \sqrt{\Delta t}\,\varepsilon_n
+ N_n^+ J_n^+
- N_n^- J_n^-.
$$

where:

* $\mu(x) = \theta x - \kappa x^2 - g x^3$ (QED drift)
* $\varepsilon_n \sim N(0,1)$
* $N_n^\pm$ are Bernoulli with probabilities $\lambda_n^\pm \Delta t$
* $J_n^\pm$ sampled from fitted jump distributions
* intensities evolve via calibrated Hawkes model:

$$
\lambda_{n+1}^+
= \lambda_0^+
+ \alpha^+\sum_{m\le n} e^{-\beta (t_{n+1} - t_m)} N_m^+,
$$

$$
\lambda_{n+1}^-
= \lambda_0^-
+ \alpha^-\sum_{m\le n} e^{-\beta (t_{n+1} - t_m)} N_m^-.
$$

Simulation outputs $S_{t_n} = \exp(y_n)$.

This model naturally exhibits:

* calm regimes (QED diffusion only)
* event regimes (Hawkes feedback → jump clustering)
* reversals (jump-down then jump-up)
* volatility bursts
* heavy tails
* crash–recovery patterns typical of BTC
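
The full recursion can be sketched as follows. This is an illustration under assumptions, not a calibrated simulator: jump sizes are drawn from exponential distributions as a stand-in for the fitted $F^\pm$, jump probabilities are capped at 1, time is measured in days, the price is normalised to start at 1, and every numerical value is a placeholder.

```python
import math
import random

def simulate_path(y0, n_steps, dt, qed, hawkes, mean_jump, rng):
    """Simulate log-prices on the fine grid and return the prices S_n = exp(y_n).

    qed       = (theta, kappa, g, sigma)
    hawkes    = (lam0_up, lam0_dn, alpha_up, alpha_dn, beta)
    mean_jump = (mean up-jump size, mean down-jump size), exponential sizes (assumption)
    """
    theta, kappa, g, sigma = qed
    lam0_up, lam0_dn, a_up, a_dn, beta = hawkes
    decay = math.exp(-beta * dt)
    y, ex_up, ex_dn = y0, 0.0, 0.0
    prices = [math.exp(y0)]
    for _ in range(n_steps):
        lam_up = lam0_up + a_up * ex_up
        lam_dn = lam0_dn + a_dn * ex_dn
        # Bernoulli arrivals with probability lambda * dt, capped at 1.
        n_up = 1 if rng.random() < min(1.0, lam_up * dt) else 0
        n_dn = 1 if rng.random() < min(1.0, lam_dn * dt) else 0
        j_up = rng.expovariate(1.0 / mean_jump[0]) if n_up else 0.0
        j_dn = rng.expovariate(1.0 / mean_jump[1]) if n_dn else 0.0
        drift = theta - kappa * math.exp(y) - g * math.exp(2.0 * y) - 0.5 * sigma ** 2
        y += drift * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0) + j_up - j_dn
        # Hawkes recursion: decay past excitation and add this step's events.
        ex_up = decay * (ex_up + n_up)
        ex_dn = decay * (ex_dn + n_dn)
        prices.append(math.exp(y))
    return prices

dt = 5.0 / (60.0 * 24.0)                  # five minutes, in days
qed = (0.5, 0.3, 0.1, 0.03)               # placeholder QED parameters
hawkes = (2.0, 2.0, 20.0, 20.0, 50.0)     # placeholder rates per day, alpha / beta = 0.4
path = simulate_path(0.0, 288, dt, qed, hawkes, (0.01, 0.01), random.Random(42))
print(len(path), min(path), max(path))
```

Passing a seeded `random.Random` instance keeps each path reproducible, which is convenient when the same paths are reused for the baseline and the RL evaluation.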

---

#### **2.5 Tasks**
You are not expected to obtain a ‘perfect’ MLE fit. Reasonable, stable parameters that reproduce key stylized facts are sufficient.
- QED MLE: estimate $(\hat{\theta},\hat{\kappa},\hat{g},\hat{\sigma})$.
- Jump detection: flag up- and down-jumps and fit the jump-size distributions $F^\pm$.
- Hawkes calibration: estimate $(\hat{\lambda}_0^+,\hat{\lambda}_0^-,\hat{\alpha}^+,\hat{\alpha}^-,\hat{\beta})$.
- Simulate paths with the full simulator.
- Share your findings on how this simulator could be improved.

---


### **3. Option Market Layer**

#### **3.1 Setup**

European options are written on the simulated BTC price, with implied volatility
$\sigma_{\text{IV}}(t, S_t, K, T)$ depending on:

- Maturity: $\tau = T - t$
- Moneyness: $K / S_t$
- Local realised vol: $\hat{\sigma}_{\text{loc}}(t)$

Your task is to construct this IV surface using these inputs.

---

#### **3.2 Contract Universe**
For simplicity, options are grouped into buckets by moneyness and time-to-maturity. Existing positions are marked to model prices each step and can be reduced/closed at those prices. You do not need to simulate the full listing mechanics of new strikes; just work with this fixed (moneyness, maturity) universe.
- Underlying: $S_t$
- Strikes: $K \in \{0.9 S_0, S_0, 1.1 S_0\}$
- Maturities: $T \in \{1\text{d}, 7\text{d}\}$
- Time to maturity: $\tau = T - t$

---

#### **3.3 Black–Scholes Pricing**

Let

$$
\sigma^\star = \sigma_{\text{IV}}(t, S_t, K, T).
$$

Then

$$
d_1
= \frac{\ln(S_t / K) + \tfrac{1}{2} (\sigma^\star)^2 \tau}
{\sigma^\star \sqrt{\tau}},
\qquad
d_2 = d_1 - \sigma^\star \sqrt{\tau},
$$

and call / put prices are

$$
C_t = S_t N(d_1) - K N(d_2),
\qquad
P_t = K N(-d_2) - S_t N(-d_1).
$$

Here $N(\cdot)$ is the standard normal CDF. These formulas assume a zero interest rate.
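
A minimal sketch of these pricing formulas, using `math.erf` for the normal CDF. No interest-rate term appears, matching the expressions for $d_1$ and $d_2$. The function names and the numerical inputs (spot 60000, 60% annualised IV, $\tau$ in years) are assumptions for illustration.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_price(s, k, tau, iv, kind="call"):
    """Zero-rate Black-Scholes price; falls back to intrinsic value at expiry."""
    if tau <= 0.0 or iv <= 0.0:
        return max(s - k, 0.0) if kind == "call" else max(k - s, 0.0)
    d1 = (math.log(s / k) + 0.5 * iv ** 2 * tau) / (iv * math.sqrt(tau))
    d2 = d1 - iv * math.sqrt(tau)
    if kind == "call":
        return s * norm_cdf(d1) - k * norm_cdf(d2)
    return k * norm_cdf(-d2) - s * norm_cdf(-d1)

# Illustrative inputs (assumptions): 7-day ATM option, 60% annualised IV.
s0, tau, iv = 60000.0, 7.0 / 365.0, 0.60
call = bs_price(s0, s0, tau, iv, "call")
put = bs_price(s0, s0, tau, iv, "put")
print(call, put)
```

With zero rates, put-call parity reduces to $C_t - P_t = S_t - K$, which is a convenient sanity check for any implementation.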

---

#### **3.4 Tasks**

* **Choose and justify an IV surface** 
* **Implement** local vol estimation, IV surface, and pricing functions.
* **Examine** IV and price behaviour in calm vs. volatile regimes.
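
To make the inputs concrete, here is one possible shape for a local volatility estimator and a toy IV surface. It is purely illustrative and not a recommended answer: the functional form (maturity blend between local and long-run vol, quadratic smile in log-moneyness) and every default value are assumptions you would need to replace and justify.

```python
import math

def realised_vol(returns, window, steps_per_year):
    """Annualised realised vol from the last `window` log-returns (zero-mean estimator)."""
    recent = returns[-window:]
    var_step = sum(r * r for r in recent) / len(recent)
    return math.sqrt(var_step * steps_per_year)

def iv_surface(sigma_loc, moneyness, tau_years,
               sigma_long=0.6, tau_blend=30.0 / 365.0, smile=2.0, floor=0.05):
    """Toy IV surface: maturity blend of local and long-run vol, times a quadratic smile."""
    w = math.exp(-tau_years / tau_blend)          # short maturities track local vol
    base = w * sigma_loc + (1.0 - w) * sigma_long
    return max(floor, base * (1.0 + smile * math.log(moneyness) ** 2))

steps_per_year = 12 * 24 * 365                    # number of 5-minute steps in a year
sigma_loc = realised_vol([0.002, -0.001, 0.003, -0.002], 4, steps_per_year)
for m in (0.9, 1.0, 1.1):
    print(m, iv_surface(sigma_loc, m, 1.0 / 365.0), iv_surface(sigma_loc, m, 7.0 / 365.0))
```

Comparing the printed values for a calm and a volatile `sigma_loc` is a quick way to start the calm versus volatile regime examination.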

---

### **4. Market Making Strategy on BTC Perpetuals**

#### **4.1 State Variables and Risk Limits**

* At time $t_n$: price $S_n$, inventory $I_n$, cash $\text{Cash}_n$.
* Equity:
  $$
  \Pi_n = \text{Cash}_n + I_n S_n.
  $$
* Risk limits:
  $$
  |I_n| \le I_{\max}
  \quad \text{and} \quad
  \Pi_n \ge \Pi_{\min}.
  $$

---

#### **4.2 Base Quotes**

* Half-spread $s_0$:
  $$
  P_n^{\text{bid}} = S_n (1 - s_0),
  \qquad
  P_n^{\text{ask}} = S_n (1 + s_0).
  $$
* Base order size: $q_0$.

---

#### **4.3 Inventory Management**

* **Normalised inventory:**
  $$
  \phi_n = \frac{I_n}{I_{\max}}.
  $$

* **Skewed quotes** (both quotes shift down when long and up when short, so that fills tend to reduce inventory):
  $$
  P_n^{\text{bid}} = S_n \big( 1 - s_0 - k_s \phi_n \big),
  \qquad
  P_n^{\text{ask}} = S_n \big( 1 + s_0 - k_s \phi_n \big).
  $$

* **Size control:**
  $$
  q_n = q_0 \cdot \max\big(0,\; 1 - |\phi_n|\big).
  $$

* **Hard kill-switch:**
  When $|I_n| \ge I_{\max}$, the strategy only quotes in the direction that **reduces inventory**.
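
The inventory rules can be collected into a single quoting function, sketched below. Two choices here are assumptions of the sketch and not part of the specification: both quotes are shifted by $-k_s \phi_n$ (so a long inventory lowers bid and ask together), and the kill-switch quotes the base size $q_0$ on the inventory-reducing side.

```python
def mm_quotes(s, inventory, i_max, s0, k_s, q0):
    """Return (bid, ask, bid_size, ask_size) for the skewed market-making rule."""
    phi = inventory / i_max                        # normalised inventory
    bid = s * (1.0 - s0 - k_s * phi)
    ask = s * (1.0 + s0 - k_s * phi)               # assumption: same shift on both sides
    size = q0 * max(0.0, 1.0 - abs(phi))           # size control
    bid_size, ask_size = size, size
    if inventory >= i_max:                         # kill-switch at the long limit: only sell
        bid_size, ask_size = 0.0, q0
    elif inventory <= -i_max:                      # kill-switch at the short limit: only buy
        bid_size, ask_size = q0, 0.0
    return bid, ask, bid_size, ask_size

# Long 5 BTC out of a 10 BTC limit, 10 bp half-spread, 20 bp skew coefficient (made-up values).
print(mm_quotes(100.0, 5.0, 10.0, 0.001, 0.002, 1.0))
print(mm_quotes(100.0, 10.0, 10.0, 0.001, 0.002, 1.0))
```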

---

#### **4.4 Fill and Equity Update**

* **Fill model (5-minute candles).**

  At each step $n \to n+1$, let $H_{n+1}$ and $L_{n+1}$ be the **high** and **low** of the $(n+1)$-th 5-minute candle of the BTC price.

  - A bid order at price $P_n^{\text{bid}}$ with size $q_n$ is **fully filled** if
    $$L_{n+1} \le P_n^{\text{bid}} \le H_{n+1}.$$
  - An ask order at price $P_n^{\text{ask}}$ with size $q_n$ is **fully filled** if
    $$L_{n+1} \le P_n^{\text{ask}} \le H_{n+1}.$$




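* After fills, a filled bid adds $q_n$ to inventory and subtracts $q_n P_n^{\text{bid}}$ from cash; a filled ask subtracts $q_n$ from inventory and adds $q_n P_n^{\text{ask}}$ to cash. Equity is then:
  $$
  \Pi_{n+1} = \text{Cash}_{n+1} + I_{n+1} S_{n+1}.
  $$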

* Risk breach if
  $$
  |I_{n+1}| > I_{\max}
  \quad \text{or} \quad
  \Pi_{n+1} < \Pi_{\min}.
  $$
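
A sketch of the fill and bookkeeping step follows. The cash and inventory updates (buy at the bid, sell at the ask) are the natural reading of the fill rule but are stated here as an assumption, as are the function names. Note that a simulator producing only 5-minute closes has no high or low; approximating them by the maximum and minimum of consecutive closes is one possible workaround.

```python
def apply_fills(cash, inventory, bid, ask, bid_size, ask_size, low, high):
    """Update (cash, inventory) given the next candle's low/high and the resting quotes."""
    if bid_size > 0.0 and low <= bid <= high:      # bid filled: buy bid_size at the bid
        cash -= bid_size * bid
        inventory += bid_size
    if ask_size > 0.0 and low <= ask <= high:      # ask filled: sell ask_size at the ask
        cash += ask_size * ask
        inventory -= ask_size
    return cash, inventory

def risk_breach(cash, inventory, s_next, i_max, pi_min):
    """True if the inventory limit or the equity floor is violated after the step."""
    equity = cash + inventory * s_next
    return abs(inventory) > i_max or equity < pi_min

# Both quotes lie inside the candle [99.5, 100.2], so the strategy earns the spread.
cash, inv = apply_fills(1000.0, 0.0, 99.9, 100.1, 1.0, 1.0, low=99.5, high=100.2)
print(cash, inv, risk_breach(cash, inv, 100.0, i_max=10.0, pi_min=900.0))
```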

---

#### **4.5 Tasks: Baseline MM Simulation**

* Implement the market-making strategy on your BTC price simulator.
* Run Monte Carlo simulations over a fixed horizon of 14 days.
* Inspect and comment on: **sample equity paths and inventory behaviour**.

---


### **5. RL-Based Hedging — MDP Formulation**

#### **5.1 RL Objective**

* Maximise long-run, post-cost PnL quality while controlling risk.
* Agent interacts with: BTC simulator (QED + Hawkes), option IV surface, fixed MM strategy.

We model the hedging problem as a finite-horizon MDP:

$$
\mathcal{M} = (S, A, P, R, \gamma),
$$

where the agent observes a state $s_n$, chooses an action $a_n$ (option trade), transitions under the simulator, and receives a reward $r_n$.

---

#### **5.2 Hedge Universe: Strikes and Maturities**

The RL agent trades only within a **small, liquid universe** of BTC options:

* **Moneyness / strikes**:
  $$
  K \in \{0.9 S_0,\; 1.0 S_0,\; 1.1 S_0\},
  $$
  i.e. strikes 10% below, at, and 10% above the initial price $S_0$ (at inception, the $0.9 S_0$ put and the $1.1 S_0$ call are 10% OTM).

* **Maturities**:
  $$
  T \in \{1\text{d},\; 7\text{d}\}.
  $$

* **Types**:

  * Calls and puts on the BTC simulator price $S_n$.

This gives a natural universe of up to:

* $3$ strikes $\times$ $2$ maturities $\times$ $2$ types (call/put)
  $= 12$ distinct option contracts.

You may restrict to a smaller subset (e.g. ATM options only) for computational reasons, but the default assumption is that the agent **has access to all** of these contracts.

---

#### **5.3 Portfolio, Delta and Vega**

Let:

* $I_n$: MM inventory in the BTC perpetual at time $n$.
* $Q_n^{(i)}$: position (in lots) in option contract $i$ at time $n$
  (e.g. “number of contracts” or a normalised lot size).
* $P_n^{(i)}$: price of option $i$ at time $n$.

The **total equity** (MM + options) is:

$$
\Pi_n^{\text{total}}
= \Pi_n^{\text{MM}} + \sum_i Q_n^{(i)} P_n^{(i)},
$$

where $\Pi_n^{\text{MM}}$ is the equity of the MM engine alone.

We define:

* $\Delta^{\text{MM}}_n$: delta of MM position (essentially $I_n$ if perp is 1:1 delta).
* $\Delta^{(i)}_n$: delta of option $i$ at time $n$.
* $\Delta^{\text{opt}}_n = \sum_i Q_n^{(i)} \Delta^{(i)}_n$: aggregate option delta.
* $\Delta^{\text{port}}_n = \Delta^{\text{MM}}_n + \Delta^{\text{opt}}_n$: net portfolio delta.

Similarly for vega:

* $V^{(i)}_n$: vega of option $i$ at time $n$.
* $V^{\text{opt}}_n = \sum_i Q_n^{(i)} V^{(i)}_n$.
* $V^{\text{port}}_n = V^{\text{opt}}_n$ (perpetual has negligible vega).

These quantities are part of the **risk state** the RL agent must learn to control.
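
The aggregation above can be sketched with zero-rate Black-Scholes Greeks (call delta $N(d_1)$, put delta $N(d_1) - 1$, vega $S\,\varphi(d_1)\sqrt{\tau}$ with $\varphi$ the standard normal density). The function names and the tuple layout of `positions` are assumptions made for this example.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def bs_delta_vega(s, k, tau, iv, kind):
    """Zero-rate Black-Scholes delta and vega (per unit change in IV), valid for tau > 0."""
    d1 = (math.log(s / k) + 0.5 * iv ** 2 * tau) / (iv * math.sqrt(tau))
    delta = norm_cdf(d1) if kind == "call" else norm_cdf(d1) - 1.0
    vega = s * norm_pdf(d1) * math.sqrt(tau)
    return delta, vega

def portfolio_greeks(inventory, positions, s):
    """positions: list of (quantity, strike, tau, iv, kind). Returns (net delta, net vega)."""
    delta, vega = float(inventory), 0.0            # perp delta taken as 1 per BTC
    for q, k, tau, iv, kind in positions:
        d, v = bs_delta_vega(s, k, tau, iv, kind)
        delta += q * d
        vega += q * v
    return delta, vega

# Long 1 BTC of perp inventory hedged with 2 ATM 7-day puts (illustrative numbers).
s = 60000.0
positions = [(2.0, 60000.0, 7.0 / 365.0, 0.6, "put")]
print(portfolio_greeks(1.0, positions, s))
```

In this example the two ATM puts offset almost all of the inventory delta while adding positive vega, which is the kind of trade-off the agent has to weigh against option costs.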

---

#### **5.4 State Representation**

At each hedge decision time $n$ (e.g. every few 5-minute steps), the agent observes a state vector $s_n$.
A reasonable baseline state includes:

$$
s_n = (
S_n,\;
I_n,\;
\Pi_n^{\text{total}},\;
\hat{\sigma}_{\text{loc}}(n),\;
\Delta^{\text{port}}_n,\;
V^{\text{port}}_n,\;
\text{TTM features},\;
\text{moneyness features},\;
\Delta S_n,\;
\Delta V_n
).
$$

Where:

* $S_n$: BTC price.
* $I_n$: MM inventory (BTC).
* $\Pi_n^{\text{total}}$: current equity of MM + options.
* $\hat{\sigma}_{\text{loc}}(n)$: local realised volatility from Section 3.
* $\Delta^{\text{port}}_n$: net portfolio delta.
* $V^{\text{port}}_n$: net portfolio vega.
* TTM features: time to maturity of relevant contracts (e.g. normalised).
* Moneyness features: e.g. $\log(K/S_n)$ for representative strikes.
* $\Delta S_n$: recent price change(s).
* $\Delta V_n$: recent changes in option prices or IV.

You may add/remove features (e.g. regime indicators, jump flags, realised variance windows), as long as you justify your design choices.

---

#### **5.5 Action Space: Discrete Option Trades**

At each decision time, the agent chooses **one discrete action** from a finite set $A$.

The natural design, given our universe, is:

* **No trade:**

  * Do nothing this step.

* **Option trades:**

  * For each option contract $i$ in the universe
    (strike $K \in \{0.9 S_0, 1.0 S_0, 1.1 S_0\}$,
    maturity $T \in \{1\text{d}, 7\text{d}\}$,
    call or put), define:

    * “buy 1 lot of option $i$”,
    * “sell 1 lot of option $i$”.

Let $q_{\text{opt}}$ be the **fixed lot size** per trade.
Then a “buy” action increases $Q_n^{(i)}$ by $q_{\text{opt}}$,
and a “sell” action decreases $Q_n^{(i)}$ by $q_{\text{opt}}$.

You may optionally restrict the action space to a smaller subset of contracts
(e.g. ATM 1d and ATM 7d only) if needed.
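
With the full universe this gives $1 + 2 \times 12 = 25$ discrete actions. One possible encoding is sketched below; the index layout (0 for no trade, then buy/sell pairs per contract) and the names `CONTRACTS`, `decode_action` and `apply_action` are arbitrary choices made for illustration.

```python
from itertools import product

STRIKE_MULTS = (0.9, 1.0, 1.1)      # strikes as multiples of S_0
MATURITIES_D = (1, 7)               # maturities in days
KINDS = ("call", "put")
CONTRACTS = list(product(STRIKE_MULTS, MATURITIES_D, KINDS))   # 12 contracts

def decode_action(a):
    """Map a in {0, ..., 24} to (contract index or None, signed number of lots)."""
    if a == 0:
        return None, 0              # no trade
    idx, side = divmod(a - 1, 2)
    return idx, (1 if side == 0 else -1)

def apply_action(positions, a, q_opt):
    """Return the new position vector Q after taking action a with lot size q_opt."""
    new_positions = list(positions)
    idx, sign = decode_action(a)
    if idx is not None:
        new_positions[idx] += sign * q_opt
    return new_positions

n_actions = 1 + 2 * len(CONTRACTS)
print(n_actions, CONTRACTS[1], decode_action(3))
print(apply_action([0.0] * 12, 3, q_opt=0.5))
```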

---

#### **5.6 Transaction Costs (Size-Aware)**

Each option trade incurs a transaction cost **proportional to notional**, measured here as traded quantity times option price.
If at time $n$ we execute trades $\Delta Q_n^{(i)}$ in each contract $i$, then:

$$
TC_n = c_{\text{opt}} \sum_i \left| \Delta Q_n^{(i)} P_n^{(i)} \right|,
$$

where $c_{\text{opt}}$ is a cost rate (e.g. $0.0005$ for $0.05\%$).

This cost term is **size-aware**:

* larger lots or more expensive options
  $\Rightarrow$ larger notional
  $\Rightarrow$ larger $TC_n$ penalty.
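
As a worked example of the cost formula, with made-up trades and prices:

```python
def transaction_cost(trades, prices, c_opt=0.0005):
    """TC_n = c_opt * sum_i |dQ_i * P_i| for per-contract trades dQ_i and option prices P_i."""
    return c_opt * sum(abs(dq * p) for dq, p in zip(trades, prices))

# Buy 1 lot at 2000, sell 2 lots at 500, no trade in the third contract.
tc = transaction_cost([1.0, -2.0, 0.0], [2000.0, 500.0, 100.0])
print(tc)
```

Here the traded value is $2000 + 1000 = 3000$, so the cost at $0.05\%$ is $1.5$; buys and sells are charged symmetrically because of the absolute value.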

---

#### **5.7 Reward design**

Design and justify your own RL reward function.
The RL agent should not be forced to keep the book exactly delta- and vega-neutral.
We want to **allow under-hedging / over-hedging** as long as the **overall portfolio is profitable** and **tail risk is controlled**.

Your reward design must satisfy the following principles:

* **Profit focus.**
  The main positive signal should be **realised PnL**, net of transaction costs for hedging trades.
  Make clear what PnL you are using (per-step or cumulative increment).

* **Cost awareness.**
  Transaction costs for option hedges must enter the reward with the correct sign
  (higher costs should reduce reward).

* **Risk exposure control, but not hard neutrality.**
  You may expose the book to delta and vega risk, and the agent is allowed to under-hedge or over-hedge.
  However, your reward should **discourage extremely large risk exposures**
  (for example via soft penalties once $|\Delta^{\text{port}}_n|$ or $|V^{\text{port}}_n|$ exceed some comfort band).

* **Tail-risk awareness.**
  Include at least one component that penalises **bad tail outcomes over the whole episode**,
  such as large final loss, large drawdown, or a risk measure like downside variance or CVaR.
  This should make “rare but very large losses” unattractive even if average PnL is high.

* **No trivial solutions.**
  Check that your reward does **not** make degenerate policies obviously optimal
  (e.g. “never hedge” or “always fully hedge to zero risk” regardless of market conditions).

What you need to hand in:

* A **mathematical expression** of your reward (per-step and/or terminal), with all symbols defined.
* A short **written justification** explaining:

  * how your reward trades off profit vs risk and transaction costs;
  * why it allows meaningful under-hedging / over-hedging;
  * why it is suitable for controlling tail risk in this assignment.
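
To illustrate what a soft penalty outside a comfort band and a terminal tail term can look like in code, here is one possible shape. It is only an example of the mechanics, not a suggested answer: the functional forms, the band widths and all weights are arbitrary placeholders that you would need to design and justify yourself.

```python
def band_excess(x, band):
    """How far |x| lies outside the comfort band [-band, band] (0 inside the band)."""
    return max(0.0, abs(x) - band)

def step_reward(d_equity, tc, delta, vega,
                delta_band=1.0, vega_band=100.0, w_delta=5.0, w_vega=0.01):
    """Equity change before costs, minus costs, minus quadratic penalties outside the bands."""
    penalty = (w_delta * band_excess(delta, delta_band) ** 2
               + w_vega * band_excess(vega, vega_band) ** 2)
    return d_equity - tc - penalty

def terminal_penalty(final_pnl, loss_tolerance=0.0, w_tail=2.0):
    """Extra end-of-episode penalty for losses beyond a tolerance (a crude tail-risk term)."""
    return w_tail * max(0.0, -final_pnl - loss_tolerance)

print(step_reward(10.0, 1.0, delta=0.5, vega=0.0))    # inside both bands
print(step_reward(10.0, 1.0, delta=3.0, vega=0.0))    # delta outside its band
print(terminal_penalty(-50.0, loss_tolerance=20.0))
```

Because the penalty is zero inside the band, the agent is free to under-hedge or over-hedge moderately; only large exposures and large terminal losses are discouraged.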

---

#### **5.8 Training Protocol**

A typical RL training setup:

1. **Episodes:**

   * Each episode simulates a BTC path using your QED + Hawkes model over a fixed horizon of 14 days.
   * Run the fixed MM strategy on the BTC perp throughout the episode.
   * The RL agent trades options at discrete decision times.

2. **Data diversity:**

   * Use different random seeds for the simulator.
   * Generate paths containing calm regimes, trending periods, jump clusters, and crash–rebound events.

3. **RL algorithm:**

   * Use any off-the-shelf algorithm suitable for discrete action spaces (e.g. DQN or an actor-critic method).
   * The policy network takes $s_n$ as input and outputs a distribution over discrete actions.

4. **Evaluation:**

   * Evaluate the trained policy on **unseen** simulated paths.
   * Compare against at least:

     * unhedged MM baseline (no options),
     * a simple rule-based hedging strategy (e.g. buy 1d ATM puts when $I_n$ exceeds a positive threshold, and 1d ATM calls when $I_n$ falls below the corresponding negative threshold).

Key metrics:

* distribution of final PnL (mean, variance, skew, kurtosis),
* downside risk (quantiles, CVaR),
* maximum drawdown,
* total option cost paid,
* distribution of net delta / net vega over time.
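
Two of these metrics are sketched below on toy inputs. The conventions are assumptions of the sketch: CVaR is reported as the mean of the worst `level` fraction of final PnLs (so a negative value is a loss), and drawdown is measured in absolute equity units.

```python
import math

def cvar(pnls, level=0.05):
    """Mean of the worst `level` fraction of outcomes (at least one observation)."""
    k = max(1, int(math.ceil(level * len(pnls))))
    worst = sorted(pnls)[:k]
    return sum(worst) / k

def max_drawdown(equity):
    """Largest peak-to-trough decline along an equity path, in equity units."""
    peak, mdd = equity[0], 0.0
    for x in equity:
        peak = max(peak, x)
        mdd = max(mdd, peak - x)
    return mdd

final_pnls = [float(x) for x in range(-10, 10)]     # toy final PnLs from 20 test paths
print(cvar(final_pnls, level=0.25))
print(max_drawdown([100.0, 110.0, 90.0, 95.0, 120.0, 80.0]))
```

Reporting the same metrics for the unhedged baseline, the rule-based hedge and the RL policy on identical test paths makes the comparison direct.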

---


### **6. Deliverables**

All notebooks and source code **must be submitted in a GitHub repository**.
This ensures transparency, version control, and easy reproduction of results.

---

#### **6.1 Recommended Project Outline**

Each major part of the assignment should correspond to (at least) one notebook:

1. **Data & Stylized Facts Analysis**
2. **BTC Simulator (QED + Hawkes)**
3. **Option Market Layer**
4. **Baseline Market-Making Strategy**
5. **RL Hedging Environment**
6. **RL Training Results**
7. **RL vs Baseline Evaluation**

---

#### **6.2 Recommended Repository Structure**

```text
.
├─ 01_data_and_stylized_facts.ipynb
├─ 02_btc_simulator_qed_hawkes.ipynb
├─ 03_option_market_layer.ipynb
├─ 04_baseline_mm_strategy.ipynb
├─ 05_rl_hedging_environment.ipynb
├─ 06_rl_training_results.ipynb
├─ 07_rl_vs_baseline_evaluation.ipynb
│
├─ src/
│  ├─ simulator.py        # QED + Hawkes simulator
│  ├─ option_surface.py   # IV surface + BS pricing
│  ├─ mm_strategy.py      # Market-making rule implementation
│  ├─ rl_env.py           # RL MDP environment
│  ├─ rl_agent.py         # RL agent (DQN / actor-critic, etc.)
│  └─ utils.py            # Shared utilities
│
├─ data/
│  ├─ btc_perp_5min.csv   # Raw data (or provided sample)
│  └─ sample_paths.npz    # Saved Monte Carlo paths (for fast evaluation)
│
├─ results/
│  ├─ plots/              # Figures generated from notebooks
│  └─ metrics.json        # Summary metrics (Sharpe, CVaR, tails, etc.)
│
├─ requirements.txt       # Python dependencies
└─ README.md              # Setup & run instructions
```

`src/` is recommended but optional. You may keep all code in the notebooks for a quick prototype, but we recommend
refactoring reusable components into `src/` if your project grows.
You may deviate slightly from this structure, but the role of each file/directory should remain clear.

---

#### **6.3 Reproducibility Requirements**

To receive full credit, your repository must satisfy the following:

1. **Fix all random seeds**
   Set seeds for NumPy, PyTorch / TensorFlow and any other libraries used, so that training and evaluation are repeatable.

2. **Include saved Monte Carlo price paths**
   Store a representative set of simulated BTC paths in a file (e.g. `sample_paths.npz` or `.pkl`), so evaluation can be reproduced without re-running long simulations.

3. **Save at least one trained RL agent checkpoint**
   Provide a model file (e.g. `.pt`, `.pth`, `.ckpt`) that can be loaded by `rl_agent.py` or the training notebook.

4. **Evaluate on at least 500 independent test paths**
   Run your final hedge policy on a large out-of-sample test set and save summary statistics (e.g. in `results/metrics.json`).

5. **Main notebooks must be fully reproducible**
   A fresh clone of your repository, plus

   ```bash
   pip install -r requirements.txt
   ```

   and running the notebooks (or a small driver script) following `README.md`
   must regenerate your key plots and evaluation metrics automatically.

> **Grading explicitly includes execution reproducibility.**
> A third party should be able to clone your repo, install dependencies, and reproduce your main results with minimal manual intervention.
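
For the seed requirement, a small helper along the following lines can be called at the top of every notebook and script. The name `set_all_seeds` is hypothetical; NumPy and PyTorch are seeded only if they are installed, and some GPU operations may remain non-deterministic even with fixed seeds.

```python
import random

def set_all_seeds(seed):
    """Seed Python's RNG and, when available, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass                        # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass                        # PyTorch not installed: nothing to seed

set_all_seeds(123)
first = random.random()
set_all_seeds(123)
second = random.random()
print(first == second)
```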
