Mathematical Foundations
Full mathematical appendix for the Alpha Dual Engine v154.6. Every core formula derived from first principles, with worked numerical examples.
This section provides rigorous mathematical interpretations of every core formula used in the Alpha Dual Engine. Each subsection starts from first principles, builds the intuition, walks through the derivation, and ends with a concrete numerical example. The goal is to make every symbol, subscript, and Greek letter fully transparent — no hand-waving allowed.
| # | Section | Formula Name | Key Formulas | What It Answers |
|---|---|---|---|---|
| 0 | Foundational Concepts | Mean Squared Error (Loss) | | What is a loss function? |
| | | Gradient Descent Update | | How does the computer minimize any function? |
| A | XGBoost Ensemble Classifier | Log Loss (Binary Cross-Entropy) | | How does the ML classifier detect market regimes? |
| | | Gradient Boosting | Sequential trees correcting residuals | How does XGBoost build 50 trees into one prediction? |
| B | The Objective Function & SLSQP | Portfolio Objective Function | | How does the optimizer pick portfolio weights? |
| | | SLSQP Quadratic Subproblem | | How does SLSQP approximate the objective at each iteration? |
| | | Lagrangian | | What are Lagrange multipliers and shadow prices? |
| | | Hessian Matrix | | How does the solver know the shape of the bowl? |
| | | Covariance Matrix | | How is portfolio risk measured from historical data? |
| | | Scalar / Vector / Matrix | | What is $\mathbf{w}$? |
| | | Dot Product (Momentum) | | Why does risk need a matrix but momentum does not? |
| C | Shannon Entropy | Shannon Entropy | | How does the system measure diversification? |
| | | Effective N | | What is "Effective N" and why require at least 3? |
| D | Geometric Brownian Motion | GBM Stochastic Differential Equation | | How are future stock prices simulated? |
| | | Discrete Simulation (Ito's Lemma) | | What is the volatility drag correction? |
| E | Proximal Policy Optimization (PPO) | PPO Clipped Surrogate Objective | | How does the RL agent learn without destroying itself? |
| | | Generalized Advantage Estimation | | What is GAE and the actor-critic architecture? |
Math Flow: From Market Data to Portfolio Weights — Visual flowchart of the entire mathematical pipeline
Detailed sub-sections within each part:
Section 0 — Foundational Concepts
- What is a loss function? — Squared error, loss vs reward, the three losses in this project
- What is gradient descent? — The update rule, learning rate, worked numerical example
Section A — XGBoost Ensemble Classifier
- What problem it solves — Regime detection: RISK_ON vs DEFENSIVE
- What is gradient boosting? — Sequential trees correcting errors, log loss
- The 7 input features — VIX, momentum, trend, cross-asset signals
- Monotonic constraints — Enforcing financial logic on the learned function
- The consensus logic — XGBoost + Decision Tree must independently agree
- Walk-forward training — Expanding windows, never testing on training data
Section B — The Objective Function & SLSQP
- What the formula actually says — Breaking down each term (risk, momentum, entropy)
- How SLSQP actually solves it — The "approximate as a parabola and step" method
- What is the Hessian matrix? — Second partial derivatives, curvature, and the BFGS approximation
- How the covariance matrix is computed from data — Concrete worked example (SMH variance, SMH-TLT covariance), formulas, and annualization
- What is w? — scalars, vectors, and matrices — Single numbers vs lists vs tables, the sandwich $w^\top \Sigma w$, the dot product $w \cdot M$, and why risk needs a matrix but momentum does not
- Can the quadratic subproblem be solved by hand? — Layer-by-layer breakdown from 1 variable to 12
- What is a Lagrange multiplier, exactly? — Intuition, worked example, geometric interpretation, shadow prices
- Solving the Linear System: Gaussian Elimination — Forward elimination, back-substitution, worked examples
- The complete picture — Summary table: which layer uses which math
Section C — Shannon Entropy
- The formula — Step-by-step calculation with real numbers
- Why does ln show up? — Intuition: "importance-weighted surprise"
- Concrete examples with real numbers — All-in vs equal-weight vs the actual portfolio
Section D — Geometric Brownian Motion
- The continuous-time SDE — Drift and diffusion decomposition
- Ito's Lemma and the volatility drag correction — Why symmetric gains/losses do not cancel
- The final simulation formula — The discrete-time equation the code actually uses
- A full worked example — One simulated day with real numbers
Section E — Proximal Policy Optimization (PPO)
- Softmax — Formula, worked example, why $e$, key properties
- Step 1: The Policy — Discrete (softmax) vs continuous (Gaussian) action spaces
- Step 2: The Value Function — Shared trunk, actor-critic architecture
- Step 3: Advantage Estimation (GAE) — TD error, bias-variance tradeoff
- Step 4: The Clipped Surrogate Objective — The core PPO innovation
- Step 5: The Full Loss Function — Policy loss + value loss + entropy bonus
- Step 6: The Complete Training Loop — Collect, compute, update cycle
- How the two agents work together — Hierarchical RL: regime agent → weight agent
This flowchart traces the entire mathematical pipeline — from raw prices to final allocation. Appendix section references (A, B, C, D, E) link to the full derivations below.
```mermaid
graph TD
    subgraph DATA ["DATA PREPARATION"]
        RAW["Raw Market Data<br/>12 assets × 15 yrs"]
        RET["Daily Returns<br/>r = Δprice ÷ prev price"]
    end
    RAW --> RET
    RET --> REG{"Regime Decision"}
    REG -->|"default"| REG_C["Rule-Based Classifier<br/>SPY > 200-SMA? →<br/>RISK_ON / REDUCED / DEF"]
    REG -->|"Use RL Agent"| REG_R["RL Regime Agent · Sec E<br/>25 macro feat → regime"]
    REG_C --> WGT{"Weight Decision"}
    REG_R --> WGT
    WGT -->|"default"| OBJ["Objective Function · Sec B/C<br/>risk: w′Σw (from cov matrix)<br/>− momentum: Σmᵢ³wᵢ (from M³)<br/>− entropy: λH(w)"]
    WGT -->|"Hierarchical RL"| W_R["RL Weight Agent · Sec E<br/>103 features → 12 weights"]
    OBJ --> ITER["SLSQP Solver · Sec B<br/>Quad approx → Lagrangian<br/>→ solve → repeat"]
    ITER -->|"converged"| W_C["12 Optimal Weights"]
    ITER -->|"repeat"| OBJ
    W_C --> FINAL["Final Portfolio Weights"]
    W_R --> FINAL
    subgraph DOWN ["DOWNSTREAM EVALUATION"]
        MC["Sec D · Monte Carlo / GBM<br/>1M paths → risk assessment"]
    end
    FINAL --> MC
    style DATA fill:#e3f2fd,stroke:#1565c0,color:#000
    style DOWN fill:#fce4ec,stroke:#c62828,color:#000
```
Before diving into the specific formulas, two ideas underpin everything in this appendix: the loss function (what the computer is trying to achieve) and gradient descent (how it gets there). Every section that follows — SLSQP, Shannon Entropy, GBM, PPO — is built on these two foundations.
A loss function is a single number that measures how wrong the current answer is. The entire purpose of optimization — whether it is portfolio construction, neural network training, or anything else — is to make this number as small as possible.
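The simplest example: Suppose you are predicting house prices. Your model guesses $500K. The real price is $450K. Working in thousands of dollars:

$$\text{Loss} = (\text{guess} - \text{actual})^2 = (500 - 450)^2 = 2{,}500$$

If the model guesses $460K instead:

$$\text{Loss} = (460 - 450)^2 = 100$$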
Lower loss means a better guess. The computer tries many guesses and adjusts to make the loss smaller — the method for doing this is gradient descent (explained below).
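The house-price comparison can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `squared_error` is illustrative and not part of the project code, and prices are expressed in thousands of dollars.

```python
def squared_error(guess: float, actual: float) -> float:
    """Squared difference between a prediction and the true value."""
    return (guess - actual) ** 2


actual_price = 450  # true price, in thousands of dollars

loss_first_guess = squared_error(500, actual_price)   # off by 50
loss_second_guess = squared_error(460, actual_price)  # off by 10

print(loss_first_guess, loss_second_guess)  # 2500 100
```

The second guess is five times closer, yet its loss is twenty-five times smaller, which is the "squaring punishes big mistakes" effect in action.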
Why squared? Two reasons: (1) squaring makes negatives positive, so a guess that is $50K too high and one that is $50K too low are equally bad. (2) Squaring punishes big mistakes disproportionately: being off by $100K is four times as bad as being off by $50K (100² = 10,000 versus 50² = 2,500).
"Loss" vs "reward": same idea, opposite sign. In optimization, you minimize loss (lower = better). In reinforcement learning, you maximize reward (higher = better). They are the same concept with the sign flipped:

$$\text{loss} = -\,\text{reward}$$
The three loss functions in the Alpha Dual Engine:
| Loss function | What it measures | Minimized by | Code | Detailed in |
|---|---|---|---|---|
| Portfolio objective | Risk minus momentum minus diversification | SLSQP solver | alpha_engine.py:775 | Section B below |
| PPO policy loss | How much to adjust action probabilities | PPO actor (gradient descent) | rl_weight_agent.py:1116-1119 | Section E below |
| PPO value loss | How wrong the critic's prediction was (mean squared error) | PPO critic (gradient descent) | rl_weight_agent.py:1121 | Section E below |
Each is explained with its full formula in the referenced section. The key insight: all three do the same thing conceptually — define "what is wrong" as a number, then make that number smaller.
PPO combines its losses into one number. In practice, PPO does not minimize policy loss and value loss separately. It adds them together — along with an entropy bonus — into a single total_loss:
total_loss = policy_loss + 0.5 × value_loss − 0.10 × entropy

In the code (rl_weight_agent.py:1127):

```python
total_loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
```

Why combine them? Because gradient descent (explained below) can only walk downhill on one landscape at a time. By adding the three terms into a single number, the optimizer can adjust all the network's parameters in one pass. The three terms pull in different directions: policy loss wants better actions, value loss wants better predictions, and the entropy bonus wants the agent to keep exploring. Gradient descent finds a compromise among the three. The full derivation of each term and how they interact is in Section E.
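To make the sign convention concrete, here is a small sketch of the combination. The default coefficients 0.5 and 0.10 are taken from the formula above; the function name `combined_loss` and the three sample loss values are made up for illustration and do not come from the project.

```python
def combined_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.10):
    """Single scalar that PPO-style training minimizes.

    The entropy term is subtracted, so MORE entropy (more exploration)
    means a LOWER total loss.
    """
    return policy_loss + vf_coef * value_loss - ent_coef * entropy


# Hypothetical values, chosen only to show how the terms interact
low_entropy_total = combined_loss(policy_loss=0.20, value_loss=0.40, entropy=0.50)
high_entropy_total = combined_loss(policy_loss=0.20, value_loss=0.40, entropy=1.50)

print(low_entropy_total, high_entropy_total)
```

With the policy and value losses held fixed, the run with higher entropy gets the lower total, which is how the bonus nudges the agent to keep exploring.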
Gradient descent is how the computer actually makes the loss function smaller. The loss function tells you "how wrong am I?" — gradient descent tells you "which direction should I adjust to be less wrong?"
The hiking analogy: You are blindfolded on a mountain. You want to reach the bottom of the valley. You cannot see, but you can feel the slope under your feet.
- Feel which direction is downhill (compute the gradient)
- Take a step that way (update your parameters)
- Repeat until the ground is flat (loss stopped decreasing)
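What is a "gradient"? The gradient is the slope, generalized to multiple dimensions. With one variable, the slope tells you: "if I nudge $x$ slightly, how much does the loss change?" With 12 portfolio weights there are 12 such slopes, one per weight, collected into a single vector:

$$\nabla L = \left[ \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_{12}} \right]^\top$$

This is a 12×1 column vector: one row per weight, one column total. Each entry says "if I nudge THIS weight slightly, does the loss go up or down?" The gradient points uphill, so you go the opposite direction. (When you see the symbol $\nabla$, read it as "the gradient of.")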
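The update rule is the entire algorithm:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \, \nabla L(\theta_{\text{old}})$$

Three pieces:
- $\theta_{\text{old}}$: the current parameters (where you are standing)
- $\nabla L(\theta_{\text{old}})$: the gradient at that point (which way is uphill)
- $\alpha$: the learning rate (how big a step to take)

The learning rate matters: Too big and you overshoot the valley and bounce around forever. Too small and you creep toward the answer in a million steps. In the RL agents, the regime agent's learning rate is set at rl_regime_agent.py:670 and the weight agent uses 0.0001 (rl_weight_agent.py:1074), because large steps in RL can destroy the policy.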
Concrete example: Minimize $f(x) = x^2$, starting at $x = 10$ with learning rate $\alpha = 0.1$.

Derivative: $f'(x) = 2x$

| Step | $x$ | Loss $x^2$ | Slope $2x$ | Update: $x - 0.1 \times \text{slope}$ |
|---|---|---|---|---|
| 0 | 10.0 | 100.0 | 20.0 | 8.0 |
| 1 | 8.0 | 64.0 | 16.0 | 6.4 |
| 2 | 6.4 | 40.96 | 12.8 | 5.12 |
| 3 | 5.12 | 26.21 | 10.24 | 4.10 |
| ... | ... | ... | ... | ... |
| 20 | 0.115 | 0.013 | 0.23 | 0.092 |

Each step: compute slope, step opposite. The loss shrinks every time. After enough steps, $x$ approaches 0, the minimum of $f$.
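The table can be reproduced with a short loop. This sketch assumes the setup implied by the table rows (minimize x² from a start of 10 with a learning rate of 0.1); the function name `gradient_descent` is illustrative.

```python
def gradient_descent(start, learning_rate, steps):
    """Minimize f(x) = x**2 by repeatedly stepping against the slope 2x."""
    x = start
    history = [x]
    for _ in range(steps):
        slope = 2 * x                  # derivative of x**2 at the current x
        x = x - learning_rate * slope  # the update rule: step opposite the slope
        history.append(x)
    return history


path = gradient_descent(start=10.0, learning_rate=0.1, steps=20)
print([round(x, 2) for x in path[:4]])  # [10.0, 8.0, 6.4, 5.12]
print(round(path[20], 3))               # 0.115

# Same problem with a learning rate that is too large: every step overshoots
# the minimum and lands farther away on the other side.
overshoot = gradient_descent(start=10.0, learning_rate=1.1, steps=5)
print(abs(overshoot[-1]) > abs(overshoot[0]))  # True
```

With a learning rate of 0.1 each step multiplies x by 0.8, so the iterates shrink steadily. With 1.1 each step multiplies x by -1.2, so they flip sign and grow.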
How gradient descent connects to the project:
| Component | Uses gradient descent? | What it does instead | Code |
|---|---|---|---|
| SLSQP (portfolio optimizer) | No | Quadratic subproblems (Section B) — more sophisticated but same spirit | alpha_engine.py:883 |
| PPO actor (policy network) | Yes | Adam optimizer (a fancier gradient descent with momentum) | rl_weight_agent.py:1102 |
| PPO critic (value network) | Yes | Adam optimizer — adjusts predictions to match actual returns | Same network and optimizer |
The neural networks in the RL agents have thousands of parameters. Gradient descent adjusts all of them simultaneously — each one nudged in the direction that reduces the loss.
A loss function (also called cost function or objective function) is a scalar-valued function that quantifies the discrepancy between a model's current output and the desired output. Gradient descent is the iterative algorithm that minimizes the loss by computing the gradient (the vector of partial derivatives indicating the direction of steepest ascent) and stepping in the opposite direction, scaled by a learning rate $\alpha$.
Code: The entire classifier is implemented in the AdaptiveRegimeClassifier class (alpha_engine.py:271-477).
Before the optimizer can pick portfolio weights, the system needs to answer one question: "Is now a good time to take risk?" The answer determines the entire strategy — RISK_ON (aggressive growth), RISK_REDUCED (cautious), or DEFENSIVE (capital preservation). Getting this wrong is catastrophic: being aggressive during a crash wipes out the portfolio; being defensive during a bull run means missing all the gains.
The classifier takes 7 market features as input and outputs a probability between 0 and 1 — the likelihood that the next 21 trading days will be positive for the S&P 500. If the probability is high enough, the system enters RISK_ON. If not, it stays defensive.
The system uses two structurally different models and requires both to agree before taking risk. This consensus acts as a confirmation filter — if two models built from completely different logic independently say "go," the signal is more trustworthy than either model alone.
Model Alpha: XGBoost — "The Aggressor" (alpha_engine.py:281-291)
```python
self.model_alpha = XGBClassifier(
    n_estimators=50,           # 50 sequential trees
    max_depth=3,               # each tree is at most 3 levels deep
    learning_rate=0.05,        # each tree contributes only 5% of the correction
    monotone_constraints=(-1, -1, 0, 1, 1, 1, 0),  # domain logic enforced
    subsample=0.7,             # each tree sees 70% of the data (randomness)
    colsample_bytree=0.7,      # each tree sees 70% of the features (randomness)
    reg_lambda=1.0,            # L2 regularization to prevent overfitting
    random_state=42,
)
```

Model Beta: Decision Tree — "The Skeptic" (alpha_engine.py:294-298)

```python
self.model_beta = DecisionTreeClassifier(
    max_depth=2,               # at most 2 levels of splits: extremely simple
    min_samples_leaf=200,      # each leaf must contain at least 200 data points
    random_state=99,
)
```

The Decision Tree is deliberately shallow (at most 2 levels of splits, so at most 4 leaves and therefore 4 possible outcomes). It cannot learn complex patterns; it acts as a conservative baseline that prevents XGBoost from overreacting to noise.
XGBoost stands for eXtreme Gradient Boosting. The key idea: build many small, weak models (shallow decision trees) sequentially, where each new tree focuses on correcting the mistakes of all the trees before it.
The analogy: Imagine 50 students taking a test one after another. Student 1 answers all questions and gets some wrong. Student 2 sees which questions Student 1 got wrong and focuses on those. Student 3 sees which questions are STILL wrong after Students 1 and 2, and focuses on those. By Student 50, the remaining errors are tiny. The final answer is the combined effort of all 50 students.
How it works step by step:
- Tree 1 makes predictions for the entire training set. Some predictions are wrong — these errors are called residuals.
- Tree 2 is trained NOT on the original target, but on the residuals from Tree 1. It learns to predict "how wrong was Tree 1?"
- The combined prediction is: Tree 1's prediction + (learning_rate × Tree 2's correction). The learning_rate = 0.05 means each tree only contributes 5% of its correction, which prevents any single tree from dominating.
- Tree 3 is trained on the residuals of the combined prediction (Trees 1+2). It corrects what's still wrong.
- Repeat for all 50 trees.
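The steps above can be sketched from scratch in plain Python. This is a deliberately simplified illustration, not how XGBoost is implemented: it uses one-split "stumps" instead of depth-3 trees, plain squared-error residuals instead of log-loss gradients, no regularization, and a made-up toy dataset. Only the 50 trees and the 0.05 learning rate are borrowed from the configuration shown earlier.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (threshold, left value, right value)
    that best fits the current residuals in the least-squares sense."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=50, learning_rate=0.05):
    """Sequentially fit stumps to the residuals of the running prediction."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the average target
    mse_history = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]       # what is still wrong
        threshold, left_value, right_value = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]                 # add a damped correction
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, mse_history


# Toy data (hypothetical): one feature, a noisy step-shaped target
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
preds, mse_history = boost(xs, ys)
print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

Each tree only removes a small fraction of the remaining error, so the mean squared error falls slowly but never rises from one tree to the next.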
The loss function XGBoost minimizes is log loss (binary cross-entropy), the standard loss for binary classification (bull vs bear):

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \ln(\hat{p}_i) + (1 - y_i) \ln(1 - \hat{p}_i) \right]$$

| Symbol | What it is | Example |
|---|---|---|
| $N$ | Number of training examples | 2,500 trading days |
| $y_i$ | Actual outcome: 1 = bull, 0 = bear | 1 (next 21 days were positive) |
| $\hat{p}_i$ | Model's predicted probability of bull | 0.73 |
| $\ln$ | Natural logarithm | $\ln(0.73) \approx -0.315$ |

Why log loss and not just "percentage correct"? Accuracy only checks which side of the threshold a prediction lands on: predicting 0.51 when the answer is 1 counts the same as predicting 0.99 when the answer is 1. Log loss instead rewards confidence in the right answer. If the model says "73% bull" and it was indeed bull, the penalty is small ($-\ln(0.73) \approx 0.315$). A confident wrong answer is punished heavily: saying "99% bull" when the outcome was bear costs $-\ln(0.01) \approx 4.605$.
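Concrete example, one training day:

The model predicts $\hat{p} = 0.73$ and the actual outcome is $y = 1$ (bull):

$$\text{loss} = -\left[ 1 \cdot \ln(0.73) + 0 \cdot \ln(0.27) \right] = -\ln(0.73) \approx 0.315$$

If the prediction had been more confident, for example $\hat{p} = 0.95$:

$$\text{loss} = -\ln(0.95) \approx 0.051$$

Lower loss, rewarding the model for being both correct AND confident.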
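The same arithmetic in Python, as a minimal sketch using only the standard library. The function names are illustrative, and the probabilities 0.95 and 0.99 are example inputs rather than values from the project.

```python
import math


def log_loss_single(y, p):
    """Binary cross-entropy for one example: y is 0 or 1, p is P(bull)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def log_loss(ys, ps):
    """Average binary cross-entropy over a batch of examples."""
    return sum(log_loss_single(y, p) for y, p in zip(ys, ps)) / len(ys)


print(round(log_loss_single(1, 0.73), 3))  # 0.315  correct, moderately confident
print(round(log_loss_single(1, 0.95), 3))  # 0.051  correct, very confident
print(round(log_loss_single(0, 0.99), 3))  # 4.605  wrong, very confident
```

The third line is the case accuracy cannot distinguish from a mild miss: a confident wrong call costs roughly ninety times more than the confident right one.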
Each of the 50 trees in the XGBoost ensemble is a shallow decision tree. A decision tree asks a series of yes/no questions to split the data into groups:
VIX > 25?
/ \
YES NO
(high fear) (calm market)
/ \
SPY momentum Trend score
< -5%? > 2%?
/ \ / \
BEAR UNSURE BULL UNSURE
With max_depth=3, the tree can ask at most 3 questions on any path from the root to a leaf. This is intentionally shallow: a deep tree (max_depth=10) would memorize the training data ("on March 15, 2018, VIX was 23.4 and it was bull") instead of learning general patterns ("high VIX usually means bear").
How the tree chooses where to split: At each node, the tree tries every possible split of every feature (e.g., "VIX > 20?" vs "VIX > 25?" vs "VIX > 30?" vs "momentum > 0?" vs ...) and picks the one that produces the most information gain — the split that reduces the log loss the most.
The classifier uses 7 features, each capturing a different dimension of market conditions (alpha_engine.py:207-241):
| # | Feature | Code | What it measures |
|---|---|---|---|
| 1 | realized_vol | vix / 100 | Current fear level (VIX, scaled) |
| 2 | vol_momentum | (vix / vix_21d_ago) - 1 | Is fear rising or falling? |
| 3 | equity_risk_premium | 1/(SPY/SPY_252MA) - risk_free | Are stocks cheap relative to bonds? |
| 4 | trend_score | (SPY - SMA200) / SMA200 × 100 | How far is SPY from its 200-day average? |
| 5 | momentum_21d | SPY 21-day return | Short-term momentum |
| 6 | qqq_vs_spy | QQQ 63d return - SPY 63d return | Is tech leading or lagging? |
| 7 | tlt_momentum | TLT 21-day return | Are bonds rallying (flight to safety)? |
The monotone_constraints=(-1, -1, 0, 1, 1, 1, 0) parameter (alpha_engine.py:285) is a critical safety feature. Each number corresponds to one of the 7 features:
| Feature | Constraint | Meaning |
|---|---|---|
| realized_vol | -1 | Higher VIX must decrease bull probability |
| vol_momentum | -1 | Rising fear must decrease bull probability |
| equity_risk_premium | 0 | No constraint (relationship is ambiguous) |
| trend_score | +1 | Stronger uptrend must increase bull probability |
| momentum_21d | +1 | Higher momentum must increase bull probability |
| qqq_vs_spy | +1 | Tech outperformance must increase bull probability |
| tlt_momentum | 0 | No constraint (bonds can mean safety OR rate cuts) |
Without these constraints, XGBoost might learn spurious patterns from limited data — for example, "VIX at 35 was bullish" from a single lucky data point during the 2020 recovery. The monotonic constraints enforce that the model's learned function never contradicts basic financial logic: high fear cannot be bullish, strong momentum cannot be bearish.
The regime decision requires both models to independently agree (alpha_engine.py:356-362):
```python
for pa, pb, t in zip(probs_a, probs_b, test_trends):
    if pa > 0.55 and pb > 0.50 and t > 0:
        test_preds.append(1)  # RISK_ON only if ALL THREE conditions met
    else:
        test_preds.append(0)  # Default to safety
```

Three independent conditions must ALL be true:
- XGBoost says > 55% bull probability
- Decision Tree says > 50% bull probability
- The trend score is positive (SPY above its 200-day average)
If any condition fails, the system stays defensive. This "default to safety" design ensures the system only takes risk when multiple structurally different signals agree.
The final probability used for regime decisions is the average of both models, smoothed with an exponential moving average (alpha_engine.py:396-417):
```python
probabilities.loc[test_dates] = (probs_a + probs_b) / 2
# ...
return probabilities.ffill().ewm(span=10).mean()
```

The EMA smoothing with span=10 prevents daily jitter. Without it, the probability might oscillate between 0.54 and 0.56 and trigger rapid regime flipping.
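To see what span=10 does to a jittery series, the sketch below re-implements the exponentially weighted mean in plain Python. It follows the weighting that pandas applies by default (`adjust=True`), where each output is a weighted average with weights $(1-\alpha)^i$ and $\alpha = 2 / (\text{span} + 1)$. The helper name `ewm_mean` and the alternating input series are illustrative.

```python
def ewm_mean(values, span):
    """Exponentially weighted mean with pandas-style 'adjusted' weights."""
    alpha = 2 / (span + 1)
    numerator = 0.0    # running sum of weight * value
    denominator = 0.0  # running sum of weights
    smoothed = []
    for value in values:
        numerator = value + (1 - alpha) * numerator
        denominator = 1 + (1 - alpha) * denominator
        smoothed.append(numerator / denominator)
    return smoothed


raw = [0.54, 0.56] * 10          # probability flipping by 0.02 every day
smoothed = ewm_mean(raw, span=10)

raw_swing = max(raw[-10:]) - min(raw[-10:])
smoothed_swing = max(smoothed[-10:]) - min(smoothed[-10:])
print(round(raw_swing, 4), round(smoothed_swing, 4))
```

For this alternating input the day-to-day swing shrinks to roughly a tenth of its raw size, so small oscillations move the smoothed probability far less than they move the raw one.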
A standard ML train/test split would be: train on 2010-2020, test on 2021-2024. But this has a fatal flaw in finance: the market changes over time (regime shifts, new sectors, changing correlations). A model trained only on 2010-2020 data might not generalize to post-2020 conditions.
Walk-forward training (alpha_engine.py:315-417) solves this by repeatedly retraining as new data arrives:
| Window | Train on | Test on | What happens |
|---|---|---|---|
| 1 | 2010–2015 | 2016 | First model, 5 years of training data |
| 2 | 2010–2016 | 2017 | Retrained with 1 more year of data |
| 3 | 2010–2017 | 2018 | Retrained again — now has 2018 volatility data |
| ... | ... | ... | ... |
| N | 2010–2023 | 2024 | Latest model, all available history |
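The expanding windows in the table can be generated mechanically. This is a sketch using calendar-year labels only: the helper `expanding_windows` is hypothetical, and the real code works on dated rows rather than whole years.

```python
def expanding_windows(train_start, first_test_year, last_test_year):
    """Yield (train_start, train_end, test_year) with a growing training span."""
    windows = []
    for test_year in range(first_test_year, last_test_year + 1):
        train_end = test_year - 1  # training always stops before the test year
        windows.append((train_start, train_end, test_year))
    return windows


windows = expanding_windows(train_start=2010, first_test_year=2016, last_test_year=2024)
for start, end, test in windows[:3]:
    print(f"train {start}-{end}, test {test}")
print(len(windows))  # 9
```

Because `train_end` is always one year before `test_year`, no window ever evaluates a model on data it was trained on.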
Each window: train both models on all data up to the cutoff, then predict the next year. The models are never tested on data they trained on — this prevents overfitting. The target variable is whether the S&P 500 was positive over the next 21 trading days (alpha_engine.py:327):
```python
# Forward-looking label: 1 if the next 21 daily returns sum to more than 0
target = (returns.shift(-21).rolling(21).sum() > 0).astype(int).dropna()
```

Because black-box models are unacceptable in institutional advisory, the classifier integrates SHAP (SHapley Additive exPlanations) (alpha_engine.py:384-394). SHAP assigns each feature a contribution score for every prediction. For example, on a specific day, it might show:
- VIX = 32 pushed the probability down by 0.15
- Strong SPY momentum pushed it up by 0.12
- Tech underperformance pushed it down by 0.08
- Net: the probability dropped because fear outweighed momentum
This allows the user to explain exactly why the system made each regime decision — not just what it decided.
The classifier's output feeds into the regime decision (alpha_engine.py:419-434):
```python
def get_regime(self, ml_prob, spy_above_sma, ...):
    if spy_above_sma:       # MASTER SWITCH
        return 'RISK_ON'
    if ml_prob > 0.55:      # ML tiebreaker
        return 'RISK_REDUCED'
    return 'DEFENSIVE'      # Default to safety
```

The SPY > 200-SMA check is a master switch that overrides the ML models: if SPY is above its 200-day moving average, the system goes RISK_ON regardless of what XGBoost says. The ML probability only matters when SPY is below the 200-SMA; in that uncertain zone, the classifier breaks the tie.
The regime then flows to the optimizer (Section B), which selects a different objective function depending on the regime:
| Regime | Objective | Effect on portfolio |
|---|---|---|
| RISK_ON | Maximize momentum, minimize risk | Concentrated in top performers |
| RISK_REDUCED | Mean-variance with higher risk aversion | More balanced, lower exposure |
| DEFENSIVE | Minimize risk, maximize safe havens | Heavily weighted toward bonds and gold |
The AdaptiveRegimeClassifier is a consensus ensemble combining XGBoost (50 gradient-boosted trees, max depth 3, with monotonic constraints encoding financial domain logic) and a shallow Decision Tree (max depth 2, min 200 samples per leaf) that serves as a conservative baseline. Both models must independently signal bullish probability above their respective thresholds (55% for XGBoost, 50% for Decision Tree) while SPY trend is positive for the system to enter RISK_ON. The ensemble is trained using walk-forward validation with expanding windows (initial 5-year training period, 12-month step), ensuring models are never evaluated on training data. The loss function minimized is binary cross-entropy (log loss), and SHAP values provide per-prediction feature attribution for explainability. The classifier's output probability, smoothed via a 10-day EMA, feeds into a hierarchical regime decision where SPY > 200-SMA acts as a master override.
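The heart of the portfolio optimizer is the function it tries to minimize:

$$\mathcal{L}(\mathbf{w}) = \lambda_{\text{risk}} \mathbf{w}^\top \Sigma \mathbf{w} - \lambda_{\text{mom}} (\mathbf{w} \cdot \mathbf{M}) - \lambda_{\text{entropy}} H(\mathbf{w})$$

The goal: find the weight vector $\mathbf{w}$ that makes $\mathcal{L}(\mathbf{w})$ as small as possible while satisfying every constraint (weights sum to 1, caps, floors).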
In the code (alpha_engine.py:775-801):
```python
def objective(w):
    # Momentum term: dot product of weights and momentum scores, scaled
    momentum_reward = np.dot(w, mom_arr) * config.ir_score_multiplier
    # Skip near-zero weights so that log() is always defined
    w_pos = w[w > 1e-6]
    entropy = -np.sum(w_pos * np.log(w_pos)) if len(w_pos) > 0 else 0
    norm_entropy = entropy / max_entropy
    # Portfolio volatility: sqrt(w' Σ w)
    port_vol = np.sqrt(np.dot(w.T, np.dot(cov_arr, w)))
    # Excerpt: the penalty terms below are defined in lines not shown here
    return -momentum_reward - config.entropy_lambda * norm_entropy + growth_penalty + turnover_penalty + vol_penalty
```

This objective function is not a universal formula like the area of a circle; it is custom-built for this strategy. However, it is assembled from standard, well-established components:
| Component | Origin | Custom part |
|---|---|---|
| Risk term $\mathbf{w}^\top \Sigma \mathbf{w}$ | Markowitz mean-variance theory (1952) | Standard: textbook portfolio risk |
| Momentum term $\mathbf{w} \cdot \mathbf{M}$ | Dot product is standard linear algebra | Cubing the momentum scores to amplify winners is a design choice |
| Entropy term $H(\mathbf{w})$ | Shannon information theory (1948) | Using entropy as a diversification nudge in portfolio optimization is a design choice |
| The combination: risk − momentum − entropy | — | Custom — the decision to combine these three terms with these signs and lambda weights is specific to this strategy |
| The constraints (30% cap, crypto floor, gold cap, etc.) | — | Entirely custom — these encode the specific investment thesis |
The individual ingredients are established mathematics that appear in textbooks. The recipe — which terms to include, with what signs, at what weightings, under which constraints — is a design decision made for this strategy. Nobody else using scipy.optimize.minimize(method='SLSQP') would have the same objective function unless they copied this one.
The objective combines three competing goals into a single scalar through a sign convention:
- Term 1 (risk) is positive → minimizing the objective pushes risk down.
- Term 2 (momentum) has a negative sign → minimizing $-\text{momentum}$ is mathematically equivalent to maximizing momentum.
- Term 3 (entropy) has a negative sign → minimizing $-\text{entropy}$ is mathematically equivalent to maximizing entropy (diversification).
This is a standard technique in optimization: rather than building separate maximizers and minimizers, you flip the sign of anything you want to maximize and minimize the whole expression. One solver, one pass, three goals satisfied simultaneously.
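The sign convention can be checked numerically with a small sketch of the three-term formula. Everything here is illustrative: the function name, the 3-asset covariance matrix, the momentum scores and the default lambda values are invented, and the sketch follows the three-term formula rather than the project's code, which additionally normalizes entropy and adds penalty terms.

```python
import math


def portfolio_objective(w, cov, momentum, lam_risk=1.0, lam_mom=1.0, lam_entropy=0.1):
    """risk - momentum - entropy, each scaled by its own lambda."""
    n = len(w)
    risk = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))  # w' Σ w
    reward = sum(wi * mi for wi, mi in zip(w, momentum))                      # w · M
    entropy = -sum(wi * math.log(wi) for wi in w if wi > 1e-6)                # H(w)
    return lam_risk * risk - lam_mom * reward - lam_entropy * entropy


# Hypothetical 3-asset inputs
cov = [[0.040, 0.006, 0.002],
       [0.006, 0.090, 0.004],
       [0.002, 0.004, 0.010]]
momentum = [0.10, 0.15, 0.03]

all_in = [1.0, 0.0, 0.0]
equal = [1 / 3, 1 / 3, 1 / 3]

print(portfolio_objective(all_in, cov, momentum))
print(portfolio_objective(equal, cov, momentum))
```

Switching the lambdas on one at a time shows each term in isolation: with only the entropy term active, the equal-weight portfolio scores $-\ln 3$ while the all-in portfolio scores 0, so minimizing favors the diversified one.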
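Important clarification: Despite the $\lambda$ symbols, $\lambda_{\text{risk}}$, $\lambda_{\text{mom}}$, and $\lambda_{\text{entropy}}$ are tuning weights set in the configuration, not Lagrange multipliers. Lagrange multipliers appear later, when the solver enforces the constraints.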
In a simple case without constraints, you would take the derivative, set it to zero, and solve. But here:
- The entropy term $H(\mathbf{w}) = -\sum w_i \ln(w_i)$: the $\ln(w_i)$ makes the derivative non-linear
- The inequality constraints (bounds, caps, floors): you cannot just "set derivative = 0" when the answer must also satisfy 10+ inequalities
- The growth anchor penalty uses $\max(0, \ldots)^2$: the $\max$ introduces a kink, so although the squared penalty has a continuous slope, its curvature jumps at the point where the penalty switches on
The inequality constraints are the fundamental barrier. With equality constraints only (e.g., "weights sum to 1"), you could apply Lagrange multipliers, set up a system of equations, and solve via Gaussian elimination, all doable by hand. But with inequality constraints (e.g., "SMH $\le$ 30%"), you do not know in advance which constraints will be active (held exactly at their limit) and which will be slack at the solution.
With approximately 20 inequality constraints, there are $2^{20} = 1{,}048{,}576$ possible combinations of active and inactive constraints, far too many to test one by one by hand.
SLSQP's active set method handles this automatically: it maintains a working guess of which constraints are active, solves the resulting equality-only subproblem, checks whether any inactive constraints are violated or any active constraints should be released, adjusts the active set, and repeats. It typically converges in 20-50 iterations rather than exhaustively searching all $2^{20}$ combinations.
The code uses scipy.optimize.minimize with the SLSQP method (alpha_engine.py:883-890). SLSQP is expanded here as Sequential Least Squares Quadratic Programming; SciPy's own documentation spells it Sequential Least SQuares Programming.
The name, decoded:
- Sequential — it solves the problem step by step, not all at once
- Least Squares — each quadratic subproblem is rewritten as an equivalent constrained least-squares problem, which is what the solver's inner routine actually solves
- Quadratic Programming — at each step, it pretends the problem is a simpler "quadratic" (parabola-shaped) problem and solves that instead
The core idea: "Approximate and step"
Imagine you are blindfolded on a hilly landscape and you need to find the lowest valley. You cannot see the whole landscape, but you CAN feel the ground right around your feet.
What SLSQP does at each step:
Step 1 — Start with a guess. The solver picks an initial set of weights (actually it tries multiple random starting points via _multi_start_optimize).
Step 2 — Feel the ground around you. Compute the gradient (slope) and curvature (is the slope getting steeper or flatter?) at the current position.
Step 3 — Build a mental model. Approximate the nearby landscape as a simple parabola (a bowl shape). A parabola is easy to solve — its minimum is just the bottom of the bowl. This is the "Quadratic Programming" part.
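Mathematically, at the current weights $\mathbf{w}_k$, the objective is replaced by its local quadratic model:

$$\mathcal{L}(\mathbf{w}) \approx \mathcal{L}(\mathbf{w}_k) + \nabla \mathcal{L}(\mathbf{w}_k)^\top (\mathbf{w} - \mathbf{w}_k) + \frac{1}{2} (\mathbf{w} - \mathbf{w}_k)^\top \mathbf{H}_k (\mathbf{w} - \mathbf{w}_k)$$

Here $\nabla \mathcal{L}(\mathbf{w}_k)$ is the gradient and $\mathbf{H}_k$ is the Hessian (or its BFGS approximation) at $\mathbf{w}_k$.

Do not confuse these two formulas. The objective function $\mathcal{L}(\mathbf{w}) = \lambda_{\text{risk}} \mathbf{w}^\top \Sigma \mathbf{w} - \lambda_{\text{mom}} (\mathbf{w} \cdot \mathbf{M}) - \lambda_{\text{entropy}} H(\mathbf{w})$ is the real problem: what we want to minimize. The formula above is SLSQP's approximation of that problem at a single point $\mathbf{w}_k$. SLSQP never solves the objective function directly. Instead, at each iteration, it builds this quadratic approximation (a parabola that matches the objective's value, slope, and curvature at the current point), solves the parabola exactly, moves to the answer, and rebuilds. The objective function is the destination; the quadratic subproblem is the vehicle.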
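The "build a parabola, jump to its bottom, rebuild" loop can be seen in one dimension with no constraints at all. The sketch below is that stripped-down analogue (Newton's method), not SLSQP itself: the toy function $f(x) = x^4/4 + x^2/2 - 2x$ is invented for illustration, and the real solver additionally handles constraints and estimates curvature with BFGS instead of computing it exactly.

```python
def f_prime(x):
    """Slope of the toy function f(x) = x**4 / 4 + x**2 / 2 - 2 * x."""
    return x ** 3 + x - 2


def f_double_prime(x):
    """Curvature of the same toy function (always positive, so f is bowl-shaped)."""
    return 3 * x ** 2 + 1


def approximate_and_step(x, iterations):
    """At each point, fit the local parabola and jump to its minimum."""
    path = [x]
    for _ in range(iterations):
        slope = f_prime(x)
        curvature = f_double_prime(x)
        x = x - slope / curvature  # bottom of the parabola built at the current x
        path.append(x)
    return path


path = approximate_and_step(3.0, iterations=8)
print([round(x, 6) for x in path])
```

The true minimum is at x = 1, where the slope is zero. Early jumps are rough because the parabola is only accurate near the point where it was built, but each rebuild lands closer and the last few iterates agree to many decimal places.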
This formula is the second-order Taylor expansion — the same approximation taught in introductory calculus, extended to multiple variables. If you have seen the 1D version, you already know the SLSQP formula.
The 1D Taylor expansion you learned in calculus:
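For any function $f(x)$, near a point $a$:

$$f(x) \approx f(a) + f'(a)(x - a) + \frac{1}{2} f''(a)(x - a)^2$$

Concrete example: $f(x) = x^3$ near $a = 2$.
- $f(2) = 8$: the value where you are standing
- $f'(2) = 3 \times 2^2 = 12$: the slope at that point
- $f''(2) = 6 \times 2 = 12$: the curvature at that point

Plugging in:

$$f(x) \approx 8 + 12(x - 2) + 6(x - 2)^2$$

That is a parabola that touches the true $f(x) = x^3$ curve at $x = 2$ and matches its value, slope, and curvature there.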
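A quick numerical check shows how good the parabola is near the expansion point and how it degrades farther away. This sketch assumes the example function is x³ expanded around 2, which is what the derivative values above imply; the function names are illustrative.

```python
def f(x):
    return x ** 3


def taylor_quadratic(x, a=2.0):
    """Second-order Taylor approximation of x**3 around the point a."""
    value = a ** 3          # f(a)
    slope = 3 * a ** 2      # f'(a)
    curvature = 6 * a       # f''(a)
    return value + slope * (x - a) + 0.5 * curvature * (x - a) ** 2


for x in (2.0, 2.1, 2.5, 3.0):
    error = f(x) - taylor_quadratic(x)
    print(f"x={x}: true={f(x):.3f}  parabola={taylor_quadratic(x):.3f}  error={error:.3f}")
```

For this particular function the error is exactly $(x - 2)^3$: negligible at 2.1, but a full unit at 3.0. That growth is why the solver rebuilds the approximation after every step instead of trusting one parabola everywhere.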
Going from 1D to 12D — nothing new is invented, scalars become vectors:
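| 1D (one weight) | 12D (twelve weights) | What changed |
|---|---|---|
| $x$ | $\mathbf{w}$ | Scalar → vector |
| $a$ | $\mathbf{w}_k$ | Same idea, just 12 numbers instead of 1 |
| $f'(a)$ | $\nabla \mathcal{L}(\mathbf{w}_k)$ | One derivative → a list of 12 partial derivatives (the gradient) |
| $f''(a)$ | $\mathbf{H}_k$ | One second derivative → 144 second derivatives (the Hessian matrix) |
| $(x - a)$ | $(\mathbf{w} - \mathbf{w}_k)$ | Same idea |
| $f''(a)(x - a)^2$ | $(\mathbf{w} - \mathbf{w}_k)^\top \mathbf{H}_k (\mathbf{w} - \mathbf{w}_k)$ | Squaring one number → a matrix sandwich that accounts for curvature in every direction and every pairwise interaction |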
The SLSQP formula is the 1D Taylor expansion with vectors and matrices substituted for single numbers. The $\frac{1}{2}$ in front of the curvature term carries over unchanged.
SLSQP visualized with 2 weights (a slice of the full 12D problem), viewed from above as a contour map.
Blue contours: the true objective function. Green dashed contours: SLSQP's quadratic approximation (the "bowl") —
notice the green ellipses match the blue contours near the current point (orange dot) but diverge further away.
At each iteration, SLSQP jumps from the current point to the bowl's bottom (green square), rebuilds the
approximation, and repeats until it reaches the true minimum (blue star).
The real 12-weight version is this same concept in 12 dimensions — and internally the solver adds a 13th variable,
the Lagrange multiplier λ, which enforces the "weights sum to 1" constraint.
The 2D contour plot above is a slice — pick 2 of the 12 weights, freeze the other 10, and you get a picture. The real 12-weight version cannot be visualized because humans cannot see beyond 3 dimensions. What the solver actually works with at each iteration is pure numbers:
| Object | 2D (the GIF above) | 12D (the real code) |
|---|---|---|
| Current weights | A dot on a 2D plane | A list of 12 numbers: [0.22, 0.18, 0.30, ...] |
| Gradient $\nabla \mathcal{L}$ | A 2D arrow | 12 numbers: one slope per weight |
| Hessian $\mathbf{H}$ | A 2×2 matrix (4 numbers) | A 12×12 matrix (144 numbers, 78 unique) |
| Contour lines | Curves you can see | 11-dimensional hypersurfaces (impossible to picture) |
| The "bowl" | An ellipse | A 12-dimensional ellipsoid (impossible to picture) |
| Step to next point | An arrow on the plane | A 12-number direction vector |
The math is identical — more weights means a bigger gradient vector, a bigger Hessian matrix, and a bigger linear system to solve, but the algorithm is the same. In the actual code (alpha_engine.py:883), scipy.optimize.minimize(..., method='SLSQP') handles all of this internally. You never see the Hessian or the individual iterations — just the final 12 weights that come out.
Unlike the objective function (which is custom-built for this strategy), the quadratic subproblem formula is entirely standard mathematics. It is the second-order Taylor expansion — the same approximation taught in multivariable calculus courses and used across all of numerical optimization, not just finance. Newton's method, quasi-Newton methods, and all sequential quadratic programming (SQP) solvers use this same expansion. The only project-specific element is what is being approximated: in our case, the portfolio objective function. The approximation machinery itself is off-the-shelf.
| Component | Origin | What problem it solves |
|---|---|---|
| Second-order Taylor expansion | Calculus (Taylor, 1715) | The objective is too complex to minimize directly — approximate it as a parabola locally |
| Hessian matrix | Multivariable calculus (Hesse, 1842) | With 12 weights, we need curvature in every direction simultaneously; the Hessian is the multi-dimensional second derivative |
| BFGS approximation of the Hessian | Numerical optimization (Broyden, Fletcher, Goldfarb, Shanno, 1970) | Computing the exact 12×12 Hessian (78 second derivatives) every iteration is expensive; BFGS infers curvature by watching how the gradient changes between steps |
| SQP framework | Constrained optimization (Wilson, 1963; Han, 1976) | Taylor + Hessian gives an unconstrained quadratic, but the real problem has constraints (weights sum to 1, caps, floors) — SQP wraps each subproblem in Lagrange multipliers and active-set methods |
Each component solves a limitation left by the previous one. Taylor simplifies the function but only works for one variable. The Hessian extends it to 12 variables but is expensive to compute exactly. BFGS makes it cheap but doesn't handle constraints. SQP adds constraint handling. Together, these four layers form SLSQP — the "Sequential" refers to solving a sequence of these quadratic subproblems, each one more accurate than the last.
The three terms in the quadratic subproblem each answer a different question at the current point:
| Term | Formula | Question it answers | Analogy |
|---|---|---|---|
| Value | $\mathcal{L}(\mathbf{w}_k)$ | "How good is the current position?" | Your altitude on the hill right now |
| Gradient (first-order) | $\nabla \mathcal{L}^\top \mathbf{d}$ | "Which direction is downhill, and how steep?" | The slope of the ground under your feet |
| Curvature (second-order) | $\tfrac{1}{2}\,\mathbf{d}^\top B\,\mathbf{d}$ | "How far away is the bottom?" | Whether you're in a wide valley (far to walk) or a narrow gorge (bottom is right below) |
- With only the value term, you know where you are but not where to go.
- Adding the gradient tells you the direction, but not the distance — you'd overshoot or undershoot.
- Adding the curvature (the Hessian $B$) gives you both direction and distance, so SLSQP can jump directly to the approximate bottom of the bowl in one step.
This is why the quadratic approximation is more powerful than gradient descent: gradient descent only uses the first two terms (value + slope) and must take many small, cautious steps. SLSQP uses all three terms and can take large, confident jumps — at the cost of computing or approximating the Hessian.
Why a quadratic approximation? Any smooth function, if you zoom in close enough, looks like a parabola. A linear approximation (straight line) tells you which direction is downhill, but not how far to go. A quadratic approximation (parabola) tells you the direction AND roughly where the bottom is.
The gradient (first derivatives) tells you the slope — which direction is downhill. But slope alone does not tell you how far to walk. Two landscapes can have the same slope at your feet but completely different shapes:
- Gentle curvature (wide bowl): the slope is steep but the valley floor is far away → take a big step
- Sharp curvature (narrow bowl): the slope is steep but the valley floor is right below you → take a small step
The Hessian matrix captures exactly this distinction. It is a table of second derivatives — derivatives of derivatives — measuring how fast the slope itself is changing in every direction.
Building the intuition step by step:
For a function of one variable, say $f(x) = 3x^2$:
- First derivative $f'(x) = 6x$ → the slope. At $x = 5$, the slope is 30 (steep uphill).
- Second derivative $f''(x) = 6$ → the curvature. It is constant: the bowl has the same width everywhere. A large second derivative means a narrow bowl (minimum is nearby); a small one means a wide bowl (minimum is far away).
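To make the "slope gives direction, curvature gives distance" idea concrete, here is a minimal sketch in plain Python. It assumes the one-variable function is $f(x) = 3x^2$ (any function with $f'(x) = 6x$ and $f''(x) = 6$ behaves the same way), and the learning rate is a hypothetical choice made only for this illustration.

```python
# Assumed example function: f(x) = 3x^2, so f'(x) = 6x and f''(x) = 6.
def f_prime(x):
    return 6.0 * x

def f_double_prime(x):
    return 6.0

x0 = 5.0

# Newton-style step: divide the slope by the curvature to get the distance.
newton_x = x0 - f_prime(x0) / f_double_prime(x0)   # 5 - 30/6 = 0

# Plain gradient descent: slope only, with a hypothetical fixed learning rate.
learning_rate = 0.05
x = x0
gd_steps = 0
while abs(x) > 1e-6:
    x -= learning_rate * f_prime(x)
    gd_steps += 1

print("Newton step lands at:", newton_x)
print("Gradient descent steps needed:", gd_steps)
```

Because the example function is an exact parabola, the curvature-aware step reaches the minimum at 0 in a single move, while the slope-only loop needs dozens of small steps to get within the tolerance.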
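For a function of two variables, say $f(x, y) = 3x^2 + 2xy + 5y^2$:
First, take the first partial derivatives (the gradient). A partial derivative means differentiating with respect to one variable while treating all others as constants; the curly $\partial$ symbol signals this:
$$\frac{\partial f}{\partial x} = 6x + 2y, \qquad \frac{\partial f}{\partial y} = 2x + 10y$$
Then, take partial derivatives of those partial derivatives (second partial derivatives):
From the first partial derivative with respect to $x$ (that is, $6x + 2y$): $\partial^2 f / \partial x^2 = 6$ and $\partial^2 f / \partial y \, \partial x = 2$.
From the first partial derivative with respect to $y$ (that is, $2x + 10y$): $\partial^2 f / \partial x \, \partial y = 2$ and $\partial^2 f / \partial y^2 = 10$.
Finally, arrange the four numbers into a grid — that is the Hessian:
$$\begin{bmatrix} 6 & 2 \\ 2 & 10 \end{bmatrix}$$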
The "matrix" is just bookkeeping — four numbers placed in a 2x2 table. There is no matrix multiplication or linear algebra involved in computing the Hessian. You take derivatives twice and write the results in a grid. Notice the off-diagonals are both 2 — this always happens (the order of differentiation does not matter), which is why the Hessian is always symmetric.
What each entry means:
- Top-left = 6 ($\partial^2 f / \partial x^2$): Curvature in the $x$ direction alone — how fast the $x$-slope changes as you move in $x$.
- Bottom-right = 10 ($\partial^2 f / \partial y^2$): Curvature in the $y$ direction alone. Since 10 > 6, the bowl is narrower in $y$ than in $x$ — the optimizer needs a smaller step in $y$ to reach the bottom.
- Off-diagonals = 2 ($\partial^2 f / \partial x \, \partial y$): The interaction between variables — changing $x$ affects the slope in the $y$ direction. If this were 0, the two variables would be completely independent.
The off-diagonal entries are always symmetric (the top-right and bottom-left are equal). This is why the Hessian is always a symmetric matrix.
For the portfolio with 12 weights, the Hessian is a 12x12 symmetric matrix — 12 diagonal entries (how curvy the landscape is in each weight's direction) and 66 off-diagonal entries (how each pair of weights interacts). Together, these 78 unique numbers completely describe the shape of the local bowl that SLSQP uses to decide where to step.
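The "take derivatives twice and write the results in a grid" recipe can be checked numerically. The sketch below uses NumPy and central finite differences; the test function $3x^2 + 2xy + 5y^2$ is an assumed example chosen so that its Hessian is the 2x2 grid with entries 6, 2, 2, 10, and `numerical_hessian` is a hypothetical helper, not something from the project code.

```python
import numpy as np

def f(v):
    # Assumed example function whose Hessian is [[6, 2], [2, 10]].
    x, y = v
    return 3 * x**2 + 2 * x * y + 5 * y**2

def numerical_hessian(func, point, h=1e-4):
    """Estimate all second partial derivatives with central differences."""
    point = np.asarray(point, dtype=float)
    n = point.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_j = np.zeros(n)
            e_i[i] = h
            e_j[j] = h
            H[i, j] = (
                func(point + e_i + e_j) - func(point + e_i - e_j)
                - func(point - e_i + e_j) + func(point - e_i - e_j)
            ) / (4 * h * h)
    return H

H = numerical_hessian(f, [1.0, -2.0])
print(np.round(H, 3))

# Unique entries of a symmetric n x n matrix: n diagonal + n(n-1)/2 pairs.
n_assets = 12
unique_entries = n_assets + n_assets * (n_assets - 1) // 2
print(unique_entries)
```

The estimate is the same at any evaluation point because the example function is quadratic, and the count for 12 assets comes out to the 78 unique numbers mentioned above.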
Why SLSQP approximates the Hessian instead of computing it exactly:
Computing the exact 12x12 Hessian requires evaluating 78 second derivatives at every iteration, which is expensive. Instead, SLSQP uses a technique called BFGS (Broyden-Fletcher-Goldfarb-Shanno) that estimates the Hessian from gradient changes between iterations. The logic is: "last step I moved from point A to point B, and the gradient changed from its value at A to its value at B, so the curvature along that direction must be roughly the change in gradient divided by the distance moved."
What else BFGS does beyond approximating the Hessian:
- It updates, not recomputes. BFGS does not build a new 12×12 matrix from scratch each iteration. It takes the previous estimate and applies a small correction using two vectors from the latest step: $s_k = w_{k+1} - w_k$ (how far you moved) and $y_k = \nabla L_{k+1} - \nabla L_k$ (how much the gradient changed). If you moved a little and the gradient changed a lot, the curvature in that direction is steep. If the gradient barely changed, it is flat. BFGS encodes this into a rank-2 update, meaning it adjusts only two "directions" in the matrix per step and leaves the rest unchanged.
- It starts from the identity matrix. At the very first iteration, BFGS has no history, so it sets $B_0 = I$ (the identity matrix: all 1s on the diagonal, 0s elsewhere). This is equivalent to assuming "the curvature is equal in all directions," which makes the first step essentially gradient descent. But after a few iterations, the estimate improves rapidly.
- It is self-correcting. Even if early estimates are poor, each new step refines the approximation. After roughly $n$ steps (where $n$ is the number of variables, 12 for this portfolio), BFGS has "seen" curvature information in enough directions to have a good estimate of the full Hessian.
- It keeps the bowl opening upward. The Hessian estimate must be positive definite (the bowl must open upward, not downward) for the quadratic subproblem to have a minimum. If the bowl opened downward, the "minimum" would be at negative infinity, which is nonsensical. A BFGS update preserves positive definiteness whenever the latest step satisfies $s_k^\top y_k > 0$, and SQP implementations typically damp or skip the update when that condition fails. The true Hessian, by contrast, might not be positive definite for non-convex functions. This is a safety property: the subproblem built from the BFGS estimate always has a minimum.
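The rank-2 correction described above can be written in a few lines of NumPy. This is a sketch of the textbook BFGS formula, not the solver's internal code (SciPy's SLSQP keeps its estimate in a factored form and this is not visible to the caller). The "true" Hessian and the two steps are hypothetical values picked so the arithmetic is easy to follow.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Textbook BFGS update of a Hessian estimate B.

    s: how far the weights moved; y: how much the gradient changed.
    """
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Hypothetical quadratic objective with a known Hessian, so the gradient
# change over a step s is exactly H_true @ s.
H_true = np.array([[6.0, 2.0], [2.0, 10.0]])

B = np.eye(2)  # B_0 = I: "curvature is equal in all directions"
steps = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for s in steps:
    y = H_true @ s
    assert y @ s > 0  # curvature condition that keeps B positive definite
    B = bfgs_update(B, s, y)

print(B)
print("eigenvalues:", np.linalg.eigvalsh(B))
```

After these two updates the estimate is [[4, 2], [2, 10]]: symmetric, positive definite, and it reproduces the gradient change of the most recent step exactly. It is close to, but not identical to, the assumed true Hessian, which illustrates that the estimate is refined gradually rather than recovered in one shot.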
The one-sentence summary: The gradient says "go downhill." The Hessian says "the bottom of the hill is approximately this far away in that direction." Without the Hessian, the solver knows which way to walk but not how far — with it, the solver can jump directly to (approximately) the bottom in one step.
Step 4 — Find the bottom of that bowl. But ONLY within the allowed zone (you cannot step outside the fence = constraints):
- All weights must sum to 1 (equality constraint)
- Each weight must stay within its bounds (0% to 30% for equities, etc.)
- Growth anchors must be >= 40% total
- Crypto + growth anchors <= 95%
Important distinction: The objective function
Step 5 — Walk there. Update the weights to that new position.
Step 6 — Repeat steps 2-5 until the function stops improving (converges), or you hit 1000 iterations.
| Method | Handles constraints? | Speed | Used for |
|---|---|---|---|
| Gradient Descent | No | Slow | Deep learning |
| Newton's Method | No | Fast | Unconstrained problems |
| Linear Programming | Only linear problems | Fast | Supply chain, logistics |
| SLSQP | Yes — all types | Fast | Exactly this kind of problem |
SLSQP is the go-to for "small-to-medium nonlinear problems with constraints" — which is exactly what portfolio optimization is (12 weights, ~10 constraints, nonlinear objective).
In formal terms, this is a constrained nonlinear optimization problem. The objective function combines a quadratic risk term, a linear momentum term, and a nonlinear entropy regularizer, subject to equality constraints (weights sum to 1) and bound constraints (per-asset caps). It is solved numerically using SLSQP — a sequential quadratic programming method that approximates the problem as a series of simpler quadratic subproblems at each iteration, converging to a local minimum while respecting all constraints. The objective function itself is not a Lagrangian — it is a cost function to be minimized. Internally, however, SLSQP constructs a Lagrangian at each iteration to enforce the constraints on the quadratic subproblem, using Lagrange multipliers for equality constraints and an active set method for inequality constraints. The method can be understood as Newton's method extended to handle both equality and inequality constraints.
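As a rough illustration of that formal description, the sketch below calls `scipy.optimize.minimize` with `method="SLSQP"` on a toy 3-asset problem. All inputs are assumptions made for this example (a diagonal covariance matrix, three momentum scores, simple 0 to 1 bounds), the entropy term is left out, and the function names are hypothetical; it is not the project's actual objective or constraint set.

```python
import numpy as np
from scipy.optimize import minimize

# Toy inputs (assumed for illustration only).
cov = 2.0 * np.eye(3)                    # variances of 2, no correlation
momentum = np.array([3.0, 1.0, 2.0])     # one momentum score per asset

def objective(w):
    # risk minus momentum (entropy regularizer omitted in this sketch)
    return w @ cov @ w - w @ momentum

def gradient(w):
    return 2.0 * cov @ w - momentum

constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]  # weights sum to 1
bounds = [(0.0, 1.0)] * 3                                         # per-asset bounds
w0 = np.full(3, 1.0 / 3.0)                                        # equal-weight start

result = minimize(
    objective, w0, jac=gradient, method="SLSQP",
    bounds=bounds, constraints=constraints,
    options={"maxiter": 1000, "ftol": 1e-12},
)
weights = result.x
print(np.round(weights, 3), result.success)
```

The caller only supplies the objective, the constraints, and a starting point. The quadratic subproblems, the Hessian estimate, and the multipliers all stay inside the solver, and only the final weights come back.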
The name comes from the shape of the approximation, not from the number of layers or steps.
- "Quadratic" — because SLSQP approximates the objective as a quadratic function (a parabola in 1D, a bowl in 12D). This is the second-order Taylor expansion. Quadratics have a clean closed-form minimum — take the derivative, set it to zero, and solve (see Layer 1 for the 1D case and Layer 2 for the 12D case where this becomes $B\mathbf{d} = -\nabla \mathcal{L}$).
- "Sub" — because it is a smaller, simpler problem solved inside each iteration of the main problem. The main problem (minimize risk - momentum + entropy) is too complex to solve directly because of the entropy logarithm. The subproblem (minimize the bowl approximation) is easy.
- "Sequential" (the "S" in SLSQP) — because the solver solves a sequence of these quadratic subproblems, one per iteration, each at a new point, until convergence.
In short: main problem (hard) → approximate as quadratic subproblem (easy) → solve → move → rebuild → repeat.
Yes — and that is the entire point of SLSQP. It converts one impossible problem into a chain of easy problems. Each individual subproblem is solvable with high school and early university math. Here is how it breaks down, layer by layer.
The simplest optimization problem:
The parabola's bottom is at
Instead of one
To find it, take 12 partial derivatives (one per weight), set them all to zero, and solve the resulting system of equations simultaneously. This reduces to:
Where
Each entry tells the solver how much to adjust one weight. For example,
This is a system of 12 linear equations with 12 unknowns — solvable by Gaussian elimination (systematically manipulating equations to isolate each variable). Tedious with 12 variables, but each operation is just addition, subtraction, multiplication, and division.
Without constraints, the bottom of the bowl might be at weights like
The constraint
This is where the Lagrange multiplier enters. A new variable
Taking derivatives with respect to all 12 weights AND
The multiplier
The real-world intuition: You are at a buffet. You want to eat the most delicious combination of food possible. But you have one rule: your plate can only hold 1 kg total. Without the rule, you would pile on infinite amounts of the best dish. But the 1 kg limit forces tradeoffs — more steak means less dessert. The Lagrange multiplier answers a very specific question: "If my plate could hold 1.01 kg instead of 1 kg, how much more deliciousness could I get?" If the answer is "a lot" — the plate size is really constraining you. If the answer is "barely any" — the plate size does not matter much. The multiplier is a number that measures how much the constraint is costing you.
A worked example from scratch: Find the
Without the constraint, the answer is obviously
The geometric picture: Imagine concentric circles centered at the origin (a bullseye), getting bigger. The constraint is a diagonal line cutting through. You want the smallest circle that still touches the line. The point where the smallest circle is tangent to the line is the answer.
Step 1 — Write the constraint as "something = 0":
Step 2 — Build the Lagrangian (the original function plus the multiplier times the constraint):
Step 3 — Take partial derivatives with respect to every variable (including
The third equation is just the original constraint coming back. This always happens — the derivative with respect to
Step 4 — Solve the system. From equation 1:
The answer:
What does
Why this trick works — the deep geometric intuition: At the optimum on the constraint, two things must be true simultaneously: (1) you are ON the constraint line, and (2) you cannot improve by sliding along the line in either direction. Condition 2 means the gradient (slope) of
Applied to the portfolio: The objective is
Take 13 partial derivatives (12 weights +
Multiple constraints = multiple multipliers. The portfolio has more than one constraint, and each one gets its own multiplier:
| Constraint | Multiplier | What it measures |
|---|---|---|
| Weights sum to 1 | | "How much would a 101% budget help?" |
| SMH <= 30% | | "How much is the SMH cap costing me?" |
| Growth anchors >= 40% | | "How much is the 40% floor costing me?" |
| Crypto <= 15% | | "How much is the crypto cap costing me?" |
If
Every Lagrange multiplier problem ends with a system of linear equations — set all partial derivatives to zero, and you get something like "13 equations with 13 unknowns." Gaussian elimination is the method for solving these systems. It is one of the oldest algorithms in mathematics, and it carries the name of Carl Friedrich Gauss, who relied on it in his orbit calculations from noisy telescope observations: he predicted the position of Ceres in 1801 and published his methods in 1809.
The idea is simple: use one equation to eliminate one variable from all the others, repeat until each equation has only one variable, then read off the answers backwards.
No method needed. Just divide.
Phase 1 — Forward elimination (kill
Multiply the first equation by 2:
Subtract from the second equation:
Now the system looks like a triangle (upper triangular form):
Phase 2 — Back-substitution (work upward):
From the second equation:
Plug into the first:
Answer:
Phase 1 — Forward elimination:
Step 1: Eliminate
Equation 2
Equation 3
Step 2: Eliminate
New Equation 3
Now the system is triangular:
Phase 2 — Back-substitution (bottom to top):
From equation 3:
From equation 2:
From equation 1:
Answer:
- Forward elimination: Use equation 1 to eliminate variable 1 from equations 2 through $N$. Then use the modified equation 2 to eliminate variable 2 from equations 3 through $N$. Continue until the system forms a triangle — each equation has one fewer variable than the one above it.
- Back-substitution: The bottom equation now has one variable — solve it directly. Plug that answer into the equation above to find the next variable. Work upward until all $N$ variables are solved.
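The two phases can be sketched in plain Python as follows. The function name and the 3-equation test system are hypothetical, chosen only for illustration. One step is added that the description above does not mention: before eliminating a column, the row with the largest entry in that column is swapped into place (partial pivoting), which avoids dividing by zero or by a very small number.

```python
def gaussian_elimination(A, b):
    """Solve A x = b by forward elimination and back-substitution."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]

    # Phase 1: forward elimination to upper-triangular form.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]          # partial pivoting
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]

    # Phase 2: back-substitution, bottom equation first.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(M[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (M[row][n] - known) / M[row][row]
    return x

# Hypothetical test system with solution x = 2, y = 3, z = -1.
A = [[2, 1, -1],
     [-3, -1, 2],
     [-2, 1, 2]]
b = [8, -11, -3]
solution = gaussian_elimination(A, b)
print(solution)
```

Every operation inside the loops is a multiplication, division, addition, or subtraction, which is the point made above: the method is tedious by hand but mechanically simple.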
For
When the SLSQP solver builds a quadratic subproblem with Lagrange multipliers, the result of setting all partial derivatives to zero is a system of linear equations — 12 weight unknowns + 1 (or more) multiplier unknowns. The solver uses Gaussian elimination (or a computationally equivalent method like LU decomposition) to solve this system at every iteration. The entire chain:
| Step | What happens | Method |
|---|---|---|
| 1. Original problem | Nonlinear objective with constraints | Cannot be solved directly |
| 2. SLSQP approximation | Replace with quadratic subproblem | Taylor expansion |
| 3. Lagrange multipliers | Convert constraints into derivatives | Set all partials to zero |
| 4. Linear system | Linear equations in the weights and multipliers | Gaussian elimination |
| 5. Basic arithmetic | Multiply, add, subtract, divide | High school math |
Every layer converts an unsolvable problem into a solvable one. At the very bottom, the entire optimization reduces to nothing more than multiplication and addition — the same operations taught in primary school. The computer's advantage is not intelligence; it is speed.
To see exactly what the solver computes at each iteration, here is a complete worked example using a simplified 3-asset portfolio.
Setup:
- 3 assets (A, B, C) with momentum scores $M = [3, 1, 2]$ (Asset A trending strongest, B weakest)
- Covariance matrix (all variances = 2, all correlations = 0):
| | Asset A | Asset B | Asset C |
|---|---|---|---|
| Asset A | 2 | 0 | 0 |
| Asset B | 0 | 2 | 0 |
| Asset C | 0 | 0 | 2 |
The diagonal entries are the ones running from top-left to bottom-right (row 1 column 1, row 2 column 2, row 3 column 3) — these are each asset's variance, measuring how much it moves on its own. All three are 2 here, meaning equal volatility.
The off-diagonal entries are everything else — the six zeros. Each one represents the covariance between a pair of assets: row 1 column 2 = how A and B move together, row 1 column 3 = how A and C move together, row 2 column 3 = how B and C move together (the remaining three are mirrors — the matrix is always symmetric). Zero means the assets move independently of each other.
In the real portfolio, these off-diagonals would be non-zero: SMH and QQQ would have a large positive covariance (they tend to rise and fall together), while TLT might have a negative covariance with SMH (bonds often rise when tech falls). But setting them to zero here removes the cross terms and keeps the arithmetic simple.
Before the formulas, here is a concrete worked example showing exactly what variance and covariance mean with actual numbers.
Step 1 — Start with one asset: variance
Suppose SMH (semiconductors) has the following daily returns over 5 days:
| Day | SMH return |
|---|---|
| Mon | +2% |
| Tue | -1% |
| Wed | +3% |
| Thu | -2% |
| Fri | +1% |
The average return is $(2 - 1 + 3 - 2 + 1) / 5 = 3 / 5 = 0.6\%$.
Variance answers one question: "how jumpy is this asset?" To compute it, take each day's return, subtract the average, and square the result:
- Mon: return +2%, deviation $= 2 - 0.6 = +1.4$, squared $= 1.96$
- Tue: return -1%, deviation $= -1 - 0.6 = -1.6$, squared $= 2.56$
- Wed: return +3%, deviation $= 3 - 0.6 = +2.4$, squared $= 5.76$
- Thu: return -2%, deviation $= -2 - 0.6 = -2.6$, squared $= 6.76$
- Fri: return +1%, deviation $= 1 - 0.6 = +0.4$, squared $= 0.16$

Variance $= (1.96 + 2.56 + 5.76 + 6.76 + 0.16) / 4 = 17.2 / 4 = 4.30$
(We divide by 4 instead of 5 — this is called "N-1" or Bessel's correction, a statistical adjustment that gives a better estimate when working with a sample rather than the entire population.)
The squaring serves two purposes: (1) it makes negative deviations positive — being 2% below average is equally "jumpy" as being 2% above average, and (2) it punishes large deviations more — a 3% swing contributes more to variance than three 1% swings combined. A large variance means the asset's returns swing widely; a small variance means they stay close to the average.
Step 2 — Add a second asset: covariance
Now add TLT (bonds) alongside SMH:
| Day | SMH | TLT |
|---|---|---|
| Mon | +2% | -1% |
| Tue | -1% | +2% |
| Wed | +3% | -2% |
| Thu | -2% | +3% |
| Fri | +1% | 0% |
Notice the pattern: when SMH goes up, TLT tends to go down, and vice versa. They move in opposite directions. This is exactly the relationship between stocks and bonds that makes them useful together in a portfolio — when one crashes, the other cushions you.
Covariance answers: "do these two move together or opposite?" Instead of squaring one asset's deviation, you multiply the deviations of both assets together:
TLT average $= (-1 + 2 - 2 + 3 + 0) / 5 = 2 / 5 = 0.4\%$
- Mon: SMH deviation $= 2 - 0.6 = +1.4$, TLT deviation $= -1 - 0.4 = -1.4$, product $= (+1.4) \times (-1.4) = -1.96$
- Tue: SMH deviation $= -1 - 0.6 = -1.6$, TLT deviation $= 2 - 0.4 = +1.6$, product $= (-1.6) \times (+1.6) = -2.56$
- Wed: SMH deviation $= 3 - 0.6 = +2.4$, TLT deviation $= -2 - 0.4 = -2.4$, product $= (+2.4) \times (-2.4) = -5.76$
- Thu: SMH deviation $= -2 - 0.6 = -2.6$, TLT deviation $= 3 - 0.4 = +2.6$, product $= (-2.6) \times (+2.6) = -6.76$
- Fri: SMH deviation $= 1 - 0.6 = +0.4$, TLT deviation $= 0 - 0.4 = -0.4$, product $= (+0.4) \times (-0.4) = -0.16$

Covariance $= (-1.96 - 2.56 - 5.76 - 6.76 - 0.16) / 4 = -17.2 / 4 = -4.30$
The result is negative, confirming what we saw in the table: they move in opposite directions. Here is why the sign works:
- When SMH is above average and TLT is below average, the product is (positive) $\times$ (negative) $=$ negative
- When SMH is below average and TLT is above average, the product is (negative) $\times$ (positive) $=$ negative
- Either way, opposite movement produces negative products → negative covariance

If instead both assets tended to be above average on the same days and below average on the same days, the products would be (positive) $\times$ (positive) $=$ positive or (negative) $\times$ (negative) $=$ positive → positive covariance.
If there is no consistent pattern (some days same direction, some days opposite), the positive and negative products cancel out → covariance near zero, meaning the assets move independently.
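The hand calculations for SMH and TLT can be reproduced with NumPy. The sketch below computes the variance and covariance step by step from the five daily returns in the tables above, then compares against `np.cov`, which divides by N-1 by default. Variable names are chosen for this illustration only.

```python
import numpy as np

# Daily returns in percent, Monday to Friday, from the tables above.
smh = np.array([2.0, -1.0, 3.0, -2.0, 1.0])
tlt = np.array([-1.0, 2.0, -2.0, 3.0, 0.0])

# Deviations from each asset's own average.
smh_dev = smh - smh.mean()   # average is 0.6
tlt_dev = tlt - tlt.mean()   # average is 0.4

n = len(smh)
variance_smh = np.sum(smh_dev ** 2) / (n - 1)       # square one asset's deviations
covariance = np.sum(smh_dev * tlt_dev) / (n - 1)    # multiply the two assets' deviations

# Library version: 2x2 covariance matrix, N-1 normalization by default.
cov_matrix = np.cov(smh, tlt)

print(variance_smh, covariance)
print(cov_matrix)
```

The diagonal of `cov_matrix` holds the two variances and the off-diagonal holds the SMH-TLT covariance, which is the same layout as the covariance table built in the next step.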
Step 3 — Build the full table: the covariance "matrix"
With 3 assets (SMH, TLT, QQQ), you compute the variance/covariance for every pair and arrange them in a table:
| | SMH | TLT | QQQ |
|---|---|---|---|
| SMH | 4.30 | -4.30 | +3.80 |
| TLT | -4.30 | 4.30 | -3.50 |
| QQQ | +3.80 | -3.50 | 3.90 |
(The QQQ numbers are illustrative — the key point is the sign pattern.)
How to read it:
- Diagonal (4.30, 4.30, 3.90) = each asset's variance (how jumpy on its own)
- SMH-TLT = -4.30 = strong negative covariance (they move opposite — good hedge)
- SMH-QQQ = +3.80 = strong positive covariance (they crash together — bad for diversification)
- TLT-QQQ = -3.50 = negative covariance (bonds hedge tech — good)
Reading this table:
- Diagonal (top-left to bottom-right): each asset's variance — how jumpy it is on its own
- Off-diagonal: covariance between each pair — positive means they move together, negative means they move opposite, near zero means independent
- Symmetric: cov(SMH,TLT) = cov(TLT,SMH), because "how SMH moves relative to TLT" is the same question as "how TLT moves relative to SMH"
That is the entire covariance matrix. It is a lookup table that answers, for any pair of assets: "do these two tend to crash together, hedge each other, or not care about each other?"
With 12 assets, the table is 12x12 = 144 entries, but because the matrix is symmetric (the bottom-left half mirrors the top-right half), many entries are copies. The unique entries are:
- 12 diagonal entries (variances): one per asset (SMH-SMH, TLT-TLT, QQQ-QQQ, etc.)
- 66 unique pairs above the diagonal: the number of ways to pick 2 assets from 12 $= 12 \times 11 / 2 = 66$ (SMH-TLT, SMH-QQQ, SMH-IWM, ..., GLD-ETH)
- 66 copies below the diagonal: mirrors of the 66 above (TLT-SMH = SMH-TLT, etc.)

Total unique numbers $= 12 + 66 = 78$
Step 4 — How the code computes it from real data
The code calculates the covariance matrix from historical daily returns (alpha_engine.py:534-535):
```python
# returns: daily returns, one column per asset (assumed to be a pandas DataFrame)
mean_ret = returns.mean() * 252   # annualized mean return per asset
cov = returns.cov() * 252         # annualized covariance matrix
```
- Daily returns — for each asset, compute the percentage change each day (e.g., the price went from 100 to 102, so the daily return is +2%)
- Standard deviation — measure how spread out those daily returns are over a 60-day rolling window. This is the daily volatility.
- Annualize — multiply by $\sqrt{252}$ (there are 252 trading days per year). The square root comes from a statistical property: variance scales linearly with time, so standard deviation scales with the square root. For example, if daily volatility is 1.5%, annual volatility is $1.5\% \times \sqrt{252} \approx 24\%$.
- Covariance matrix — `returns.cov()` computes pairwise covariances across all assets, then `* 252` annualizes (covariance is variance-like, so it scales linearly with time, not with the square root). The diagonal of this matrix contains each asset's variance; the off-diagonals contain pairwise covariances.
The formulas (the general versions of what we just computed by hand):
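The variance of asset $i$:
$$\sigma_i^2 = \frac{1}{N-1} \sum_{t=1}^{N} \left(r_{i,t} - \bar{r}_i\right)^2$$
Where $r_{i,t}$ is the return of asset $i$ on day $t$, $\bar{r}_i$ is that asset's average return, and $N$ is the number of days.
The covariance between assets $i$ and $j$:
$$\sigma_{ij} = \frac{1}{N-1} \sum_{t=1}^{N} \left(r_{i,t} - \bar{r}_i\right)\left(r_{j,t} - \bar{r}_j\right)$$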
This is exactly the SMH-TLT calculation above in compact notation — subtract each asset's mean, multiply the two deviations together, average. Instead of squaring one asset's deviation, you multiply the deviations of two different assets together. If both tend to be above their means on the same days (both positive deviations), the product is positive → positive covariance (they move together). If one tends to be above when the other is below (opposite signs), the product is negative → negative covariance (they move oppositely). If there is no consistent pattern, the positives and negatives cancel out → covariance near zero (independent).
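Notice that when $i = j$, the covariance formula becomes the variance formula: an asset's covariance with itself is its variance, which is why the variances sit on the diagonal of the matrix.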
The variance of 2 used in the simplified worked example below is chosen for clean arithmetic. In real markets, equity ETFs have annualized variances closer to 0.04-0.09 (corresponding to volatilities of 20-30%).
- Constraint: $w_1 + w_2 + w_3 = 1$
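Before expanding the objective function, it helps to clarify what the symbols mean, because the same letter $w$ is used both for a single weight (with a subscript) and for the whole list of weights (without one).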
A scalar is a single number. Each individual weight is a scalar:
- $w_1 = 0.583$ → "put 58.3% in Asset A"
- $w_2 = 0.083$ → "put 8.3% in Asset B"
- $w_3 = 0.333$ → "put 33.3% in Asset C"

These are just ordinary numbers — you can add, subtract, and multiply them the same way you would in normal arithmetic. The momentum scores are also scalars: $M_1 = 3$, $M_2 = 1$, $M_3 = 2$.
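A vector is a list of scalars stacked together. When you see $w$ without a subscript, it means the whole list at once: $w = [w_1, w_2, w_3]$.
A vector has only one dimension — it is either a single row or a single column of numbers. The superscript $\top$ (transpose) turns a column into a row, so $w^\top$ is the weights written as a row.
A matrix is a table of scalars with rows AND columns. The covariance matrix $\Sigma$ for the three assets is:
| | A | B | C |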
|---|---|---|---|
| A | 2 | 0 | 0 |
| B | 0 | 2 | 0 |
| C | 0 | 0 | 2 |
Why does the distinction matter? Because the risk formula $w^\top \Sigma w$ multiplies three objects with three different shapes:
- $w^\top$ = weights as a row: shape (1 x 3) — one row, three columns
- $\Sigma$ = covariance matrix: shape (3 x 3) — three rows, three columns
- $w$ = weights as a column: shape (3 x 1) — three rows, one column
- Result: (1 x 3) $\times$ (3 x 3) $\times$ (3 x 1) = (1 x 1) = a single number
How matrix dimension multiplication works: The rule is simple — the inner dimensions must match, and you keep the outer dimensions:
$$(\underline{1} \times \mathbf{3}) \cdot (\mathbf{3} \times \underline{3}) = (1 \times 3)$$
The bold 3's match (this is what makes the multiplication valid — you are pairing each column of the first with each row of the second). The underlined numbers (1 and 3) survive as the result's shape. If the inner numbers do not match, the multiplication is impossible.
Then multiply that (1 x 3) result by $w$, which has shape (3 x 1):
$$(\underline{1} \times \mathbf{3}) \cdot (\mathbf{3} \times \underline{1}) = (1 \times 1)$$
Again the bold 3's match, and the outer numbers (1 and 1) give a (1 x 1) result — a single number.
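Concrete example with actual numbers. Suppose $w = [0.5, 0.3, 0.2]$.
First, $w^\top \Sigma$: the weight row times the covariance matrix.
Row: $[0.5, 0.3, 0.2]$
| | A | B | C |
|---|---|---|---|
| A | 2 | 0 | 0 |
| B | 0 | 2 | 0 |
| C | 0 | 0 | 2 |

The rule: take the row, and multiply it against each column of the matrix one at a time. Each column produces one number in the result:
- Column 1 — multiply the row by the first column of the matrix, then add up: $0.5 \times 2 + 0.3 \times 0 + 0.2 \times 0 = 1.0$
- Column 2 — multiply the row by the second column, then add up: $0.5 \times 0 + 0.3 \times 2 + 0.2 \times 0 = 0.6$
- Column 3 — multiply the row by the third column, then add up: $0.5 \times 0 + 0.3 \times 0 + 0.2 \times 2 = 0.4$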
Result — shape (1 x 3). Three columns in the matrix produced three numbers in the result. Each number represents one asset's risk contribution to the portfolio:
| Asset A | Asset B | Asset C |
|---|---|---|
| 1.0 | 0.6 | 0.4 |
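Then, multiply that result by $w$ (the weights as a column):
| | Weight |
|---|---|
| A | 0.5 |
| B | 0.3 |
| C | 0.2 |

Result: $1.0 \times 0.5 + 0.6 \times 0.3 + 0.4 \times 0.2 = 0.50 + 0.18 + 0.08 = 0.76$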
The whole point of the sandwich is to take a list of simple numbers (your portfolio weights), run them through a table of 78 pairwise interactions (the covariance matrix), and collapse everything into one single number — total portfolio risk. That single number is what the optimizer tries to make as small as possible. In the code (alpha_engine.py:794):
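```python
# w^T Σ w is the portfolio variance; the square root converts it to volatility
port_vol = np.sqrt(np.dot(w.T, np.dot(cov_arr, w)))
```
By contrast, the momentum term $w^\top M$ needs no matrix: it is a plain dot product of two vectors.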
The weights vector: $w = [0.5, 0.3, 0.2]$
The momentum scores vector: $M = [3, 1, 2]$
Multiply each weight by the corresponding momentum score, then add up:
$$w^\top M = 0.5 \times 3 + 0.3 \times 1 + 0.2 \times 2 = 1.5 + 0.3 + 0.4 = 2.20$$
That single number (2.20) measures "how much total momentum is the portfolio exposed to?" A higher number means the portfolio is tilted toward high-momentum assets — which is what the optimizer wants. Since the optimizer minimizes the objective, this term is subtracted: minimizing "risk minus momentum" pushes risk down and momentum up.
Why risk needs a matrix but momentum does not: Risk depends on how assets interact with each other — SMH crashing together with QQQ is worse than SMH crashing while TLT rises. That pairwise interaction is what the covariance matrix captures, and why the weights must be multiplied twice (once for each side of every pair). Momentum, on the other hand, is an individual property — Asset A's momentum score does not depend on Asset B's. Each asset contributes independently, so a simple one-pass multiplication is enough.
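Both calculations fit in a few lines of NumPy. The sketch below uses the example weights $[0.5, 0.3, 0.2]$, the diagonal covariance matrix with variances of 2, and the momentum scores $[3, 1, 2]$ from the setup; the variable names are chosen for this illustration.

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])     # portfolio weights
cov = 2.0 * np.eye(3)             # covariance matrix: variances 2, covariances 0
M = np.array([3.0, 1.0, 2.0])     # momentum scores

# Risk: the two-step "sandwich" w^T Σ w.
row_times_matrix = w @ cov        # (1 x 3) times (3 x 3) -> three numbers
risk = row_times_matrix @ w       # (1 x 3) times (3 x 1) -> one number

# Momentum: a single pass, weight times score, summed.
momentum = w @ M

print(row_times_matrix)           # the intermediate row
print(risk, momentum)
```

The intermediate row is [1.0, 0.6, 0.4], the risk collapses to 0.76, and the momentum exposure is 2.2. The risk line needs two multiplications because every pair of assets contributes; the momentum line needs only one because each asset contributes on its own.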
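Putting it together: The objective function computes two numbers — portfolio risk ($w^\top \Sigma w = 0.76$ for the example weights) and portfolio momentum ($w^\top M = 2.20$) — and subtracts the second from the first: $0.76 - 2.20 = -1.44$. The optimizer searches for the weights that make this number as small as possible.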
Building the objective function step by step:
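The section above used concrete numbers ($w = [0.5, 0.3, 0.2]$). Here the weights stay as unknowns $w_1, w_2, w_3$, so the objective becomes a formula in those unknowns.
The full objective from the main README has three terms: risk, momentum, and entropy. For this example, entropy is omitted to keep the arithmetic simple. That leaves:
$$\text{minimize} \quad w^\top \Sigma w - w^\top M$$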
The minus sign is the key: the optimizer minimizes this expression. Minimizing "risk minus momentum" simultaneously pushes risk down and momentum up — because subtracting a larger momentum number makes the whole expression smaller. (This is explained in detail in the "Why minimization achieves three goals at once" subsection above.)
Now expand each piece with the actual numbers:
Risk term $w^\top \Sigma w$:
Step 1 — Multiply the covariance matrix by the weight column ($\Sigma w$). Each row of the matrix produces one number:
- Row 1: $2 \times w_1 + 0 \times w_2 + 0 \times w_3 = 2w_1$
- Row 2: $0 \times w_1 + 2 \times w_2 + 0 \times w_3 = 2w_2$
- Row 3: $0 \times w_1 + 0 \times w_2 + 2 \times w_3 = 2w_3$

Result: a column vector (one number per asset, stacked vertically):
- Asset A: $2w_1$
- Asset B: $2w_2$
- Asset C: $2w_3$

Step 2 — Multiply the weight row by that result ($w^\top (\Sigma w)$):
$$w_1 \times 2w_1 + w_2 \times 2w_2 + w_3 \times 2w_3$$
So:
$$w^\top \Sigma w = 2w_1^2 + 2w_2^2 + 2w_3^2$$
The zeros in the covariance matrix killed all the cross terms. To see where cross terms come from, suppose Asset A and Asset B have a non-zero covariance of 0.5 (they are somewhat correlated):
| A | B | C | |
|---|---|---|---|
| A | 2 | 0.5 | 0 |
| B | 0.5 | 2 | 0 |
| C | 0 | 0 | 2 |
Notice that 0.5 appears twice — at position (A,B) and at position (B,A) — because the covariance matrix is always symmetric.
Now redo Step 1 with this matrix — Row 1 picks up a
- Row 1: $2 \times w_1 + 0.5 \times w_2 + 0 \times w_3 = 2w_1 + 0.5w_2$
- Row 2: $0.5 \times w_1 + 2 \times w_2 + 0 \times w_3 = 0.5w_1 + 2w_2$
- Row 3: $0 \times w_1 + 0 \times w_2 + 2 \times w_3 = 2w_3$
And redo Step 2 — multiply each weight by its corresponding row result:
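- $w_1 \times (2w_1 + 0.5w_2) = 2w_1^2 + 0.5 w_1 w_2$
- $w_2 \times (0.5w_1 + 2w_2) = 0.5 w_1 w_2 + 2w_2^2$
- $w_3 \times (2w_3) = 2w_3^2$

Add them all up:

$$2w_1^2 + 2w_2^2 + 2w_3^2 + 0.5\,w_1 w_2 + 0.5\,w_1 w_2 = 2w_1^2 + 2w_2^2 + 2w_3^2 + 2(0.5)\,w_1 w_2$$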
The cross term
That is where the
Why the weights appear twice — the intuition behind the sandwich
The two multiplications capture two different things. Step 1 (
The cross term derivation above shows exactly why both weights are needed:
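Momentum term

$$3w_1 + 1w_2 + 2w_3$$

Subtract momentum from risk to get the objective:

$$(2w_1^2 + 2w_2^2 + 2w_3^2) - (3w_1 + 1w_2 + 2w_3)$$

Which simplifies to:

$$2w_1^2 + 2w_2^2 + 2w_3^2 - 3w_1 - w_2 - 2w_3$$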
Every number in this formula traces back to either the covariance matrix (the 2's in front of the squared terms) or the momentum scores (the 3, 1, 2 being subtracted). Nothing is arbitrary.
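As a quick numerical check of the expansion above, the following minimal sketch evaluates the 3-asset objective two ways: through the two-pass matrix "sandwich" and through the expanded polynomial. The helper names and the trial weights are illustrative assumptions, not taken from `alpha_engine.py`.

```python
# Illustrative sketch: helper names and the trial weights are assumptions.

def quadratic_form(w, cov):
    """Portfolio risk w^T Σ w, computed in the two passes described above."""
    sigma_w = [sum(c * wj for c, wj in zip(row, w)) for row in cov]  # Step 1: Σw
    return sum(wi * si for wi, si in zip(w, sigma_w))                # Step 2: w · (Σw)

def dot(a, b):
    """One-pass weighted sum, used for the momentum term."""
    return sum(x * y for x, y in zip(a, b))

cov_diag = [[2, 0, 0], [0, 2, 0], [0, 0, 2]]      # no correlations
cov_corr = [[2, 0.5, 0], [0.5, 2, 0], [0, 0, 2]]  # A and B covary by 0.5
momentum = [3, 1, 2]                              # scores for A, B, C
w = [0.5, 0.2, 0.3]                               # hypothetical weights summing to 1

objective_matrix = quadratic_form(w, cov_diag) - dot(momentum, w)
objective_expanded = (2 * w[0] ** 2 + 2 * w[1] ** 2 + 2 * w[2] ** 2
                      - 3 * w[0] - 1 * w[1] - 2 * w[2])

# The off-diagonal 0.5 appears twice, so it adds 2 * 0.5 * w1 * w2 to the risk.
cross_term = quadratic_form(w, cov_corr) - quadratic_form(w, cov_diag)
```

The two objective values agree, and the only difference between the correlated and uncorrelated risk is the single cross term in $w_1 w_2$.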
The 3-asset example above is intentionally simple. In the real portfolio with 12 assets and non-zero correlations, the same objective expands to 102 unique terms:
Risk term (
The diagonal terms (like
Momentum term (
Entropy term H(w) — one term per asset:
The full expanded objective combines all three:
The risk term is quadratic (
Important distinction — SLSQP is not machine learning. SLSQP is classical numerical optimization (math from the 1970s). It iterates, but it does not learn anything — it is solving an equation step by step, the same way Newton's method finds a square root. Given the same inputs, it always produces the same output. The machine learning part of this project is the two PPO agents (see Section E: PPO): a regime agent (a neural network that observes 25 macro features like VIX and yield curve spreads and learns which market regime we are in) and a weight agent (a neural network that observes 103 per-asset features and learns how to adjust the SLSQP weights). These are actual ML because they have neural networks that train on experience (50,000 episodes), update their parameters via gradient descent, and generalize to unseen market conditions. The full flow is: SLSQP (math, not ML) finds optimal weights using risk, momentum, and entropy → PPO agents (ML) observe market conditions and learn to adjust those weights.
Step 1 — Build the Lagrangian. The optimizer cannot just minimize the objective freely — it must obey the rule that all weights sum to 1 (you must invest 100% of your money, no more, no less). The Lagrangian is the trick that bakes this constraint directly into the formula: take the original objective and add
Step 2 — Take all partial derivatives and set to zero:
This is 4 equations with 4 unknowns — a linear system.
Step 3 — Solve by Gaussian elimination:
From equations 1-3, express each weight in terms of
Substitute into equation 4:
Back-substitute:
Verify:
Step 4 — Interpret the result:
| Asset | Momentum Score | Optimal Weight | Interpretation |
|---|---|---|---|
| A | 3 (strongest) | 58.3% | Highest momentum → highest allocation |
| C | 2 (middle) | 33.3% | Middle momentum → middle allocation |
| B | 1 (weakest) | 8.3% | Lowest momentum → lowest allocation |
The optimizer allocated capital proportional to momentum strength — exactly the desired behavior. The multiplier
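The hand calculation can be reproduced exactly with rational arithmetic. This is a minimal sketch for the special case used in this example (every asset has variance 2 and no correlations); the function name and the sign convention for the multiplier (objective plus `lam` times the constraint) are assumptions made for illustration.

```python
from fractions import Fraction

def solve_equal_variance(momentum, variance=2):
    """Minimize variance * sum(w_i^2) - sum(m_i * w_i) subject to sum(w_i) = 1.

    Assumed Lagrangian sign convention: L = objective + lam * (sum(w) - 1).
    Stationarity gives 2 * variance * w_i - m_i + lam = 0 for each asset.
    """
    n = len(momentum)
    k = 2 * variance
    # Substitute w_i = (m_i - lam) / k into sum(w_i) = 1 and solve for lam.
    lam = Fraction(sum(momentum) - k, n)
    weights = [(Fraction(m) - lam) / k for m in momentum]
    return weights, lam

weights, lam = solve_equal_variance([3, 1, 2])  # momentum scores for A, B, C
print([str(w) for w in weights], str(lam))
```

The fractions 7/12, 1/12 and 1/3 correspond to the 58.3%, 8.3% and 33.3% in the table.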
Connection to the real portfolio: The actual optimizer performs this same calculation with 12 assets instead of 3, a full 12x12 covariance matrix with non-zero correlations, and the nonlinear entropy term. Because entropy makes the objective non-quadratic, the quadratic approximation is not exact — so SLSQP must iterate, rebuilding and re-solving the subproblem 20–50 times until the answer converges.
Why entropy forces iteration: The risk term (
SLSQP iteration visualized: the orange parabola approximates the true blue curve at each point,
jumps to the parabola's minimum (green square), then re-approximates — converging to the true minimum at w = 1/e ≈ 0.368.
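The loop described in the caption can be sketched in one dimension for $f(w) = w \ln w$. This is a simplified illustration, not the SLSQP implementation: it uses the exact second derivative, whereas SLSQP builds its curvature estimate with BFGS updates and also handles constraints. The function name and starting point are assumptions.

```python
import math

def newton_minimize_w_ln_w(w0, tol=1e-12, max_iter=50):
    """Minimize f(w) = w * ln(w) by repeatedly minimizing a local parabola.

    f'(w) = ln(w) + 1 and f''(w) = 1 / w, so the parabola fitted at w has its
    minimum at w - f'(w) / f''(w).
    """
    w = w0
    path = [w]
    for _ in range(max_iter):
        step = (math.log(w) + 1.0) * w   # f'(w) / f''(w)
        w = w - step                     # jump to the parabola's minimum
        path.append(w)
        if abs(step) < tol:              # the answer has stopped changing
            break
    return w, path

w_star, path = newton_minimize_w_ln_w(0.2)  # arbitrary starting point in (0, 1)
```

Each pass through the loop is one "build a parabola, jump to its minimum" step, and the iterates settle at $1/e$ after a handful of rounds.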
At the solution, each inequality constraint is either active (the solution is pressed right against the boundary) or inactive (the solution is safely inside and the constraint has no effect).
The active set method solves this through intelligent guessing:
Step 1 — Guess which constraints are active. For example: "SMH is at the 30% cap, TAN is at the 0% floor, everything else is free."
Step 2 — Convert the guess into equality constraints. If SMH is active at 30%, set
Step 3 — Solve the reduced system (fewer variables, only equality constraints). This is the Lagrange multiplier problem from Layer 3 — solvable by hand.
Step 4 — Check the guess. Are any "free" weights outside their bounds? Is any active constraint being pushed the wrong way (indicated by its multiplier's sign)? If the guess is inconsistent, update the active set and return to Step 2.
Each iteration is just solving a system of linear equations. The process typically converges in 3-10 rounds.
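The guess, solve, check loop can be sketched on the 3-asset toy problem from earlier (variance 2 for every asset, no correlations). This is a deliberately simplified illustration under those assumptions, with a hypothetical 50% cap so that three assets can still sum to 100%; a production active-set solver is considerably more careful.

```python
def solve_with_bounds(momentum, variance=2.0, lower=0.0, upper=0.5, max_rounds=20):
    """Toy active-set loop: minimize variance*sum(w^2) - sum(m*w)
    subject to sum(w) = 1 and lower <= w_i <= upper."""
    n = len(momentum)
    k = 2.0 * variance
    fixed = {}                                  # Step 1: guess that no bound is active
    for _ in range(max_rounds):
        free = [i for i in range(n) if i not in fixed]
        budget = 1.0 - sum(fixed.values())      # Step 2: pinned weights become constants
        # Step 3: equality-constrained solve on the free weights only.
        lam = (sum(momentum[i] for i in free) - k * budget) / len(free)
        w = dict(fixed)
        for i in free:
            w[i] = (momentum[i] - lam) / k
        # Step 4a: pin the free weight that violates its bounds the most.
        worst, idx = max((max(lower - w[i], w[i] - upper), i) for i in free)
        if worst > 1e-12:
            fixed[idx] = upper if w[idx] > upper else lower
            continue
        # Step 4b: release a pinned weight whose multiplier has the wrong sign,
        # meaning the objective would fall if it moved back inside its bounds.
        released = False
        for i, bound in list(fixed.items()):
            grad = k * bound - momentum[i] + lam
            if (bound == upper and grad > 1e-12) or (bound == lower and grad < -1e-12):
                del fixed[i]
                released = True
                break
        if not released:
            return [w[i] for i in range(n)], sorted(fixed)
    raise RuntimeError("active set did not settle")

weights, active = solve_with_bounds([3, 1, 2])
```

With the cap at 50%, Asset A is pinned at the cap and the remaining 50% is split between B and C by the same Lagrange calculation as before.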
Everything above solves ONE quadratic subproblem — one parabola approximation at one point in weight-space. The full SLSQP process chains 20-50 of these subproblems together, where each depends on the previous answer:
- Stand at initial weights, build a parabola approximation, solve it (doable by hand)
- Walk to the answer, build a NEW parabola at the new position, solve it (doable by hand)
- Repeat until the answer stops changing (20-50 iterations)
Each iteration involves computing 12 partial derivatives, updating a 12×12 Hessian approximation, and solving a 13×13 linear system (12 weights + 1 Lagrange multiplier) via Gaussian elimination — roughly 2,000 arithmetic operations per step. At ~30 minutes per iteration by hand, and 20–50 iterations to converge, the full optimization would take 10–25 hours of continuous arithmetic: 40,000–100,000 multiply-add operations with no errors allowed, since each iteration's starting point depends on the previous iteration's answer.
The conceptual framework here — Lagrange multipliers, Hessian approximation via BFGS, active-set methods for inequality constraints — is upper-level undergraduate or graduate-level mathematics, the kind taught in financial derivatives and numerical optimization courses. These are not trivial topics. However, once the framework is set up, the execution at each iteration reduces to operations that are individually mechanical: partial derivatives, matrix-vector products, and Gaussian elimination. The difficulty is not any single operation — it is the volume and the interdependence. The computer completes the entire process in approximately 0.01 seconds.
| Level | Math required | Hand-solvable? | Time by hand |
|---|---|---|---|
| Bottom of a one-variable parabola | High school derivative | Yes | 30 seconds |
| 12-variable quadratic, no constraints | Solve 12 linear equations | Yes | 20 minutes |
| Add "weights sum to 1" | +1 Lagrange multiplier, 13 equations | Yes | 30 minutes |
| Add per-asset bounds (0%-30%) | Active set: guess, solve, check, repeat | Yes | 1-2 hours |
| Full SLSQP (20-50 iterations of the above) | Same math, repeated many times | Technically yes | 10-25 hours |
| The original nonlinear objective (entropy + momentum) | Cannot be solved directly | No | Impossible |
The genius of SLSQP is converting an impossible problem (nonlinear with constraints) into a long series of easy problems (quadratic with linear constraints). Each easy problem is just high school math applied repeatedly. The computer does not do hard math — it does easy math very fast, very many times.
Claude Shannon introduced information entropy in 1948 for information theory, specifically to measure "how much surprise is in a message?" It had nothing to do with finance. But the math turns out to be useful anywhere you want to measure how spread out something is.
In the code (alpha_engine.py:782):
```python
entropy = -np.sum(w_pos * np.log(w_pos))
```

You have 12 assets with weights $w_1, w_2, \ldots, w_{12}$:

- Take the weight (e.g., 0.30)
- Take the natural log of that weight ($\ln(0.30) = -1.20$)
- Multiply them together ($0.30 \times -1.20 = -0.36$)
- Do this for all 12 assets
- Add them all up
- Put a negative sign in front (to make the result positive)
This is the part that confuses everyone. Here is the intuition:
Then you multiply by
So
Portfolio A: All-in on one stock — SMH = 100%, everything else = 0%
Entropy = 0. Zero surprise. Zero diversity. You know exactly where all the money is.
(Note:
Portfolio B: Split between 2 stocks — SMH = 50%, QQQ = 50%
Entropy = 0.693. Some diversity.
Portfolio C: Equal across all 12 assets — each at 8.33%
Entropy = 2.485. Maximum diversity for 12 assets.
| Portfolio | Entropy | What it looks like |
|---|---|---|
| 100% in one stock | 0 | All eggs in one basket |
| 50/50 split | 0.693 | Two baskets |
| Equal 12-way split | 2.485 | Maximum spread |
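The three portfolios in the table can be checked with a few lines of standard-library Python. This is a minimal sketch; the function name is an assumption, and zero weights are skipped in the same spirit as the `w_pos` filter in the project code.

```python
import math

def shannon_entropy(weights):
    """H(w) = -sum(w_i * ln(w_i)), skipping zero weights (0 * ln 0 is taken as 0)."""
    return -sum(w * math.log(w) for w in weights if w > 0)

all_in = [1.0] + [0.0] * 11          # Portfolio A: 100% in one stock
two_way = [0.5, 0.5] + [0.0] * 10    # Portfolio B: 50/50 split
equal_12 = [1 / 12] * 12             # Portfolio C: equal 12-way split

for name, w in [("all-in", all_in), ("50/50", two_way), ("equal-12", equal_12)]:
    print(f"{name:9s} entropy = {shannon_entropy(w):.3f}")
```

The 50/50 result is $\ln 2$ and the equal-weight result is $\ln 12$, which is why the scale tops out at about 2.485 for 12 assets.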
Entropy goes from 0 (fully concentrated) to
Go back to the objective function:
That minus sign means the optimizer rewards higher entropy (more spread). Without it, the momentum term would happily shove 100% into the single best stock. The entropy term gently pushes back: "spread the money around a little."
But the entropy coefficient (alpha_engine.py:88) is tiny, so it is a whisper, not a shout. The momentum term easily overpowers it. The result: the portfolio concentrates in the top 3-4 winners but does not go full degenerate into just 1.
The difference is where they are used:
- Entropy goes inside the objective function as a smooth, differentiable penalty. The optimizer can compute its gradient and smoothly adjust weights. It is a soft nudge during optimization.
- Effective N is used as a hard check after optimization. It is a pass/fail gate: "does this portfolio look like at least 3 bets?" In the code it is computed as $e^H$, the exponential of the entropy (alpha_engine.py:1054-1055). The inverse-Herfindahl form $1/\sum w_i^2$ is a related but different measure; the two agree for an equal-weight portfolio, where both equal the number of assets.

```python
entropy = -np.sum(w_pos * np.log(w_pos)) if len(w_pos) > 0 else 0
effective_n = np.exp(entropy)
```

Entropy is the carrot (gentle reward for spreading). Effective N is the stick (reject the portfolio if it is too concentrated).
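To see how the gate behaves, the sketch below computes Effective N as $e^H$ and, for comparison, the inverse-Herfindahl value $1/\sum w_i^2$. The function names, the example portfolios, and the constant name are illustrative assumptions; only the threshold of 3 comes from the text.

```python
import math

MIN_EFFECTIVE_N = 3  # "at least 3 bets", as stated in the text

def effective_n_entropy(weights):
    """Effective N as exp(H), mirroring the two code lines quoted above."""
    positive = [w for w in weights if w > 0]
    entropy = -sum(w * math.log(w) for w in positive) if positive else 0.0
    return math.exp(entropy)

def effective_n_inverse_hhi(weights):
    """Alternative concentration measure: 1 / sum(w_i^2)."""
    return 1.0 / sum(w * w for w in weights)

concentrated = [0.5, 0.3, 0.2]          # hypothetical 3-asset portfolio
balanced = [0.25, 0.25, 0.25, 0.25]     # hypothetical equal 4-way split

for name, w in [("concentrated", concentrated), ("balanced", balanced)]:
    n_eff = effective_n_entropy(w)
    verdict = "pass" if n_eff >= MIN_EFFECTIVE_N else "reject"
    print(f"{name}: exp(H) = {n_eff:.2f}, 1/HHI = {effective_n_inverse_hhi(w):.2f} -> {verdict}")
```

Holding three assets is not enough on its own: the uneven 50/30/20 split counts as fewer than three effective bets under either measure.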
Shannon Entropy measures how spread out the portfolio weights are on a scale from 0 (fully concentrated) to
Code: The entire GBM simulation is implemented in the `MonteCarloSimulator` class (alpha_engine.py:1555-1621).
Before "Geometric," we need plain Brownian Motion. Imagine a drunk person stumbling on a straight road:
- Every second, they take one step
- The step is random — drawn from a bell curve (normal distribution)
- Each step is independent of the previous one
After 100 steps, their position is the sum of 100 random steps. This is Brownian Motion — a pure random walk. Mathematically:
The
If you model stock prices with regular Brownian Motion, you get nonsense. Say a stock is at 10. After enough negative random steps, it hits 0, then -5. A stock price of negative five dollars is meaningless.
The fix: instead of adding random dollar amounts, multiply by random percentage changes. A stock can drop 50%, then drop another 50% (now at 25% of original), then another 50% (12.5%)... it keeps halving forever but never hits zero.
This is the "Geometric" part — the randomness is multiplicative, not additive.
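The difference between the two kinds of randomness is easy to see in a small simulation. This sketch is illustrative only: the function names, step sizes, and seed are assumptions, and the multiplicative walk uses `exp(shock)` as its random growth factor so that every step multiplies the price by a positive number.

```python
import math
import random

def additive_walk(start, step_std, n_steps, rng):
    """Plain Brownian motion: add a random dollar amount each step."""
    price, path = start, [start]
    for _ in range(n_steps):
        price += rng.gauss(0.0, step_std)
        path.append(price)
    return path

def multiplicative_walk(start, pct_std, n_steps, rng):
    """Geometric version: multiply by a random positive growth factor each step."""
    price, path = start, [start]
    for _ in range(n_steps):
        price *= math.exp(rng.gauss(0.0, pct_std))
        path.append(price)
    return path

rng = random.Random(0)
additive = additive_walk(10.0, 1.0, 1000, rng)
geometric = multiplicative_walk(10.0, 0.05, 1000, rng)
print("lowest additive price: ", min(additive))
print("lowest geometric price:", min(geometric))
```

The additive path is free to wander below zero, while the geometric path can shrink toward zero but never cross it.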
In finance textbooks, GBM is written as a stochastic differential equation (SDE):

$$dS = \mu S\,dt + \sigma S\,dW$$
Here is what every symbol means:
- $dS$ = the infinitesimal change in stock price (how much it moves in a tiny instant)
- $S$ = current stock price
- $\mu$ = drift: the expected annual return (e.g., 0.20 for 20%/year)
- $\sigma$ = volatility: the annual standard deviation (e.g., 0.25 for 25%)
- $dt$ = a tiny slice of time
- $dW$ = a Wiener process increment: a tiny random shock from a bell curve. Think of it as the universe "rolling a bell-curve die" for the stock at every instant. Most rolls land near 0 (small moves, most days), some land at $\pm 1$ (moderate moves), rare ones hit $\pm 3$ (big moves). Formally: $dW = \sqrt{dt} \times Z$ where $Z \sim \mathcal{N}(0,1)$ is a standard normal random number. The $\sqrt{dt}$ scaling is critical: it ensures that randomness accumulates correctly over time. Over 252 trading days, the daily shocks add up to exactly one year's worth of volatility. Without the square root, short time steps would produce either too much or too little total randomness.
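The claim about the $\sqrt{dt}$ scaling can be checked empirically. The sketch below sums 252 daily shocks $\sqrt{dt} \times Z$ many times over and measures the variance of the yearly total; the function name, trial count, and seed are assumptions chosen for illustration.

```python
import math
import random
import statistics

def annual_shock(n_days, rng):
    """Sum of n_days daily Wiener increments, each sqrt(dt) * Z with dt = 1/n_days."""
    dt = 1.0 / n_days
    return sum(math.sqrt(dt) * rng.gauss(0.0, 1.0) for _ in range(n_days))

rng = random.Random(42)
totals = [annual_shock(252, rng) for _ in range(2000)]   # 2,000 simulated years
variance = statistics.pvariance(totals)
print(f"variance of the yearly total: {variance:.3f}")   # should be close to 1
```

A variance near 1 means the 252 small shocks add up to one standard normal draw per year, so multiplying by $\sigma$ gives one year's worth of volatility.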
The key insight: both terms are proportional to
Reading it as a sentence: "The change in price = (expected drift times price times time) + (random shock times price times volatility)"
Think of the stock price being pulled by two forces simultaneously:
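Force 1: The Drift — $\mu S\,dt$

This is the predictable, deterministic part. If there were zero randomness, the stock would grow smoothly at rate $\mu$.

If $S = 100$, $\mu = 0.20$, and $dt = 1/252$ (one trading day):

$$\mu S\,dt = 0.20 \times 100 \times \tfrac{1}{252} \approx 0.08$$

About 8 cents of upward drift per day. Boring but reliable.

Force 2: The Diffusion — $\sigma S\,dW$

This is the random part. Recall that $dW = \sqrt{dt} \times Z$ with $Z \sim \mathcal{N}(0,1)$.

So the random shock is:

$$\sigma S\,dW = \sigma \times S \times \sqrt{dt} \times Z$$

If $\sigma = 0.25$ and the day's draw is $Z = 1.5$:

$$0.25 \times 100 \times \sqrt{1/252} \times 1.5 \approx 2.36$$

The stock jumps up 2.36 dollars. Notice: the random part (2.36) completely dwarfs the drift (0.08). On any single day, the noise dominates. The drift only shows up over months and years. This is why daily stock charts look like chaos but long-term charts trend upward.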
The SDE
Why can't we just plug numbers into the SDE directly?
Look at the equation again:
The trick is to stop tracking
Tracking price
The right side contains
- Day 1: $S = 100$. Compute change using $S = 100$. Get $S = 103.2$.
- Day 2: use $S = 103.2$. Compute change using $S = 103.2$. Get $S = 101.7$.
- Day 3: use $S = 101.7$. Compute change using $S = 101.7$. Get $S = 104.1$.
- ... repeat 252 times. Every day depends on yesterday's answer.
Tracking log-price
The right side has no
The constant part — add up
Just like: walking at 5 km/h for 3 hours =
The random part — add up
Each day, the market generates one random shock:
What is
- Total variance $= 1 + 1 + \ldots + 1 = 252$
- Total standard deviation (typical size of the sum) $= \sqrt{252}$, not 252
So
Where does this
Why does
- $Z \sim \mathcal{N}(0, 1)$: variance = 1. Numbers typically between -1 and +1.
- $3Z$: variance $= 3^2 \times 1 = 9$. Numbers typically between -3 and +3.
- $\sqrt{252} \times Z$: variance $= (\sqrt{252})^2 \times 1 = 252$. That is $\mathcal{N}(0, 252)$.
The square root is needed because variance scales by
So
Plugging back in:
The
Combine both parts and move
One formula. One random number
Exponentiate to get the price:

$$S_T = S_0 \exp\left[\left(\mu - \tfrac{1}{2}\sigma^2\right)T + \sigma\sqrt{T}\,Z\right]$$
The tool that performs this conversion is Ito's Lemma — the chain rule from calculus, but adapted for random processes. In normal calculus, the chain rule tells you how a function of
We want to find how
Step 1 — State the general Ito's Lemma formula.
For any smooth function
"Smooth" means
Important notation clarification:
Where does the
The
But Ito's Lemma is about the change
The
Step 2 — Choose
Step 3 — Compute
We know
Expand using three rules of stochastic calculus. These rules are not arbitrary — each has a concrete reason:
- $(dt)^2 = 0$: a tiny number squared is negligible. If $dt = 0.004$ (one trading day), then $(dt)^2 = 0.000016$. Too small to matter. Same reason normal calculus ignores $(dx)^2$.
- $dt \cdot dW = 0$: recall from above that $dW = \sqrt{dt} \times Z$, so the typical size of $dW$ is $\sqrt{dt}$ (since $Z$ is usually around 1). Therefore $dt \cdot dW \approx dt \times \sqrt{dt} = dt^{3/2}$, which is even smaller than $dt$ itself. If $dt = 0.004$, then $dt^{3/2} = 0.00025$. Also negligible.
- $(dW)^2 = dt$: this is the weird one that makes stochastic calculus different. Since $dW = \sqrt{dt} \times Z$ (where $Z$ is a bell-curve random number), squaring gives $(dW)^2 = dt \times Z^2$. The key property: if you average $Z^2$ over many draws from the bell curve, the average is exactly 1. So $(dW)^2 = dt \times 1 = dt$. This is small but NOT zero: it survives and creates the extra term that doesn't exist in normal calculus.
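The three rules can be checked numerically by chopping one year into many small steps and adding up each quantity. This is a sketch under assumed settings (100,000 steps, a fixed seed, a hypothetical function name): if the rules hold, the $(dW)^2$ terms should add up to the total elapsed time of 1, while the other two sums should be negligible.

```python
import math
import random

def rule_sums(n_steps, horizon=1.0, seed=0):
    """Accumulate (dW)^2, dt*dW and (dt)^2 over n_steps slices of the horizon."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    sum_dw_sq = 0.0
    sum_dt_dw = 0.0
    for _ in range(n_steps):
        dw = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        sum_dw_sq += dw * dw      # behaves like the sum of dt, i.e. the horizon
        sum_dt_dw += dt * dw      # shrinks toward 0 as the steps get finer
    sum_dt_sq = n_steps * dt * dt  # equals horizon * dt, also shrinking toward 0
    return sum_dw_sq, sum_dt_dw, sum_dt_sq

sum_dw_sq, sum_dt_dw, sum_dt_sq = rule_sums(100_000)
print(sum_dw_sq, sum_dt_dw, sum_dt_sq)
```

Only the $(dW)^2$ sum survives at a meaningful size, which is why it is the one term that cannot be dropped.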
Applying these rules:
Only the
Step 4 — Plug everything into Ito's Lemma.
Simplify the second term:
Step 5 — Substitute
The
Step 6 — Combine the
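This is the result. The right side has no $S$: the change in log-price depends only on the constants $\mu$ and $\sigma$, the time step, and the random shock.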
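For reference, the six steps can be collected into one compact chain. This is the standard textbook derivation for a function of $S$ alone, written as a sketch in the notation used above; $f$ denotes the function to which Ito's Lemma is applied.

$$
\begin{aligned}
df &= \frac{\partial f}{\partial S}\,dS + \frac{1}{2}\,\frac{\partial^2 f}{\partial S^2}\,(dS)^2 \\
f(S) = \ln S:\quad \frac{\partial f}{\partial S} &= \frac{1}{S}, \qquad \frac{\partial^2 f}{\partial S^2} = -\frac{1}{S^2} \\
(dS)^2 &= (\mu S\,dt + \sigma S\,dW)^2 = \sigma^2 S^2\,dt \\
d(\ln S) &= \frac{1}{S}\,(\mu S\,dt + \sigma S\,dW) - \frac{1}{2}\,\frac{1}{S^2}\,\sigma^2 S^2\,dt \\
&= \left(\mu - \tfrac{1}{2}\sigma^2\right)dt + \sigma\,dW
\end{aligned}
$$

The third line uses the three rules from Step 3: the $(dt)^2$ and $dt \cdot dW$ pieces vanish, and $(dW)^2$ is replaced by $dt$.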
The
Forget the calculus. Here is why it has to exist:
Consider two scenarios over 2 days, both with 25% volatility:
- Path A: +25% then -25%: 100 to 125 to 93.75 (lost 6.25%)
- Path B: -25% then +25%: 100 to 75 to 93.75 (lost 6.25%)
The average return is 0% (one up, one down), but you lost money both ways. This asymmetry — percentage gains and losses do not cancel out — is the volatility drag. The
In the code, the drag is the `- 0.5 * sigma_daily ** 2` term (alpha_engine.py:1593).
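The two-day example is small enough to verify directly. The sketch below compounds both paths and compares the average log return with $-\tfrac{1}{2}\sigma^2$; the function name is an assumption, and the comparison is approximate because a two-step $\pm 25\%$ path is only a rough stand-in for a continuous process.

```python
import math

def compound(start, returns):
    """Apply a sequence of percentage returns one after another."""
    value = start
    for r in returns:
        value *= 1.0 + r
    return value

path_a = compound(100.0, [0.25, -0.25])   # up first, then down
path_b = compound(100.0, [-0.25, 0.25])   # down first, then up

avg_simple_return = (0.25 - 0.25) / 2                       # arithmetic average: 0
avg_log_return = (math.log(1.25) + math.log(0.75)) / 2      # what compounding feels
drag_estimate = -0.5 * 0.25 ** 2                            # the -(1/2) sigma^2 term

print(path_a, path_b, avg_log_return, drag_estimate)
```

Both paths end at 93.75, and the average log return is negative by roughly the amount the $-\tfrac{1}{2}\sigma^2$ correction predicts.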
Integrating the log-price equation over a discrete time step
Exponentiate both sides (because
Rearrange:
This is the exact formula used in the Monte Carlo simulation code (alpha_engine.py:1593-1610):
```python
drift = mu_daily - 0.5 * sigma_daily ** 2            # μ - ½σ² (volatility drag)
Z = np.random.standard_normal(self.n_simulations)    # 1M random draws from N(0,1)
daily_returns = np.exp(drift + sigma_daily * Z)      # the GBM formula
current_values *= daily_returns                      # S_{t+1} = S_t × exp(...)
```

Here is every piece labeled:
| Symbol | What it is | Example value |
|---|---|---|
| $S_t$ | Today's portfolio value | 100,000 |
| $S_{t+1}$ | Tomorrow's portfolio value | What we are computing |
| $\mu$ | Annualized expected return | 0.20 (20%) |
| $\sigma$ | Annualized volatility | 0.25 (25%) |
| $\mu - \frac{1}{2}\sigma^2$ | Drift corrected for vol drag | 0.20 - 0.03125 = 0.169 |
| $\Delta t$ | Time step as fraction of year | 1/252 = 0.00397 |
| $\sqrt{\Delta t}$ | Converts annual vol to daily | 0.063 |
| $Z$ | Random draw from $\mathcal{N}(0,1)$ | -1.2 (a bad day) |
| $\exp(\cdot)$ | Ensures price stays positive | Always > 0 |
Starting value: 100,000 dollars.
Step 1 — Drift component:
Step 2 — Random component:
Step 3 — Total exponent:
Step 4 — Exponentiate:
The portfolio dropped 1,806 dollars on this simulated day. Notice how the drift (+0.067%) was completely overwhelmed by the random shock (-1.89%).
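The four steps of this worked example fit in one short function. This is a minimal standard-library sketch using the example values from the table; the function name is an assumption and the code is not an excerpt from `alpha_engine.py`.

```python
import math

def gbm_step(value, mu, sigma, dt, z):
    """One discrete GBM step: value * exp((mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * z)."""
    drift = (mu - 0.5 * sigma ** 2) * dt     # Step 1: drift component
    shock = sigma * math.sqrt(dt) * z        # Step 2: random component
    exponent = drift + shock                 # Step 3: total exponent
    return value * math.exp(exponent), drift, shock   # Step 4: exponentiate

new_value, drift, shock = gbm_step(100_000.0, mu=0.20, sigma=0.25, dt=1 / 252, z=-1.2)
print(f"drift = {drift:+.5f}, shock = {shock:+.5f}, new value = {new_value:,.0f}")
```

Changing `z` to a positive draw of the same size shows the asymmetry discussed earlier: the gain on an up day is slightly larger in dollars than the loss on the matching down day.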
The code (alpha_engine.py:1558-1621) does this exact calculation 1,260 times (5 years times 252 trading days) for each path, and runs 1,000,000 paths in parallel using NumPy vectorization. Each path draws its own independent sequence of
After all paths complete, you have 1,000,000 final portfolio values. Sort them and you get:
- Mean = expected outcome
- 5th percentile = bad scenario (95% of paths did better)
- Percentage below starting value = probability of loss
- The full histogram = the complete probability distribution of your financial future
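A scaled-down version of this procedure shows how the summary statistics are obtained. The sketch below is an assumption-laden miniature: it uses plain Python instead of NumPy vectorization, 2,000 paths instead of 1,000,000, and a 1-year horizon instead of 5 years, so its numbers are illustrative only.

```python
import math
import random

def simulate_terminal_values(start, mu, sigma, n_days, n_paths, seed=0):
    """Final portfolio value of each simulated GBM path (daily steps, dt = 1/252)."""
    rng = random.Random(seed)
    dt = 1.0 / 252
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    finals = []
    for _ in range(n_paths):
        # Summing daily log-returns is equivalent to multiplying daily growth factors.
        log_growth = sum(drift + vol * rng.gauss(0.0, 1.0) for _ in range(n_days))
        finals.append(start * math.exp(log_growth))
    return finals

start_value = 100_000.0
finals = sorted(simulate_terminal_values(start_value, 0.20, 0.25, n_days=252, n_paths=2000))

mean_value = sum(finals) / len(finals)                        # expected outcome
p5 = finals[int(0.05 * len(finals))]                          # 5th percentile
prob_loss = sum(v < start_value for v in finals) / len(finals)

print(f"mean = {mean_value:,.0f}, 5th pct = {p5:,.0f}, P(loss) = {prob_loss:.1%}")
```

With more paths the estimates stabilize; the sorting step is what turns a pile of simulated endings into percentiles.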
GBM is the same model that underlies the Black-Scholes options pricing formula (Black & Scholes, 1973) — arguably the most famous equation in finance. Black-Scholes assumes stock prices follow GBM, then uses that assumption to derive a closed-form formula for the fair price of an option.
This project uses GBM for a completely different purpose. Black-Scholes asks: "given that prices follow GBM, what should an option cost?" This project asks: "given that prices follow GBM, what are the 1 million possible futures for this portfolio?" Same underlying model, different application:
| | Black-Scholes | This project |
|---|---|---|
| Uses GBM to | Derive a closed-form option price | Simulate 1M possible price paths |
| Output | One number (the option price) | A full probability distribution of outcomes |
| Method | Analytical (solve the equation) | Monte Carlo simulation (generate random paths) |
| Application | Options pricing and hedging | Portfolio risk assessment and tail risk analysis |
GBM models stock prices as a random walk in log-space. The price change each day comprises two components: a deterministic drift (expected return adjusted for volatility drag) and a stochastic diffusion (random shock scaled by volatility). The
The system has two decisions to make every rebalance day: (1) which regime are we in? and (2) what should the 12 portfolio weights be? Each decision has a classical approach and an RL approach — that is the "Dual" in Alpha Dual Engine.
| | Classical (Section B + rule-based regime) | RL (Section E: PPO) |
|---|---|---|
| Regime decision | Rule-based classifier: SPY > 200-SMA → RISK_ON, else check ml_prob | Regime agent: neural network observes 25 macro features → picks 1 of 3 regimes |
| Weight decision | SLSQP optimizer: solves risk − momentum − entropy equation | Weight agent: neural network observes 103 per-asset features → outputs 12 weights |
| How it decides | Solves equations (math, not learning) | Learns from 50,000 simulated episodes of trial and error |
| Deterministic? | Yes — same inputs always give same output | No — samples from distributions (exploration), but converges over training |
| Adapts over time? | No — fixed formulas | Yes — neural networks update parameters based on rewards |
Where does Section D fit? Section D (GBM / Monte Carlo) is not part of either decision. It runs after the weights are already chosen — it takes the final portfolio weights and simulates 1 million future price paths to assess risk (tail losses, drawdown probabilities, etc.). It is a downstream evaluation step, not a competing weight engine.
Why build two approaches? The classical path is the reliable baseline — SLSQP is mathematically guaranteed to find the optimal weights for its objective function, and the rule-based regime classifier is simple and interpretable. The RL path is the ambitious alternative — it can potentially learn patterns that fixed formulas cannot capture (like "when VIX spikes, rotate to bonds faster than the momentum signal suggests").
The Streamlit sidebar exposes two toggles under Advanced Options:
| Toggle | What it swaps | Regime | Weights |
|---|---|---|---|
| (neither) | — | Rule-based (SPY > 200-SMA) | SLSQP optimizer (default) |
| Use RL Agent (PPO) | Regime only | RL regime agent · Sec E | SLSQP optimizer |
| Use Hierarchical RL (Regime + Weights) | Both stages | Rule-based* | RL weight agent · Sec E |
* The hierarchical controller internally bypasses the RL regime agent and uses the rule-based classifier (regime allocation: 84.2% RISK_ON, 13.5% RISK_REDUCED, 2.3% DEFENSIVE). When "Use RL Agent (PPO)" is toggled on instead, the RL regime agent shows a RISK_REDUCED bias of 57.1% (36.2% RISK_ON, 6.8% DEFENSIVE) — it defaults to the cautious middle ground instead of going fully RISK_ON during bull markets. The rule-based classifier with its simple SPY > 200-SMA master switch remains more reliable. If both toggles are enabled, hierarchical takes precedence. Note: the RL weight agent does not enforce the winner-take-all crypto RSI rotation used by SLSQP — it can freely split the crypto bucket between BTC and ETH.
You have an agent that observes the world (market data), takes actions (pick a regime or pick portfolio weights), and receives rewards (excess Sharpe minus penalties). The goal: find the policy (the rule mapping observations to actions) that maximizes cumulative reward.
The Alpha Dual Engine has two PPO agents:
- High-level (discrete): observes 25-dim state and picks 1 of 3 regimes (RISK_ON / RISK_REDUCED / DEFENSIVE)
- Low-level (continuous): observes 103-dim state and outputs 12 portfolio weights
How are regimes actually classified? In production the RL regime agent is bypassed (it shows a 57.1% RISK_REDUCED / 36.2% RISK_ON / 6.8% DEFENSIVE distribution — too cautious compared to the rule-based baseline's 84.2% RISK_ON), so a rule-based classifier decides the regime. The logic is a two-node decision tree:
- Is SPY above its 200-day moving average? → RISK_ON. Full stop — no other checks. This is the master switch.
- SPY is below its 200-day SMA → check the XGBoost crash probability (`ml_prob`):
  - `ml_prob > 0.55` → RISK_REDUCED (the model thinks a crash is slightly more likely than not, so reduce exposure)
  - `ml_prob <= 0.55` → DEFENSIVE (not enough bullish signal to warrant risk)

There is no gradual spectrum: it is a hard decision tree. The 200-SMA trend is the dominant signal, and the ML probability is only a tiebreaker when the trend is already bearish. See `get_regime()` in `alpha_engine.py`.
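The two-node tree can be written out in a few lines. This is a hypothetical sketch of the logic as described above, not the actual `get_regime()` implementation: the function name, argument names, and the way the inputs are passed in are all assumptions; only the 200-day SMA rule and the 0.55 threshold come from the text.

```python
def classify_regime(spy_price, spy_sma_200, ml_prob, threshold=0.55):
    """Rule-based regime: trend check first, ML probability only as a tiebreaker."""
    if spy_price > spy_sma_200:
        return "RISK_ON"            # master switch: no other checks
    if ml_prob > threshold:
        return "RISK_REDUCED"
    return "DEFENSIVE"

# Example inputs are made up for illustration.
print(classify_regime(spy_price=520.0, spy_sma_200=480.0, ml_prob=0.10))
print(classify_regime(spy_price=450.0, spy_sma_200=480.0, ml_prob=0.70))
print(classify_regime(spy_price=450.0, spy_sma_200=480.0, ml_prob=0.40))
```

Note that `ml_prob` is never consulted when SPY is above its moving average, which is what makes the trend the dominant signal.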
"25-dim state" and "103-dim state" simply mean a list of 25 or 103 numbers that describe the current market conditions — each number is one "dimension" (one piece of information the agent can see). The 103 dimensions of the weight agent break down as:
| Dims | Count | What the agent sees |
|---|---|---|
| [0:3] | 3 | Regime one-hot (which regime the high-level agent chose) |
| [3:15] | 12 | Per-asset raw momentum |
| [15:27] | 12 | Per-asset volatilities |
| [27:39] | 12 | Per-asset RSI-14 |
| [39:51] | 12 | Per-asset above-SMA (binary: is the price above the 60-day moving average?) |
| [51:63] | 12 | Per-asset golden cross (binary: did the short-term SMA cross above the long-term?) |
| [63:75] | 12 | Per-asset information ratio |
| [75:87] | 12 | Per-asset 30-day log returns |
| [87:99] | 12 | Current portfolio weights |
| [99] | 1 | ML probability (XGBoost crash probability) |
| [100] | 1 | Current drawdown |
| [101] | 1 | Days since last rebalance |
| [102] | 1 | Portfolio value |
Each rebalance day, these 103 numbers are assembled into one vector and fed to the neural network. The agent's job is to look at all 103 numbers and output 12 weights.
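The index ranges in the table can be generated from a list of block sizes, which also confirms that they add up to 103. The block names below are hypothetical labels chosen for readability; the counts and their order come from the table.

```python
# Block names are illustrative; counts and order follow the table above.
WEIGHT_STATE_LAYOUT = [
    ("regime_one_hot", 3),
    ("momentum", 12),
    ("volatility", 12),
    ("rsi_14", 12),
    ("above_sma", 12),
    ("golden_cross", 12),
    ("information_ratio", 12),
    ("log_return_30d", 12),
    ("current_weights", 12),
    ("ml_prob", 1),
    ("drawdown", 1),
    ("days_since_rebalance", 1),
    ("portfolio_value", 1),
]

def build_slices(layout):
    """Turn (name, count) pairs into index slices and return the total length."""
    slices, start = {}, 0
    for name, count in layout:
        slices[name] = slice(start, start + count)
        start += count
    return slices, start

slices, total_dims = build_slices(WEIGHT_STATE_LAYOUT)
print(total_dims, slices["rsi_14"], slices["portfolio_value"])
```

Given a 103-element state vector `state`, `state[slices["rsi_14"]]` would then pick out the 12 RSI values.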
The 25 dimensions of the regime agent break down as:
| Dims | Count | What the agent sees |
|---|---|---|
| [0:7] | 7 | ML features (realized vol, vol momentum, equity risk premium, trend score, 21d momentum, QQQ vs SPY, TLT momentum) |
| [7:13] | 6 | Portfolio-aggregate (equity weight, safe-haven weight, crypto weight, normalized portfolio value, days since rebalance, drawdown) |
| [13:20] | 7 | Cross-asset summaries (mean equity momentum, pct above SMA, mean volatility, BTC golden cross, TLT above SMA, mean information ratio, ML probability) |
| [20:25] | 5 | Recent performance (5d / 21d / 63d portfolio returns, 21d benchmark returns, 21d excess returns) |
Notice the difference: the weight agent sees per-asset detail (12 numbers per feature, one for each asset), while the regime agent sees aggregate summaries (one number for the whole market). This makes sense: the regime agent only needs to answer "what's the big picture?" (risk-on / risk-reduced / defensive), so it gets macro-level signals. The weight agent needs to decide how much of each specific asset to hold, so it gets asset-level signals.
The math is the same for both agents. Here is the full derivation from scratch.
Softmax appears twice in this system: once in the regime agent (turning logits into regime probabilities) and once in the weight agent (turning raw samples into portfolio weights). Both uses solve the same problem: you have a list of arbitrary numbers that can be negative, huge, or tiny, and you need to convert them into positive numbers that sum to exactly 1.
The formula:
Two steps: (1) raise
Why
Worked example — 3 assets to keep it simple (the real system does 12):
| Asset | Raw value $x_i$ | $e^{x_i}$ | Softmax = $e^{x_i}$ / total |
|---|---|---|---|
| SMH | 1.2 | 3.32 | 0.613 (61.3%) |
| TLT | -0.5 | 0.61 | 0.112 (11.2%) |
| GLD | 0.4 | 1.49 | 0.275 (27.5%) |
| Total | | 5.42 | 1.00 (100%) |
Key properties:
- Negative inputs still get positive output. TLT had $x = -0.5$, but $e^{-0.5} = 0.61$ is still positive: softmax never produces zero.
- Bigger gaps → more lopsided output. SMH's raw value (1.2) is only 1.7 more than TLT's (-0.5), but it gets about 5.5× the weight ($e^{1.7} \approx 5.5$). The exponential amplifies differences.
- Equal inputs → even split. If all 12 raw values were identical, softmax would give each asset exactly $1/12 \approx 8.3\%$.
- Softmax is deterministic. Same inputs always produce the same outputs; there is no randomness here.
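The worked example and the properties above can be reproduced with a standard-library softmax. This is a minimal sketch; the function name is an assumption, and it subtracts the maximum before exponentiating, which leaves the result unchanged but avoids overflow for large inputs.

```python
import math

def softmax(values):
    """Convert arbitrary real numbers into positive numbers that sum to 1."""
    shift = max(values)                            # stability shift, cancels out below
    exps = [math.exp(v - shift) for v in values]   # step 1: raise e to each value
    total = sum(exps)
    return [e / total for e in exps]               # step 2: divide by the total

weights = softmax([1.2, -0.5, 0.4])                # SMH, TLT, GLD
ratio = weights[0] / weights[1]                    # how lopsided SMH vs TLT is
even = softmax([0.7] * 12)                         # identical inputs

print([round(w, 3) for w in weights], round(ratio, 2))
```

The ratio between any two outputs depends only on the gap between their raw values, which is why a gap of 1.7 always yields the same multiple regardless of the other assets.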
In the actual code, the weight agent's softmax is at rl_weight_agent.py:484-487:
```python
z_shifted = z - z.max()           # numerical trick to prevent overflow
exp_z = np.exp(z_shifted)         # step 1: raise e to each value
weights = exp_z / exp_z.sum()     # step 2: divide by total
```

The policy is the agent's decision-making rule — it looks at market data and decides what to do. Concretely, it is a neural network: a function with thousands of internal numbers (called parameters, written as
Discrete case (regime agent): The network outputs logits
This is the same softmax operation explained in the subsection above: it converts 3 raw logits into 3 probabilities that sum to 1. If the logits are $[2.0, 0.5, -1.0]$, softmax gives approximately $[0.79, 0.18, 0.04]$, so the agent strongly prefers RISK_ON.
Reading the notation:
Continuous case (weight agent): The regime agent picks from 3 options — that is a discrete choice. But the weight agent outputs 12 continuous numbers (portfolio weights). The policy itself cannot be softmax here because there is no finite menu to pick from — the agent needs to output any combination of 12 numbers. So it uses bell curves instead: the network outputs a mean
The network outputs two things:
- $\mu$: the mean vector, a list of 12 numbers representing "my best guess" for each weight. This is the center of 12 bell curves.
- $\log{\sigma}$: the log standard deviation, which sets how wide each bell curve is. Big $\sigma$ = "I'm unsure, try wildly different values." Small $\sigma$ = "I'm pretty confident, stay close to $\mu$."
Step 1 — Sample from the bell curves. For each of the 12 assets, generate a random number near the network's best guess:
What does "draw from a bell curve" actually mean? It happens in two steps. First, the computer generates a standard random number — a number from a bell curve centered at 0 with width 1. This is a built-in function that every programming language provides (like np.random.randn() in Python), the same way a calculator has a built-in square root button. Second, it shifts and scales: multiply by
In the actual code, this is a single line (rl_weight_agent.py:477):
```python
z = mu_np + std_np * np.random.randn(self.n_assets)
```

`np.random.randn(12)` generates 12 standard random numbers, `std_np *` scales them, and `mu_np +` shifts them.
That's it. If
Example: suppose the network's best guess for SMH is
Step 2 — Softmax the samples to get actual weights that sum to 1:
The 12 raw samples from Step 1 are arbitrary numbers — they can be negative and don't sum to 1. Softmax (explained in the subsection above) converts them into valid portfolio weights: all positive and summing to exactly 1. The randomness doesn't come from softmax — softmax is deterministic. The randomness came from Step 1, where each
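Steps 1 and 2 together can be sketched with the standard library. The function name, the seed, and the flat `mu` and `log_std` values are assumptions for illustration; in the real agent both vectors come from the neural network.

```python
import math
import random

def sample_weights(mu, log_std, rng):
    """Step 1: draw z_i = mu_i + sigma_i * Z_i. Step 2: softmax z into weights."""
    z = [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_std)]
    shift = max(z)                                # stability shift for the softmax
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(7)
mu = [0.0] * 12          # hypothetical "best guess": no asset preferred
log_std = [0.0] * 12     # sigma = exp(0) = 1 for every asset

draw_1 = sample_weights(mu, log_std, rng)
draw_2 = sample_weights(mu, log_std, rng)   # same inputs, different random draw
print([round(w, 3) for w in draw_1])
print([round(w, 3) for w in draw_2])
```

Both draws are valid portfolios, yet they differ, because the randomness enters in the sampling step and the softmax only normalizes whatever was sampled.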
This is how the agent explores. Here is what "trying" looks like concretely: the agent sees today's market data, draws random weights from its bell curves, and those weights are used to run a simulated quarter of trading. At the end of the quarter, the simulation produces a reward (based on Sharpe ratio minus drawdown penalties). If the random draw happened to put 20% in SMH and that worked well, the agent nudges
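Does $\sigma$ change during training? Yes. Two forces in the loss pull on it in opposite directions (rl_weight_agent.py:1127):

Force 1 — The policy gradient wants $\sigma$ to shrink. This is `policy_loss` at line 1119: it computes how much better or worse each action was compared to average, and pushes the policy to repeat the good ones. Repeating good actions means less randomness, which means smaller $\sigma$.

Force 2 — The entropy bonus wants $\sigma$ to stay large.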
Not the same entropy as Section C. The word "entropy" appears in two completely different places in this system. Section C uses Shannon entropy $H(\mathbf{w}) = -\sum w_i \ln w_i$, which measures how spread out the portfolio weights are (diversification). Here in Section E, the entropy is Gaussian entropy $H = \frac{1}{2}(1 + \ln{2\pi}) + \ln{\sigma}$, which measures how wide the agent's bell curves are (exploration randomness). They share the name because both come from information theory (Shannon's idea of measuring "uncertainty"), but they measure different things: one asks "is the money spread across many assets?", the other asks "is the agent still trying new things?"
In the code (line 1124):

```python
entropy = 0.5 * n_assets * (1 + log(2*pi)) + sum(log_std)
```

Where does this formula come from? It is the textbook entropy of a Gaussian (bell curve) distribution. For a single bell curve with width $\sigma$:

$$H = \frac{1}{2}(1 + \ln{2\pi}) + \ln{\sigma}$$
This comes from the definition of continuous entropy — "take the bell curve formula, multiply each height by its own log, integrate over the entire curve." The integral has a known closed-form answer (derived in any probability textbook), and the result is the formula above. The important thing is what each piece means:
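| Piece | Value | What it does |
|---|---|---|
| $\frac{1}{2}(1 + \ln 2\pi)$ | ≈ 1.42 | A constant: the same for every bell curve regardless of width. Does not change during training. |
| $\ln \sigma$ | varies | The only part that changes. When $\sigma$ grows, $\ln \sigma$ and the entropy grow; when $\sigma$ shrinks, both fall. |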
So entropy is really just tracking
The weight agent has 12 independent bell curves (one per asset), so the total entropy is the sum across all 12:

$$H_{\text{total}} = \frac{12}{2}(1 + \ln 2\pi) + \sum_{i=1}^{12} \ln \sigma_i$$
Which is exactly what the code computes: 0.5 * n_assets * (1 + log(2*pi)) + sum(log_std) where n_assets = 12 and log_std is
How the entropy bonus enters the loss. The loss function subtracts this entropy term (line 1127):
total_loss = policy_loss + vf_coef * value_loss - ent_coef * entropy

The minus sign means: higher entropy → lower loss → the optimizer is rewarded for keeping the bell curves wide. The coefficient ent_coef = 0.10 (line 1080) controls how strongly this force pushes back. At 0.10, it is a moderate nudge, not an overwhelming force.
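The entropy term and its effect on the loss can be checked numerically. The sketch below is a plain-Python illustration, not the code from rl_weight_agent.py: the function name `gaussian_entropy` and the two example `log_std` settings are assumptions made for the demonstration, while `ent_coef = 0.10` and the 12 assets come from the text.

```python
import math

def gaussian_entropy(log_std):
    """Total entropy of independent Gaussians, given each curve's ln(sigma)."""
    n_assets = len(log_std)
    return 0.5 * n_assets * (1.0 + math.log(2.0 * math.pi)) + sum(log_std)

ent_coef = 0.10                     # coefficient quoted in the text
log_std = [0.0] * 12                # 12 bell curves, sigma = 1 each (illustrative)

entropy_wide = gaussian_entropy(log_std)
entropy_narrow = gaussian_entropy([-1.0] * 12)   # sigma = e^-1 each (illustrative)

# Contribution of the entropy term to total_loss is -ent_coef * entropy,
# so wider curves give a lower (better) loss contribution.
bonus_wide = -ent_coef * entropy_wide
bonus_narrow = -ent_coef * entropy_narrow

# Finite-difference check: nudge one ln(sigma) and watch the loss term move.
eps = 1e-6
bumped = list(log_std)
bumped[3] += eps
grad_j = (-ent_coef * gaussian_entropy(bumped) - bonus_wide) / eps

print(f"entropy (wide)   = {entropy_wide:.4f}")
print(f"entropy (narrow) = {entropy_narrow:.4f}")
print(f"d(loss term)/d(log_std[3]) = {grad_j:.4f}")
```

The finite-difference slope comes out at about -0.10 whichever asset index is bumped and whatever the starting widths are, because the entropy is linear in each $\ln \sigma_j$.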
The gradient — why it's a constant push. Let's work out exactly how the entropy bonus pushes on
Now take the derivative with respect to a single asset's
Step by step:
- The constant $\frac{12}{2}(1 + \ln 2\pi)$ doesn't depend on $\ln \sigma_j$ at all, so its derivative is 0 and it vanishes.
- The sum $\ln \sigma_1 + \ln \sigma_2 + \cdots + \ln \sigma_{12}$ contains exactly one term with $\ln \sigma_j$. The derivative of $\ln \sigma_j$ with respect to itself is 1. All the other terms ($\ln \sigma_1, \ln \sigma_2$, etc.) don't contain $\ln \sigma_j$, so they also vanish.
What's left:
This is the same for every asset, at every point in training, regardless of what log_std. In gradient descent, a negative gradient means "increasing this parameter decreases the loss" — so the optimizer is always nudged toward making
Why bother keeping
log_std initialized to 0, so
Step 3 — Compute the log probability of the specific sample. This asks: "how likely was this particular roll?" It is the standard bell curve (Gaussian) formula, written in log form:
Each of the three parts inside the sum has a concrete meaning:
| Part | Plain English |
|---|---|
| How far was the roll from the center, in standard deviations? Farther = less likely. | |
| Wider bell curves spread probability thinner, so any specific value is less likely. | |
| A constant that makes the math work out (the bell curve normalization factor). |
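The three parts described in the table map onto three terms of a standard diagonal-Gaussian log-density. The sketch below is a plain-Python illustration under that assumption; the function name `gaussian_log_prob`, its argument names, and the sample values are hypothetical and not taken from rl_weight_agent.py.

```python
import math

def gaussian_log_prob(sample, mean, log_std):
    """Log probability of one sample under independent Gaussians (one per asset)."""
    total = 0.0
    for x, mu, ls in zip(sample, mean, log_std):
        sigma = math.exp(ls)
        distance_term = -0.5 * ((x - mu) / sigma) ** 2   # how far from the center, in std devs
        width_term = -ls                                 # wider curve, thinner probability
        norm_term = -0.5 * math.log(2.0 * math.pi)       # normalization constant
        total += distance_term + width_term + norm_term
    return total

center = [0.0] * 12
unit_width = [0.0] * 12          # ln(sigma) = 0, so sigma = 1

near = gaussian_log_prob([0.1] * 12, center, unit_width)
far = gaussian_log_prob([2.0] * 12, center, unit_width)
narrow_center = gaussian_log_prob(center, center, unit_width)
wide_center = gaussian_log_prob(center, center, [1.0] * 12)   # sigma = e

print(f"near roll: {near:.3f}   far roll: {far:.3f}")
print(f"sigma = 1: {narrow_center:.3f}   sigma = e: {wide_center:.3f}")
```

A roll far from the center scores a lower log probability than one near it, and the same roll scores lower under a wider curve, which is the behaviour the table describes.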
Why does PPO need this probability? Because PPO learns by asking "that action got a good reward — was it a lucky fluke (low probability roll) or my deliberate choice (high probability roll)?" The ratio of new probability to old probability is exactly the
As training progresses, rl_weight_agent.py.
Go back to the network architecture from Step 1. The policy head outputs actions (regime logits or portfolio weights) — that is the Actor, the part that decides what to do. But the same network has a second output that we have not discussed yet: the value head. It outputs a single number
Why does the agent need this? Consider an analogy: a poker player who only remembers whether each hand won or lost, but not whether the hand was expected to win. They cannot distinguish a smart fold from a cowardly one, or a lucky bluff from a skilled read.
Concrete example: Suppose the market is in a strong uptrend.
How the network is structured. Both agents use a shared trunk — two hidden layers that process the observation into a compressed representation — with two separate output branches:
Observation (103 numbers for weight agent, 25 for regime agent)
│
▼
┌──────────────────────┐
│ Shared Layer 1 │ 103 → 128 neurons (weight agent)
│ tanh activation │ 25 → 64 neurons (regime agent)
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Shared Layer 2 │ 128 → 128 neurons (weight agent)
│ tanh activation │ 64 → 64 neurons (regime agent)
└──────────┬───────────┘
│
┌─────┴─────┐
▼ ▼
┌─────────┐ ┌─────────┐
│ Policy │ │ Value │
│ Head │ │ Head │
│ (Actor) │ │(Critic) │
└─────────┘ └─────────┘
│ │
▼ ▼
Actions V(s)
- Policy head (Actor): outputs the action, either 3 regime logits (rl_regime_agent.py:568) or 12 weight means $\mu$ (rl_weight_agent.py:448)
- Value head (Critic): outputs a single number $V(s)$, the estimated future reward from this state (rl_weight_agent.py:449)
Why share layers? The first two layers learn to compress the raw observation into useful features ("the market is trending up", "volatility is spiking"). Both the actor and the critic need this same understanding of the market state, so sharing the computation is efficient. The heads then specialise: the actor uses those features to pick actions, the critic uses them to estimate value.
What does rl_weight_agent.py:455-458), the value head is:
h = tanh(shared1(x)) # 103 inputs → 128 neurons → squish
h = tanh(shared2(h)) # 128 → 128 → squish
value = value_head(h)      # 128 → 1 number

What does each layer actually do? Think of it as an assembly line with 3 stations:
Station 1 — Take the 103 raw market numbers (momentum, volatility, RSI, etc. for each of the 12 assets, plus portfolio state). Multiply each by a learned weight, add them up in 128 different combinations (each combination emphasizes different inputs), add a bias to each, then squish every result through
"Weights" means two different things here. The word "weights" appears in two completely unrelated contexts in this system. Portfolio weights are the 12 numbers that say "put 15% in SMH, 10% in TLT, ..." — these are the output of the weight agent, the final allocation. Network weights (
$W_1, W_2, W_3$ ) are the ~33,000 internal parameters inside the neural network that determine how it processes inputs — these are the knobs that gradient descent adjusts during training. When Station 1 says "multiply each by a learned weight," it means the network weights (the internal knobs), not portfolio allocations. The ML world calls them "weights" because they weight how much each input matters; the finance world calls allocations "weights" because they weight how much money goes to each asset. Same word, different concept.
Why 128? It is a design choice — the hidden parameter at rl_weight_agent.py:443. It could be 64, 256, or any number. Too small (say 16) and the network cannot learn complex patterns from 103 inputs — it is forced to throw away too much information. Too large (say 1024) and the network has far more capacity than needed, trains slower, and is more likely to memorize the training data instead of learning general patterns (overfitting). 128 is a common middle ground for RL problems of this size. The regime agent uses 64 because it only has 25 inputs — smaller input, smaller hidden layer needed. There is no deep math here — it is a hyperparameter picked by trial and error.
Station 2 — Take those 128 intermediate numbers and do the same thing again: 128 new combinations, new biases, squish through
Station 3 — Take the 128 numbers from Station 2, multiply each by one final weight, add them up into a single number, add a bias. That single number is
Chaining all three stations together gives the full formula:

$$V(s) = W_3 \tanh\big(W_2 \tanh(W_1 s + b_1) + b_2\big) + b_3$$
That is roughly 33,000 learned parameters (value_loss = (V(s) - actual_return)² part of the loss function.
There is no way to look at those 33,000 numbers and intuit what the function "means." It is a black box that learned to map market observations to expected rewards. That is the nature of neural networks.
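The three stations can be sketched in plain Python to make the data flow concrete. This is only an illustration under stated assumptions: the layer sizes (103 → 128 → 128 → 1) come from the text, but the helper names (`make_layer`, `linear`, `value_forward`), the random initialization, and the random observation are hypothetical stand-ins for the trained network.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights in [-0.1, 0.1] and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(layer, x):
    """Weighted sum of the inputs plus a bias, once per output neuron."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def value_forward(obs, shared1, shared2, value_head):
    h = [math.tanh(v) for v in linear(shared1, obs)]   # Station 1: 103 -> 128, squish
    h = [math.tanh(v) for v in linear(shared2, h)]     # Station 2: 128 -> 128, squish
    return linear(value_head, h)[0]                    # Station 3: 128 -> 1 number

rng = random.Random(0)
shared1 = make_layer(103, 128, rng)
shared2 = make_layer(128, 128, rng)
value_head = make_layer(128, 1, rng)

obs = [rng.gauss(0.0, 1.0) for _ in range(103)]        # stand-in for market features
v = value_forward(obs, shared1, shared2, value_head)
print(f"V(s) for this random observation: {v:.4f}")
```

With untrained weights the output is meaningless; training adjusts the weights so that this single number tracks the return the agent actually earns.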
What does
Single tanh (1 layer): Two stacked tanh (2 layers):
1 ┤ ──────── ┤ ╭──╮
│ / │ / \ ╭───
│ / │ / \ /
0 ┤──────/ 0 ┤───╯ ╰──╯
│ / │
│ / │
-1 ┤──── -1 ┤
└────────────────── └──────────────────
With 1 layer,
What is
A few examples:
| Input $x$ | $\tanh(x)$ | What happened |
|---|---|---|
| 0 | 0 | Zero stays zero |
| 1 | 0.76 | Positive, but pulled toward 1 |
| 3 | 0.995 | Almost 1 — big positives get clamped |
| -2 | -0.96 | Almost -1 — big negatives get clamped |
| 100 | 1.00 | Completely saturated at the ceiling |
The shape looks like an S-curve: flat at the top (+1), flat at the bottom (-1), steep in the middle around 0.
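The table values can be reproduced with the standard library. A minimal check, using the same five inputs as the table:

```python
import math

inputs = [0, 1, 3, -2, 100]
outputs = [round(math.tanh(x), 3) for x in inputs]

for x, y in zip(inputs, outputs):
    print(f"tanh({x:>4}) = {y:+.3f}")
```

Every output stays inside the interval from -1 to +1, however large the input.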
Why does the network need it? Without
Why
This is called the Actor-Critic architecture. The Actor decides what to do. The Critic judges how good the current situation is — not how good the action was (that comes from the advantage in Step 3).
Suppose the agent picks portfolio weights and earns a reward of 2.5. Is that good? It depends on context. If the market was booming and the agent usually earns 3.0 in booms, then 2.5 is actually below average — the agent did worse than expected. But if the market was crashing and the agent usually earns 1.0 in crashes, then 2.5 is excellent — the agent found a great allocation despite bad conditions.
The advantage captures exactly this: how much better (or worse) was this specific action compared to what the agent normally gets from this situation?
If
This is why we built
The simplest way to measure advantage is to look at what happened in a single step. At step $t$, the TD (temporal-difference) error is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

| Piece | What it is |
|---|---|
| $r_t$ | The actual reward the agent earned this step |
| $\gamma V(s_{t+1})$ | The critic's estimate of all future rewards from the new state, discounted |
| $V(s_t)$ | The critic's original estimate before the agent acted |
Think of it this way. Before the agent acts, the critic looks at the current state and makes a guess: "from here, I expect a total of
Then the agent acts, and one step of reality plays out. The agent earns an actual reward
The TD error asks: "was the updated picture (actual reward + new estimate of the future) better or worse than the original estimate?" If it's higher, the situation turned out better than predicted — the action was above average. If it's lower, worse than predicted.
The
In the code (rl_weight_agent.py:554):
delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]

The next_non_terminal part handles episode endings: if the episode is over (done = True), there is no future, so next_non_terminal = 0 and the future term drops out.
Concrete example: The agent is in a bull market. The critic says
Positive — the outcome (reward + estimated future = 2.285) was slightly better than predicted (2.0). The action was a bit above average.
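The one-step calculation can be sketched as a small function. The function name `td_error` and the specific inputs (reward 0.305, next-state value 2.0) are assumptions chosen so that reward plus discounted future equals the 2.285 quoted above; only gamma = 0.99 and the original estimate of 2.0 come from the text.

```python
def td_error(reward, value, next_value, done, gamma=0.99):
    """One-step TD error: (reward + discounted next estimate) - original estimate."""
    next_non_terminal = 0.0 if done else 1.0   # no future once the episode is over
    return reward + gamma * next_value * next_non_terminal - value

# Hypothetical inputs: 0.305 + 0.99 * 2.0 = 2.285, compared with a prediction of 2.0
delta = td_error(reward=0.305, value=2.0, next_value=2.0, done=False)

# Same step, but the episode ends here, so the future term drops out
delta_terminal = td_error(reward=0.305, value=2.0, next_value=2.0, done=True)

print(f"delta (episode continues) = {delta:+.3f}")
print(f"delta (episode ends)      = {delta_terminal:+.3f}")
```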
The TD error
The other extreme is to wait until the entire episode finishes and use the total real return instead of the critic's estimate. This captures all consequences but is very noisy — random events later in the episode contaminate the signal about whether this specific action was good.
Generalized Advantage Estimation (GAE) is a compromise: look at TD errors from multiple future steps, but weight nearby steps more heavily than distant ones:

$$A_t = \sum_{k=0}^{\infty} (\gamma \lambda)^k \, \delta_{t+k}$$
Where do $\gamma$ and $\lambda$ come from? They are set in the code (rl_weight_agent.py:1077-1078):

gamma = 0.99        # discount factor, the same γ from the TD error above
gae_lambda = 0.95   # GAE blending: how many future steps to include

Both are design choices, like hidden = 128 or ent_coef = 0.10.
Their product
| Steps ahead | Weight | How much it counts |
|---|---|---|
| 0 (this step) | 1.00 | 100% |
| 1 | 0.94 | 94% |
| 5 | 0.73 | 73% |
| 10 | 0.54 | 54% |
| 20 | 0.29 | 29% |
| 50 | 0.05 | 5% (barely matters) |
So the advantage is mostly about what happened in the next few steps, with a fading tail of future information. The
In the code (rl_weight_agent.py:555), this is computed efficiently by working backwards:
last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae

Why backwards? The naive approach would be to compute each step's advantage from scratch using the full sum, but that repeats a lot of work. There is a shortcut. Say there are 4 steps. The full GAE formula gives:
A₃ = δ₃ (last step — nothing after)
A₂ = δ₂ + 0.94 × δ₃ (this + decayed next)
A₁ = δ₁ + 0.94 × δ₂ + 0.94² × δ₃ (this + next + next-next)
A₀ = δ₀ + 0.94 × δ₁ + 0.94² × δ₂ + 0.94³ × δ₃ (all four steps)
But notice that each one contains the previous:
A₃ = δ₃
A₂ = δ₂ + 0.94 × A₃ ← just reuse A₃
A₁ = δ₁ + 0.94 × A₂ ← just reuse A₂
A₀ = δ₀ + 0.94 × A₁ ← just reuse A₁
Each step's advantage = its own TD error + 0.94 × the next step's advantage. So start at the end and work backwards: last_gae = delta + 0.94 * last_gae.
"But how can it start from the end without knowing the beginning?" It doesn't need the beginning. Think of it like building a snowball rolling uphill:
Step 3 (last): last_gae = δ₃ ← just this step
Step 2: last_gae = δ₂ + 0.94 × δ₃ ← add one layer
Step 1: last_gae = δ₁ + 0.94 × (δ₂ + 0.94 × δ₃) ← add another
Step 0 (first): last_gae = δ₀ + 0.94 × (everything above) ← full sum
At step 3, it only needs δ₃ — nothing comes after the last step. At step 2, it only needs δ₂ and the last_gae it already computed (which contains δ₃). Each step only needs two things: its own TD error and the running total from everything after it. By the time it reaches step 0, the snowball contains everything.
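The snowball argument can be verified by computing the advantages both ways and comparing. This is a plain-Python sketch: the function names, the four TD errors, and the convention that `dones[t]` means "the episode ended after step t" are assumptions for illustration, and the exact product 0.99 × 0.95 = 0.9405 is used where the text rounds to 0.94.

```python
def gae_backward(deltas, dones, gamma=0.99, gae_lambda=0.95):
    """Walk from the last step to the first, reusing the running total."""
    advantages = [0.0] * len(deltas)
    last_gae = 0.0
    for t in reversed(range(len(deltas))):
        next_non_terminal = 0.0 if dones[t] else 1.0   # cut the chain at episode ends
        last_gae = deltas[t] + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    return advantages

def gae_naive(deltas, gamma=0.99, gae_lambda=0.95):
    """Recompute the full discounted sum from scratch at every step."""
    decay = gamma * gae_lambda
    return [
        sum(decay ** k * d for k, d in enumerate(deltas[t:]))
        for t in range(len(deltas))
    ]

deltas = [0.3, -0.1, 0.2, 0.5]                      # hypothetical TD errors
backward = gae_backward(deltas, dones=[False] * 4)
naive = gae_naive(deltas)

# If the episode ends after step 1, steps 2 and 3 no longer leak into steps 0 and 1
split = gae_backward(deltas, dones=[False, True, False, False])

print("backward:", [round(a, 4) for a in backward])
print("naive:   ", [round(a, 4) for a in naive])
print("split:   ", [round(a, 4) for a in split])
```

The backward pass does one multiply-add per step, while the naive version re-sums the whole tail every time, so the two agree in value but not in cost.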
The next_non_terminal part handles episode endings — if the episode ended at step
The problem PPO solves. Imagine the agent tries some portfolio weights and gets a great reward. The simplest learning rule is: "that worked well, do way more of it next time." So the agent massively increases the probability of those weights. But here is the danger — that one great result might have been partly luck (a favourable random draw in Step 1). By swinging the policy hard toward that one experience, the agent might completely wreck its behaviour for all other market conditions. It goes from decent to catastrophically broken in a single update.
This actually happens in practice. The pre-PPO approach (called "vanilla policy gradient") is:

$$L = -\mathbb{E}_t\left[\log \pi_\theta(a_t \mid s_t) \, A_t\right]$$

Breaking it down piece by piece:

| Symbol | What it is | Where it comes from |
|---|---|---|
| $L$ | The loss: the single number gradient descent tries to make smaller | This is the policy loss, one of the three losses in Section 0 |
| $\mathbb{E}_t$ | "Average over all the actions in the batch" | The agent collects many (state, action, reward) tuples; this averages across them |
| $\log \pi_\theta(a_t \mid s_t)$ | The log probability of the action the agent took | Computed in Step 3 of the policy: "how likely was this specific action under the current policy?" |
| $A_t$ | The advantage: was this action better or worse than average? | Computed in Step 3 above (TD error + GAE) |
| The $-$ sign | Flips the direction: minimizing $-X$ is the same as maximizing $X$ | Gradient descent minimizes, but we want to maximize good actions |
What does the whole thing do? For each action the agent took, multiply its log probability by its advantage. If
The problem is there is no limit on how much the policy can change in one step. A single action with a large advantage can swing the entire policy, destroying behaviour that took thousands of steps to learn. PPO's entire purpose is to prevent this: learn from good and bad experiences, but never change the policy too much in one update.
Why there is a "new" and "old" policy. Training does not happen action-by-action. Instead, the agent first collects a batch of experience — it plays through 256 steps of market data using its current policy, recording every (state, action, reward) along the way. During this collection phase, the policy is frozen — it does not change. Call this frozen snapshot the "old" policy.
After collecting the batch, training begins. Gradient descent starts tweaking the network's 33,000 parameters to improve the policy. But it does not collect new data — it reuses the same 256 steps it already collected. This is efficient (no need to re-simulate the market) but creates a problem: after a few gradient updates, the parameters have changed, so the policy is no longer the same as when the data was collected. The updated policy is the "new" policy.
This matters because the data was collected by the old policy. If the new policy drifts too far from the old one, the collected data becomes misleading — the actions in the batch were chosen by the old policy's logic, so they might not be representative of what the new policy would do. It is like studying for an exam using someone else's notes — useful if their approach is similar to yours, but misleading if it's completely different. PPO's clipping keeps the new policy within ±20% of the old one, so the collected data stays relevant.
In the code (rl_weight_agent.py:1075-1076), these are the batch size and the number of training passes over that same batch:
n_steps = 256 # collect 256 steps with the old policy
n_epochs = 6     # then train on that same batch 6 times

n_steps = 256 means the agent plays through 256 rebalancing steps using the current (frozen) policy. At each step it sees a market state, picks portfolio weights, gets a reward, and moves to the next state. All 256 (state, action, reward) tuples are stored in a buffer. During this phase, the network parameters do not change; the agent is just collecting data, not learning. Think of it as a student taking 256 practice exams before sitting down to study.
n_epochs = 6 — now the agent studies. It takes those same 256 stored experiences and trains on them 6 times. Each pass through the batch is one "epoch." In the first epoch, gradient descent tweaks the parameters a bit. In the second epoch, it tweaks them a bit more, using the same 256 experiences. By the 6th epoch, the policy may have drifted noticeably from the frozen snapshot that collected the data — which is exactly why the clipping exists (to prevent it from drifting too far).
After all 6 epochs, the agent throws away the old buffer, collects a fresh 256 steps with the now-updated policy, and repeats. This collect → train → collect → train cycle continues until training ends (total_timesteps = 200,000).
The two questions PPO asks. For every action in the batch, PPO checks:
- Was this action good or bad? That is the advantage $A_t$ from Step 3. Positive = better than expected, negative = worse.
- How much has the policy already drifted? Compare the probability of this action under the new (updated) policy vs. the old (frozen) policy that actually collected the data:

$$r_t = \frac{\pi_{\text{new}}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$$
At the start of training (before any gradient updates), the new and old policies are identical, so
| $r_t$ | What it means |
|---|---|
| 1.0 | The policy hasn't changed yet — new and old are identical |
| 1.3 | The new policy is 30% more likely to take this action than the old one was |
| 0.7 | The new policy is 30% less likely to take this action |
In the code (rl_weight_agent.py:1116), this ratio is computed in log space (subtracting logs then exponentiating is the same as dividing, since $e^{\ln a - \ln b} = a / b$):

ratio = mx.exp(new_log_probs - old_logprob_batch)

The clipping rule. Now PPO combines the advantage and the ratio, but with a safety limit. The rule is:
- If the action was good ($A_t > 0$): let the policy increase the probability of this action, but cap $r_t$ at 1.2. Once the new policy is already 20% more likely to take this action, stop pushing: you've changed enough.
- If the action was bad ($A_t < 0$): let the policy decrease the probability, but floor $r_t$ at 0.8. Once the new policy is already 20% less likely, stop pushing.
The 0.2 margin comes from clip_range = 0.2 (rl_weight_agent.py:1079). It is a design choice — 0.2 means "the policy can change by at most ±20% per update."
Concrete example. The agent put 25% in SMH and got a great result (
| Update step | $r_t$ | $r_t \times A_t$ | Clipped? | What happens |
|---|---|---|---|---|
| Start | 1.0 | 2.0 | No | Normal learning |
| After 2 nudges | 1.1 | 2.2 | No | Still learning |
| After 5 nudges | 1.2 | 2.4 | Hits cap | Gradient goes to 0 — stop |
| After 6 nudges | 1.3 | would be 2.6 | Clipped to 1.2 | No further change |
Without clipping,
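The formal formula says exactly this:

$$L_{\text{policy}} = -\mathbb{E}_t\left[\min\big(r_t A_t,\ \text{clip}(r_t, 0.8, 1.2) \, A_t\big)\right]$$

| Piece | What it does |
|---|---|
| $r_t A_t$ | The unclipped version: ratio × advantage, no safety limit |
| $\text{clip}(r_t, 0.8, 1.2) \, A_t$ | The clipped version: if $r_t$ falls outside $[0.8, 1.2]$, it is pulled back to the nearest bound before being multiplied by the advantage |
| $\min(\cdot, \cdot)$ | Take whichever is smaller: always pick the more conservative update |
| The leading $-$ | Negate so gradient descent minimizes (same trick as the vanilla formula above) |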
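The clipped objective is short enough to sketch in plain Python. The function name `clipped_policy_loss` is an assumption, and the three input cases are illustrative: the ratio 1.3 with advantage 2.0 matches the table above, while the bad-action case (ratio 0.7, advantage -2.0) is a hypothetical example. `clip_range = 0.2` comes from the text.

```python
def clipped_policy_loss(ratios, advantages, clip_range=0.2):
    """Negated mean of min(unclipped, clipped) surrogate terms."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        surr1 = ratio * adv                                           # unclipped
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        surr2 = clipped_ratio * adv                                   # clipped
        terms.append(min(surr1, surr2))                               # conservative pick
    return -sum(terms) / len(terms)

# Good action, policy already drifted past the cap: min(2.6, 2.4) = 2.4
loss_drifted = clipped_policy_loss([1.3], [2.0])

# Good action, small drift: both versions give 2.2, so nothing is clipped
loss_small = clipped_policy_loss([1.1], [2.0])

# Bad action, policy already pushed below the floor: min(-1.4, -1.6) = -1.6
loss_bad = clipped_policy_loss([0.7], [-2.0])

print(f"drifted good action: {loss_drifted:+.2f}")
print(f"small drift:         {loss_small:+.2f}")
print(f"bad action:          {loss_bad:+.2f}")
```

In the first and third cases the selected term uses the clipped ratio, which is a constant with respect to the policy parameters, so no gradient flows from those samples.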
Worked example — good action, policy has drifted too far. Say
Step 1 — Compute the unclipped version:
Step 2 — Compute the clipped version. First, apply the clip function to
Then multiply by the advantage:
Step 3 — Take the min of the two:
Step 4 — Negate:
The unclipped version wanted to give a score of 2.6, but the clipped version capped it at 2.4. The
Worked example — policy hasn't drifted much. Same good action (
Both versions agree — no clipping needed. The gradient flows normally and the policy keeps updating.
Worked example — bad action. Now
Again, the clipped version is chosen by the
The pattern: clipping never kicks in when
In the code (rl_weight_agent.py:1117-1119):
surr1 = ratio * advantage_batch # unclipped
surr2 = mx.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage_batch # clipped
policy_loss = -mx.minimum(surr1, surr2).mean()                                 # take the conservative one

PPO's total loss combines three components into one number (as previewed in Section 0):

$$L_{\text{total}} = L_{\text{policy}} + c_{\text{vf}} \, L_{\text{value}} - c_{\text{ent}} \, H$$

Which maps directly to the code at rl_weight_agent.py:1127:

total_loss = policy_loss + vf_coef * value_loss - ent_coef * entropy

| Term | Code name | What it does | Coefficient |
|---|---|---|---|
| $L_{\text{policy}}$ | policy_loss | Makes the Actor better at choosing actions (Step 4) | 1.0 (no scaling) |
| $L_{\text{value}}$ | value_loss | Makes the Critic better at predicting rewards (Part 2 below) | 0.5 (vf_coef) |
| $H$ | entropy | Keeps the agent exploring (Part 3 below) | 0.10 for weight agent, 0.05 for regime (ent_coef) |
The
What we just covered in Step 4. Makes the Actor better at choosing actions — increase the probability of actions that were better than expected, decrease for worse, but never change more than ±20% per batch.
Remember the value head from Step 2 — the Critic that outputs
The value loss trains the Critic by asking: "after the episode played out, how far off was your prediction?"

$$L_{\text{value}} = \frac{1}{N} \sum_t \left(V_\theta(s_t) - R_t\right)^2$$
This is the same mean squared error from Section 0 — the house price example where
- $V_\theta(s_t)$ = the Critic's guess: "from this state, I predict total reward of X"
- $R_t$ = the actual target return: what really happened (computed from the rewards the agent actually earned)
- $(V_\theta(s_t) - R_t)^2$ = squared error for one step: how wrong the guess was
- $\frac{1}{N}\sum_t$ = average across all $N$ steps in the batch
Where does the target
Concrete example: The Critic looked at a bull market and predicted
In the code (rl_weight_agent.py:1121):
value_loss = ((values - return_batch) ** 2).mean()

The coefficient 0.5 (vf_coef at line 1081) scales this loss down by half before adding it to the total, which prevents the Critic's updates from overwhelming the Actor's updates.
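As a small worked instance of the mean squared error above, the sketch below uses three made-up predictions and target returns. The function name `value_loss_mse` and all numbers are hypothetical; only `vf_coef = 0.5` comes from the text.

```python
def value_loss_mse(values, returns):
    """Mean squared error between the Critic's guesses and the target returns."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)

vf_coef = 0.5                      # scaling quoted in the text

values = [2.0, 1.5, 0.5]           # hypothetical Critic predictions V(s_t)
returns = [2.5, 1.0, 0.5]          # hypothetical target returns R_t

loss = value_loss_mse(values, returns)   # (0.25 + 0.25 + 0.0) / 3
scaled = vf_coef * loss                  # the part that enters total_loss

print(f"value_loss           = {loss:.4f}")
print(f"vf_coef * value_loss = {scaled:.4f}")
```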
This is the same entropy bonus explained in detail in Force 2 of the σ tug-of-war earlier. Here is a quick recap of why it's in the loss function and how the two agents use it differently.
The problem it solves. Without the entropy bonus, the agent would quickly become overconfident — the regime agent might learn to pick RISK_ON 99% of the time, and the weight agent's bell curves would collapse to near-zero width. In both cases, the agent stops exploring and gets stuck with whatever strategy it found first, even if something better exists.
How it works. The entropy bonus measures how "spread out" the agent's decisions are. High entropy = lots of randomness (still exploring). Low entropy = predictable (locked in). The loss function subtracts the entropy, which means higher entropy → lower loss → the optimizer is rewarded for keeping the agent's choices spread out.
The two agents compute entropy differently:
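Regime agent (discrete: picking 1 of 3 regimes). The entropy is Shannon entropy applied to the regime probabilities:

$$H = -\sum_{i=1}^{3} p_i \ln p_i$$

Concrete example: if the regime agent outputs probabilities [0.80, 0.15, 0.05] for [risk-on, risk-reduced, defensive]:

$$H = -(0.80 \ln 0.80 + 0.15 \ln 0.15 + 0.05 \ln 0.05) \approx 0.61$$

If the probabilities were perfectly even [0.33, 0.33, 0.33], entropy would be $\ln 3 \approx 1.10$, the maximum possible for three choices.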
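Both regime-entropy cases can be evaluated directly. A minimal sketch, where the function name `categorical_entropy` is an assumption and the probabilities are the ones from the example:

```python
import math

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

confident = categorical_entropy([0.80, 0.15, 0.05])   # mostly risk-on
uniform = categorical_entropy([1 / 3, 1 / 3, 1 / 3])  # perfectly even

print(f"confident policy: H = {confident:.4f}")
print(f"uniform policy:   H = {uniform:.4f}  (ln 3 = {math.log(3):.4f})")
```

The confident policy scores roughly half the entropy of the uniform one, so the entropy bonus rewards it less.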
Weight agent (continuous: outputting 12 bell curves). The entropy is the Gaussian entropy derived earlier:

$$H = \frac{12}{2}(1 + \ln 2\pi) + \sum_{i=1}^{12} \ln \sigma_i$$
This depends entirely on
The coefficients. The entropy bonus is scaled by ent_coef before being subtracted from the loss:
| Agent | ent_coef |
Why |
|---|---|---|
| Regime agent | 0.05 | 3 discrete choices — doesn't need as much encouragement to explore |
| Weight agent | 0.10 (line 1080) | 12 continuous numbers — much easier for the bell curves to collapse, so needs a stronger push |
The weight agent gets double the entropy bonus because continuous action spaces are much more prone to collapsing. With only 3 choices, the regime agent can't get that stuck. With 12 independent bell curves, the weight agent can narrow all of them to near-zero and lock into one rigid allocation — the higher coefficient fights this harder.
Here is how it all fits together in each iteration:
1. Collect experience (n_steps = 128 or 256 steps):

- Run the current policy in the environment
- At each step, store the tuple $(s_t, a_t, r_t, V(s_t), \log \pi(a_t \mid s_t))$ in the rollout buffer
- When an episode ends (full backtest traversal), reset the environment and continue collecting

2. Compute advantages using GAE:

- Get the critic's value estimate for the final state (bootstrap)
- Walk backwards through the buffer computing $\delta_t$ and GAE advantages

3. Normalize advantages to zero mean, unit variance:

$$\hat{A}_t = \frac{A_t - \text{mean}(A)}{\text{std}(A) + 10^{-8}}$$
This is like converting test scores to a curve. Suppose the raw advantages in a batch are [+50, +48, +52, +0.1, +0.3]. The first three actions would completely dominate the gradient update, and the last two would barely matter, even though those last two might contain useful information about what to avoid. After normalization, they become approximately [+0.82, +0.73, +0.90, -1.23, -1.22]: all roughly the same magnitude, so every action gets a fair say in the update.
The formula does three things:
- Subtract the mean (the average of all advantages in the batch) — now the average advantage is 0. Actions that were above average become positive, below average become negative.
- Divide by the standard deviation (the spread of advantages in the batch) — now the spread is 1. No single action can dominate by being 100× larger than the others.
- Add 0.00000001 to the denominator — just a safety net to prevent dividing by zero if all advantages happen to be identical.
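The three steps in the list can be run on the example batch. This plain-Python sketch assumes the population standard deviation (dividing by the batch size), which is the default in array libraries such as NumPy; the function name `normalize_advantages` is hypothetical.

```python
import math

def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit spread, guarding against a zero std."""
    n = len(advantages)
    mean = sum(advantages) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / n)  # population std
    return [(a - mean) / (std + eps) for a in advantages]

raw = [50.0, 48.0, 52.0, 0.1, 0.3]
normalized = normalize_advantages(raw)

print([round(a, 2) for a in normalized])

# The safety net: identical advantages would otherwise divide by zero
flat = normalize_advantages([1.0, 1.0, 1.0])
print(flat)
```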
In the code (rl_weight_agent.py:1166-1167):
adv_mean, adv_std = advantages.mean(), advantages.std() + 1e-8
advantages = (advantages - adv_mean) / adv_std

4. PPO update (run n_epochs = 6 passes over the collected data):
You just collected 256 steps of experience. Instead of feeding all 256 into the loss function at once, the code randomly shuffles them and chops them into 4 mini-batches of 64. Think of it like shuffling a deck of 256 cards and dealing 4 hands. Why not use all 256 at once? Smaller batches mean more frequent weight updates (4 updates per pass instead of 1), and the randomness from shuffling helps the network avoid getting stuck in a rut.
For each mini-batch of 64 steps, the code does three things:
- Compute the loss: run the Step 5 combined loss function on those 64 examples (clipped policy loss + value loss + entropy bonus)
- Backpropagate: the computer calculates "if I nudge each of the ~33,000 network weights up or down slightly, how would the loss change?" This gives a direction to move each weight (called the gradient)
- Update the weights: move each weight a tiny step in the direction that lowers the loss. The Adam optimizer decides the step size (learning rate lr = 0.0001)

Between backpropagation and the weight update, each mini-batch's gradient passes one more check: is the total gradient magnitude larger than 0.5? If so, the code shrinks the gradient proportionally down to exactly 0.5. This is a speed limit on how fast the weights can change in a single update. Without it, one weird mini-batch could produce a huge gradient that wrecks the network in one step.
# Gradient clipping (rl_weight_agent.py:1199-1203)
if gn > max_grad_norm:              # max_grad_norm = 0.5
    scale = max_grad_norm / gn      # e.g. if gn = 2.0, scale = 0.25
    grads = grads * scale           # shrink all gradients proportionally

Then the entire shuffle-and-process cycle repeats 6 times (n_epochs = 6) on the same 256 steps:
| Per epoch | Total (6 epochs) | |
|---|---|---|
| Mini-batches processed | 4 (= 256 / 64) | 24 |
| Weight updates | 4 | 24 |
| Gradient clips checked | 4 | 24 |
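The counts in the table, and the clipping arithmetic, can be reproduced with a skeleton of the loop. This sketch leaves out the actual loss and optimizer; the function names `clip_by_global_norm` and `count_updates` and the two-element gradient are hypothetical, while the sizes (256, 64, 6) and `max_grad_norm = 0.5` come from the text.

```python
import math
import random

def clip_by_global_norm(grads, max_grad_norm=0.5):
    """Shrink the whole gradient proportionally if its norm exceeds the limit."""
    gn = math.sqrt(sum(g * g for g in grads))
    if gn > max_grad_norm:
        scale = max_grad_norm / gn
        grads = [g * scale for g in grads]
    return grads, gn

def count_updates(buf_len=256, batch_size=64, n_epochs=6, seed=0):
    """Shuffle, split into mini-batches, and count one update per mini-batch."""
    rng = random.Random(seed)
    updates = 0
    for _ in range(n_epochs):
        indices = list(range(buf_len))
        rng.shuffle(indices)                            # reshuffle each epoch
        for start in range(0, buf_len, batch_size):
            batch = indices[start:start + batch_size]   # 64 step indices
            assert len(batch) == batch_size
            # loss, backpropagation, clipping, and the optimizer step would go here
            updates += 1
    return updates

clipped, norm_before = clip_by_global_norm([1.2, -1.6])  # norm 2.0, so scale 0.25
norm_after = math.sqrt(sum(g * g for g in clipped))
total_updates = count_updates()

print(f"norm before = {norm_before:.2f}, norm after = {norm_after:.2f}")
print(f"total updates = {total_updates}")
```

Scaling every component by the same factor keeps the gradient's direction and changes only its length.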
In the code (rl_weight_agent.py:1175-1206):
for epoch in range(n_epochs):                      # 6 passes
    indices = np.random.permutation(buf_len)       # reshuffle each time
    for start in range(0, buf_len, batch_size):    # 4 mini-batches of 64
        # ... compute loss, backpropagate, clip gradients, update weights

5. Repeat from step 1 until total_timesteps is reached
| Algorithm | Problem | Why not for this use case |
|---|---|---|
| DQN | Only works for discrete actions | Cannot output 12 continuous portfolio weights |
| A2C | No clipping — unstable with small data | The backtest has ~50 decisions per episode; cannot afford instability |
| TRPO | Trust region via constrained optimization — slow | Requires computing the Fisher Information Matrix; computationally expensive |
| PPO | Clipping approximates trust region cheaply | Fast, stable, works for both discrete (regime) and continuous (weights) |
| SAC | Off-policy — needs replay buffer, more complex | PPO's on-policy simplicity is sufficient for this problem scale |
PPO hits the sweet spot: almost as stable as TRPO, almost as simple as A2C, and works for both agents in the hierarchy.
The system operates as a principal-agent hierarchy:
- The Regime Agent (principal) observes the macro state (25 dimensions: VIX, SPY momentum, drawdowns, ML probability, etc.) and outputs a discrete regime decision (RISK_ON / RISK_REDUCED / DEFENSIVE)
- The regime decision is encoded as a one-hot vector and prepended to the Weight Agent's observation
- The Weight Agent (subordinate) observes the full state (103 dimensions: regime encoding + per-asset signals + portfolio state) and outputs 12 continuous weights via softmax
- During training, the Regime Agent is frozen (pre-trained) while the Weight Agent learns. During inference, both run in sequence: regime first, then weights conditioned on that regime
- In the current production configuration, the Regime Agent's learned policy is bypassed in favor of a simple rule (SPY > 200-SMA = RISK_ON) because the learned regime policy shows a 57.1% RISK_REDUCED / 36.2% RISK_ON / 6.8% DEFENSIVE distribution, which is too cautious compared to the rule-based baseline's 84.2% RISK_ON. The Weight Agent still operates, preserving the benefits of learned allocation while using the more reliable rule-based regime signal.
PPO is an on-policy actor-critic algorithm that stabilizes policy gradient updates via a clipped surrogate objective. At each iteration, the system collects a rollout of experience, computes GAE advantages (a bias-variance balanced estimate of action quality), then performs multiple epochs of mini-batch gradient descent on the clipped loss. The clipping mechanism —