Commit 7394176

Restore profit to observation list; fix risk-sensitive optimistic init value and narrative
1 parent 7c2f3de commit 7394176

2 files changed: +6 -5 lines changed

lectures/inventory_q.md

Lines changed: 3 additions & 2 deletions
@@ -471,8 +471,9 @@ All the manager needs to observe at each step is:
 
 1. the current inventory level $x$,
 2. the order quantity $a$, which they choose,
-3. the discount factor $\beta$, which is determined by the interest rate, and
-4. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
+3. the resulting profit $R_{t+1}$ (which appears on the books),
+4. the discount factor $\beta$, which is determined by the interest rate, and
+5. the next inventory level $X_{t+1}$ (which they can read off the warehouse).
 
 These are all directly observable quantities — no model knowledge is required.
 
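
Note on the restored item: a tabular Q-learning step cannot form its target without the realized profit, which is presumably why $R_{t+1}$ belongs on the list. The sketch below is a generic textbook update, not the lecture's implementation; the function name `q_update` and the step size `α` are hypothetical, and the lecture's own rule may differ (for example, by restricting the maximum to feasible orders).

```python
import numpy as np

def q_update(q, x, a, R_next, β, x_next, α=0.1):
    """One tabular Q-learning step built only from observed quantities.

    q       : 2-D array of Q-values indexed by (inventory, order)
    x, a    : current inventory level and chosen order quantity
    R_next  : realized profit R_{t+1}
    β       : discount factor
    x_next  : next inventory level X_{t+1}
    α       : step size (hypothetical default, not from the lecture)
    """
    target = R_next + β * q[x_next].max()   # bootstrapped target
    q[x, a] += α * (target - q[x, a])       # move the estimate toward the target
    return q

# Toy usage with made-up numbers
q = np.zeros((3, 3))
q = q_update(q, x=0, a=1, R_next=2.0, β=0.9, x_next=1)
```

Every argument is something the manager observes or chooses; no demand distribution or transition kernel appears.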

lectures/rs_inventory_q.md

Lines changed: 3 additions & 3 deletions
@@ -556,8 +556,8 @@ The logic is the same — initialize the Q-table so that every untried action lo
 
 Since the optimal policy *minimizes* $q$, "optimistic" means initializing the Q-table *below* the true values. When the agent tries an action, the update pushes $q$ upward toward reality, making that entry look worse and prompting the agent to try other actions that still appear optimistically good.
 
-The true Q-values are on the order of $\exp(-\gamma \, v^*) \approx 10^{-5}$ to $10^{-4}$.
-We initialize the Q-table at $10^{-5}$, modestly below this range.
+The true Q-values are on the order of $\exp(-\gamma \, v^*) \approx 10^{-8}$ to $10^{-6}$.
+We initialize the Q-table at $10^{-9}$, modestly below this range.
 
 ### Implementation
 
@@ -644,7 +644,7 @@ The wrapper function unpacks the model and provides default hyperparameters.
 ```{code-cell} ipython3
 def q_learning_rs(model, n_steps=20_000_000, X_init=0,
                   ε_init=1.0, ε_min=0.01, ε_decay=0.999999,
-                  q_init=1e-5, snapshot_steps=None, seed=1234):
+                  q_init=1e-9, snapshot_steps=None, seed=1234):
     x_values, d_values, ϕ_values, p, c, κ, β, γ = model
     K = len(x_values) - 1
     if snapshot_steps is None:
