The caching algorithm in this repo is based on Online Linear Optimization. Below is a short explainer; some details are still missing.

# Online Linear Optimization (OLO)

## Static Regret
Online Linear Optimization (OLO) is a framework where, at each round, a decision maker selects an action from a convex set before a linear loss function is revealed. The goal is to minimize the cumulative loss compared to the best fixed action in hindsight—a difference known as **static regret**.

In OLO, the **static regret** after $T$ rounds is defined as
$$
\text{Regret}_T = \sum_{t=1}^T \langle g_t, x_t \rangle - \min_{x \in K} \sum_{t=1}^T \langle g_t, x \rangle,
$$ 

where $x_t \in K$ is the decision at round $t$, $g_t$ is the gradient (or loss vector) revealed at round $t$, and $K$ is the convex decision set.
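To make the definition concrete, here is a minimal sketch that evaluates the static regret of a given sequence of decisions. It assumes, purely for illustration, that $K$ is a Euclidean ball of a given radius, because the best fixed action then has a closed form; the function name `static_regret_l2_ball` and the `radius` parameter are hypothetical and not part of this repo.

```python
import numpy as np

def static_regret_l2_ball(decisions, gradients, radius=1.0):
    """Static regret when K is the L2 ball of the given radius (illustrative assumption).

    decisions: array of shape (T, d), row t is x_t
    gradients: array of shape (T, d), row t is g_t
    """
    decisions = np.asarray(decisions, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    # Cumulative online loss: sum_t <g_t, x_t>
    online_loss = float(np.sum(decisions * gradients))
    # On an L2 ball, min_x <G, x> with G = sum_t g_t equals -radius * ||G||
    best_fixed_loss = -radius * float(np.linalg.norm(gradients.sum(axis=0)))
    return online_loss - best_fixed_loss
```

For other decision sets the second term requires solving a linear minimization over $K$ instead of using this closed form.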


## Online Projected Gradient Descent (OPGD)
A popular algorithm in this setting is **Online Projected Gradient Descent (OPGD)**. At each iteration, the algorithm updates the current decision by stepping in the direction of the negative gradient of the loss and then projects back onto the feasible set to maintain constraints. This approach leverages the convexity of the decision set and the linearity of losses, offering a simple yet effective method for minimizing static regret over time.


The update rule for OPGD is

$$
x_{t+1} = \Pi_K\Bigl( x_t - \eta\, g_t \Bigr),
$$ 

where $\eta > 0$ is the learning rate, and $\Pi_K(y) = \arg\min_{x \in K} \|x - y\|_2$ denotes the Euclidean projection of $y$ onto the set $K$.
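The update rule can be sketched in a few lines. As an illustrative assumption, $K$ is again taken to be a Euclidean ball so that the projection is a simple rescaling; `project_l2_ball` and `opgd` are hypothetical helper names, not functions from this repo.

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Euclidean projection onto the L2 ball (stand-in for Pi_K)."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def opgd(gradients, x1, eta, radius=1.0):
    """Run OPGD and return the decisions x_1, ..., x_T that were played."""
    x = np.asarray(x1, dtype=float)
    played = []
    for g in gradients:
        played.append(x.copy())  # x_t is committed before g_t is revealed
        # Gradient step followed by projection back onto K
        x = project_l2_ball(x - eta * np.asarray(g, dtype=float), radius)
    return np.array(played)
```

Swapping in a different decision set only requires replacing the projection function.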

## Regret bound for OPGD
A central result in OLO is that **OPGD achieves a regret bound on the order of $\sqrt{T}$** for a well-chosen $\eta$ when $K$ is bounded and convex and the gradients $g_t$ are bounded in norm, meaning that

$$
\text{Regret}_T = O(\sqrt{T}).
$$ 

This $\sqrt{T}$ bound is important because it implies that the **average regret per round**, $\frac{\text{Regret}_T}{T}$, tends to zero as $T$ increases, ensuring that the algorithm performs nearly as well as the best fixed decision in hindsight even against adversarial losses.

The key intuition behind this result is that, with an appropriately chosen learning rate $\eta$, the incremental loss incurred by each OPGD update can be controlled through the properties of convexity and the geometry of the projection step. By carefully balancing the step size and the accumulated errors (using a telescoping sum argument), one shows that the total deviation from the best fixed action does not exceed a term proportional to $\sqrt{T}$. This makes OPGD particularly effective in the online setting, where decisions must be made sequentially without prior knowledge of future loss functions.
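The telescoping argument can be sketched as follows. The symbols $D$ (an upper bound on the Euclidean diameter of $K$), $G$ (an upper bound on $\|g_t\|$) and the comparator $u \in K$ are notation introduced here for illustration. Because the projection onto a convex set never increases the distance to a point of that set,

$$
\|x_{t+1} - u\|^2 \leq \|x_t - \eta\, g_t - u\|^2 = \|x_t - u\|^2 - 2\eta \langle g_t, x_t - u \rangle + \eta^2 \|g_t\|^2 .
$$

Rearranging and summing over $t = 1, \dots, T$, the distance terms telescope:

$$
\sum_{t=1}^T \langle g_t, x_t - u \rangle \leq \frac{\|x_1 - u\|^2 - \|x_{T+1} - u\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \|g_t\|^2 \leq \frac{D^2}{2\eta} + \frac{\eta\, G^2\, T}{2}.
$$

Choosing $\eta = \frac{D}{G\sqrt{T}}$ balances the two terms and gives $\text{Regret}_T \leq D\, G \sqrt{T}$, since the bound holds for every $u \in K$ and in particular for the best fixed action.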


## Sources OLO
- o3 mini (prompt: "write a text introducing OLO online linear optimization, keep it short introduce static regret and online projected gradient descent, explain that an important result or the essence why OLO works is that OPGD obtains sqrt T regret")

- Hazan: http://arxiv.org/abs/1909.05207 (Introduction to Online Convex Optimization)

- Ashok's thesis: https://www-cs.stanford.edu/people/ashokc/papers/thesis.pdf (PhD thesis on Parameter-Free Online Learning)

- Orabona:  http://arxiv.org/abs/1912.13213 (A Modern Introduction to Online Learning)

- convex set wiki: https://en.wikipedia.org/wiki/Convex_set

# OLO framework for caching


Fractional caching can be analyzed as an OLO problem.

## Representing fractional caching strategy
Let $y$ be a vector where each component $y^{n}$ represents the fraction of item $n$ that is cached, with $N$ items in total. For a cache of size $C$, the feasible set for $y$ is:

$$
\mathcal{Y} = \left\{ y \in [0,1]^N \middle| \sum_{n=1}^N y^{n} \leq C \right\},
$$ 

Since there always exists a full-cache strategy that performs at least as well as any non-full one (assuming $C \leq N$), we restrict all fractional cache states to

$$
\mathcal{Y}_{full} = \left\{ y \in [0,1]^N \middle| \sum_{n=1}^N y^{n} = C \right\}.
$$ 
Note that it is essential that the feasible set is convex. For integral caching ($y \in \{0,1\}^N$) this is not the case, so OLO isn't directly applicable.
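A small worked example (with values chosen here only for illustration) shows why the integral case fails to be convex. Take $N = 2$ and $C = 1$. The two integral caches are

$$
y_a = (1, 0), \qquad y_b = (0, 1),
$$

and their midpoint is

$$
\tfrac{1}{2} y_a + \tfrac{1}{2} y_b = \left(\tfrac{1}{2}, \tfrac{1}{2}\right).
$$

This point satisfies the capacity constraint and is a valid fractional cache, but it is not in $\{0,1\}^2$, so the set of integral caches is not closed under convex combinations.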


## Optimal Fractional Static Cache

To see why fractional caching is an OLO problem, it helps to first frame finding the optimal fractional static cache in hindsight as a constrained linear optimization problem. Because $\mathcal{Y}$ is defined by linear constraints and the objective is linear, this is a linear program.

To pose an optimization problem we first have to choose our objective (a loss or utility function). In caching the goal is to minimize the miss rate or, equivalently, to maximize the hit rate. Consider a sequence of $T$ requests $l_1, \dots, l_T$ with $l_t \in \{1, \dots, N\}$, where $l_{t}$ is the index of the $t$-th requested item. We define the hit rate of a static cache $y \in \mathcal{Y}_{full}$ on this sequence as:

$$
H_{T}(y) = \frac{1}{T}\sum_{t=1}^{T} y^{l_t} .
$$ 
To keep the $g_{t}$ independent of $T$, we drop the factor $\frac{1}{T}$ for now and reintroduce it later; scaling by a positive constant does not change the minimizer. To match the static regret definition, which uses minimization, we minimize the negative hit rate, and we use the standard basis vectors $e^{l_{t}}$ to select component $l_{t}$ of $y$ through the inner product, $\langle e^{l_t}, y \rangle = y^{l_t}$. Finding the optimal fractional static cache in hindsight $y_{opt}$ is then the following optimization problem:

\begin{align*}
y_{opt} &= \arg\min_{y \in \mathcal{Y}_{full}} \left(- \sum_{t=1}^{T} \langle e^{l_{t}} , y \rangle \right) \\
        &= \arg\min_{y \in \mathcal{Y}_{full}} \left( \sum_{t=1}^{T} \langle g_{t} , y \rangle \right) 
\end{align*}
with $g_{t} = - e^{l_{t}}$. The $g_{t}$ are the gradients (loss vectors) of our OLO caching problem. Since every $g_t$ has only non-positive entries, minimizing over $\mathcal{Y}$ instead of $\mathcal{Y}_{full}$ gives the same optimal value, so $\mathcal{Y}$ can serve as the decision set $K$. With this, caching fits the OLO template: at round $t$ the cache state $y_t$ is chosen, request $l_t$ arrives, and the loss $\langle g_t, y_t \rangle = -y_t^{l_t}$ is incurred.
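For this particular linear program the solution can be read off directly: the objective weights each $y^{n}$ by how often item $n$ was requested, so filling the cache with the most requested items is optimal. The sketch below assumes an integer cache size $C \leq N$ and 0-indexed items (the text above indexes items from 1); `static_hit_rate` and `optimal_static_cache` are hypothetical names, not functions from this repo.

```python
from collections import Counter
import numpy as np

def static_hit_rate(y, requests):
    """H_T(y): average cached fraction of the requested items."""
    return float(np.mean([y[l] for l in requests]))

def optimal_static_cache(requests, N, C):
    """Best static cache in hindsight, assuming an integer C <= N."""
    counts = np.zeros(N)
    for item, c in Counter(requests).items():
        counts[item] = c
    y = np.zeros(N)
    # Fully cache the C most requested items (ties broken by lowest index)
    top = np.argsort(-counts, kind="stable")[:C]
    y[top] = 1.0
    return y
```

With requests `[0, 0, 1, 2, 2, 2]`, `N = 4` and `C = 2`, items 2 and 0 are cached and five of the six requests are hits.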

## OPGD for caching

Applying OPGD to our fractional caching formulation defines the following online fractional caching strategy. Initialize with an arbitrary $y_{1}\in \mathcal{Y}$, then for $t = 1, 2, \dots$ update

\begin{align*}
y_{t+1} &= \Pi_{\mathcal{Y}} \Bigl( y_t - \eta\, g_t \Bigr) \\
        &= \Pi_{\mathcal{Y}} \Bigl( y_t + \eta\, e^{l_{t}} \Bigr).
\end{align*}



This strategy inherits the regret guarantee of OPGD. Using $-\langle g_t, y \rangle = y^{l_t}$, we get

\begin{align*}
\sum_{t=1}^T \langle g_t, y_t \rangle -  \sum_{t=1}^T \langle g_t, y_{opt} \rangle &\leq O(\sqrt{T})  \Leftrightarrow \\
\frac{1}{T}\sum_{t=1}^T \langle g_t, y_t \rangle -  \frac{1}{T} \sum_{t=1}^T \langle g_t, y_{opt} \rangle &\leq \frac{O(\sqrt{T})}{T} \Leftrightarrow \\
H_{T}(y_{opt}) - \frac{1}{T}\sum_{t=1}^T y_t^{l_t} &\leq O\left(\frac{1}{\sqrt{T}}\right),
\end{align*}

where $\frac{1}{T}\sum_{t=1}^T y_t^{l_t}$ is the hit rate of the online strategy. This means that as $T\rightarrow \infty$ the amount by which the hit rate of the online strategy falls short of the hit rate of the static optimum in hindsight vanishes.
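The OPGD caching strategy described in this section can be sketched as below. This is a simple illustration, not the efficient implementation used in this repo: the projection onto $\mathcal{Y}$ is computed by bisection on a shift $\tau$ such that clipping $z - \tau$ to $[0,1]$ sums to $C$, which is one standard way to project onto a capped simplex. Items are assumed to be 0-indexed, and `project_onto_cache_set` and `opgd_caching` are hypothetical names.

```python
import numpy as np

def project_onto_cache_set(z, C, iters=60):
    """Euclidean projection onto {y in [0,1]^N : sum(y) <= C} via bisection."""
    z = np.asarray(z, dtype=float)
    clipped = np.clip(z, 0.0, 1.0)
    if clipped.sum() <= C:
        return clipped  # capacity constraint inactive: box projection suffices
    # Otherwise find tau > 0 with sum(clip(z - tau, 0, 1)) == C.
    lo, hi = z.min() - 1.0, z.max()  # the sum is N at lo and 0 at hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(z - tau, 0.0, 1.0).sum() > C:
            lo = tau
        else:
            hi = tau
    return np.clip(z - hi, 0.0, 1.0)

def opgd_caching(requests, N, C, eta):
    """Run OPGD on a request trace; return the final cache and the online hit rate."""
    y = np.full(N, C / N)  # arbitrary feasible starting point y_1
    hits = 0.0
    for l in requests:
        hits += y[l]       # fractional hit collected for request l_t
        z = y.copy()
        z[l] += eta        # y_t - eta * g_t with g_t = -e^{l_t}
        y = project_onto_cache_set(z, C)
    return y, hits / len(requests)
```

On a trace that only alternates between two items, with room for two items in the cache, the cache state moves toward storing exactly those two items.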


## Sources OLO framework for caching
- o3 mini (explain the OLO framework for fractional caching)
- paper Paschos: http://arxiv.org/abs/1904.09849 (introduces OGA $\approx$ OPGD for caching)
- wiki linear programming: https://en.wikipedia.org/wiki/Linear_programming


# Fast OPGD for caching

Throughput and latency are important considerations for caching algorithms. A caching algorithm is impractical when incoming requests arrive faster than it can process them, so that the waiting time for handling each request grows without bound, or when the latency introduced by computing and updating the dynamic cache exceeds the latency of having no cache at all.

## Efficiency of fractional caching



## Sources fast OPGD for caching

- paper Carra: http://arxiv.org/abs/2405.01263 (introduces efficient implementations for fractional and integral caching)
- paper Salem: http://arxiv.org/abs/2101.12588 (departs from fractional caching to integral caching by a rounding scheme)

# Quantized Online Caching Descent (qOCD)



