![paper screen](img/paper_screen.png)

# Plan

* Markov decision process terminology and Q-learning
* Problem-specific notation
* Q-knn

# Markov decision process

## Notation

* $Z_t$ - state-time vector; $[0, T] \times E$ - state space;
* $A_z \in \mathbb{A}$ - the market maker's control that we are seeking; it consists of actions $\alpha_t$, and $\mathbb{A}$ is the set of all admissible strategies;
* $\lambda$ - intensity of the jump;
* $Q$- transitions kernel;
* $r$ - reward;

The process is Markov because the current state of the system is completely described by the state vector $z_t$, which depends only on the previous state and the action taken.

Our goal is to maximize the value function of the Markov decision process.

# Actions and rewards in Q-learning

For each state of the system $z_t$ we can select an action from a space $A_z$.

By choosing an action $\alpha_z$, we obtain a reward $r$. 

Hence, there is a reward function $R: Z_t \times A_z \to \mathbb{R}$ that defines the reward associated with each action in each state.

![q_learning](img/q_learning.png)

# Value function, reward function, action-value function (Q-function)

The value function is an estimate of the current position at time $t$ in state $z$. It is the expected total reward you can attain from that point on if you play your cards optimally.

$V(t, z) = \sup_{\alpha \in \mathbb{A}} \mathbb{E} \lbrack \int \limits_{t}^{T} R_s(\alpha_s, Z_s) ds \rbrack$, where $R_s: A_z \times Z_s \to \mathbb{R}$ is the reward that can be attained at the moment $s$, when the system is in state $Z_s$ and action $\alpha_s$ is taken.

Let us decompose the total reward into the instantaneous (running) reward $f(\alpha_s, Z_s)$ at step $s$ and the terminal reward $g(Z_T)$: $\int \limits_{t}^{T} R_s ds = \int \limits_{t}^{T} f(\alpha_s, Z_s) ds + g(Z_T)$.

$V(t, z) = \sup_{\alpha \in \mathbb{A}} \mathbb{E} \lbrack \int \limits_{t}^{T} f(\alpha_s, Z_s) ds + g(Z_T) \rbrack$

Let us make explicit the fact that the value function depends on the action we take at each moment $s$ according to our policy (strategy).

Introduce an action-value function $Q_t(z, \alpha)$, reflecting the return we get upon taking action $\alpha$ in state $z$ at time $t$.

The optimal strategy $A_s \in \mathbb{A}$ produces the optimal action-value function $Q^*_t(z, \alpha)$, which is tied to the value function by $V(t, z) = \max_{\alpha} Q^*_t(z, \alpha)$ and satisfies:

$Q^*_t(z_t, \alpha_t) = f(\alpha_t, z_t) + \gamma \mathbb{E}_{p(z_{t+1} | z_t, \alpha_t)} \lbrack \max_{\alpha_{t+1}} Q^*_{t+1}(z_{t+1}, \alpha_{t+1}) | z_t, \alpha_t \rbrack$, where $\gamma$ is a discount factor for delayed reward.

This recursion is Bellman's equation.
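To make the recursion concrete, here is a minimal sketch of backward induction on a toy discrete-time, finite-horizon MDP. The two states, two actions, transition probabilities, rewards and horizon are all hypothetical and have nothing to do with the market-making problem; the point is only to show how $Q^*_t$ and $V$ are filled in from the terminal reward $g$ backwards.

```python
# Hypothetical toy MDP: 2 states, 2 actions, 3 time steps.
# P[z][a] is a list of (probability, next_state); f[z][a] is the running reward.
P = {
    0: {0: [(1.0, 0)], 1: [(0.5, 0), (0.5, 1)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 0)]},
}
f = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}
g = {0: 0.0, 1: 1.0}  # terminal reward g(z)
GAMMA = 1.0           # discount factor
T = 3                 # horizon


def backward_induction(P, f, g, gamma, horizon):
    """Return Q[t][z][a] for t < horizon and V[t][z] for t <= horizon."""
    V = [dict() for _ in range(horizon + 1)]
    Q = [dict() for _ in range(horizon)]
    V[horizon] = dict(g)  # at the terminal time only g(Z_T) is left
    for t in range(horizon - 1, -1, -1):
        for z in P:
            Q[t][z] = {}
            for a in P[z]:
                # Expectation over the next state of max_a' Q_{t+1}(z', a').
                expected_next = sum(p * V[t + 1][z2] for p, z2 in P[z][a])
                Q[t][z][a] = f[z][a] + gamma * expected_next
            V[t][z] = max(Q[t][z].values())
    return Q, V


Q, V = backward_induction(P, f, g, GAMMA, T)
print(V[0])
```

Backward induction needs the transition probabilities explicitly, which is exactly what is usually unavailable in practice and what Q-learning avoids by sampling transitions instead.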

# Q-learning

![q-learning 2](img/q_learning2.png)

Here the epsilon-greedy strategy is a rule for choosing actions that balances exploration and exploitation: with probability $1 - \epsilon$ it exploits the best action found so far, and with probability $\epsilon$ it explores by picking a random action.
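As a rough illustration, the sketch below shows epsilon-greedy action selection together with the standard textbook tabular Q-learning update. It is not a transcription of the algorithm on the slide; the state labels, the action names `"bid"`, `"ask"`, `"hold"` and the learning rate `lr` are assumptions made for the example.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, actions, lr, gamma):
    """One tabular Q-learning step: move Q(s, a) towards the TD target."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += lr * (td_target - q[(state, action)])


# Hypothetical usage: unseen (state, action) pairs default to 0.
q = defaultdict(float)
actions = ["bid", "ask", "hold"]
rng = random.Random(0)

q_update(q, "s0", "bid", reward=1.0, next_state="s1",
         actions=actions, lr=0.5, gamma=0.9)
greedy = epsilon_greedy(q, "s0", actions, epsilon=0.0, rng=rng)
explored = epsilon_greedy(q, "s0", actions, epsilon=1.0, rng=rng)
print(q[("s0", "bid")], greedy, explored)
```

With `epsilon=0.0` the choice is purely greedy, and with `epsilon=1.0` it is purely random; in practice $\epsilon$ is often decayed over training.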

# State

$Z_t = (X_t, Y_t, a_t, b_t, na_t, nb_t, pa_t, pb_t, ra_t, rb_t)$, where:

* $X_t$ - cash held by the market maker
* $Y_t$ - inventory of the market maker
* $a_t = (a_1, ..., a_K)$ - bbo ask levels 1..K
* $b_t = (b_1, ..., b_K)$ - bbo bid levels 1..K
* $na_t = (na_1, ..., na_K)$ - ranks of the market maker's ask orders in the respective level's order queue
* $nb_t = (nb_1, ..., nb_K)$ - ranks of the market maker's bid orders in the respective level's order queue
* $pa_t$ - bbo ask price
* $pb_t$ - bbo bid price
* $ra_t = (ra_1, ..., ra_K)$ - market maker ask order positions
* $rb_t = (rb_1, ..., rb_K)$ - market maker bid order positions

![orders](img/orders.png)
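One possible way to hold this state in code is a small immutable record that can be flattened into a plain numeric vector, which is the form a distance-based method such as kNN needs. The class name, field names and field types below are assumptions made for illustration; the slides do not prescribe a representation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MarketMakerState:
    """Hypothetical container for Z_t; field order follows the list above."""
    cash: float                        # X_t
    inventory: float                   # Y_t
    ask_levels: Tuple[float, ...]      # a_t, levels 1..K
    bid_levels: Tuple[float, ...]      # b_t, levels 1..K
    ask_ranks: Tuple[float, ...]       # na_t, queue ranks of our ask orders
    bid_ranks: Tuple[float, ...]       # nb_t, queue ranks of our bid orders
    best_ask: float                    # pa_t
    best_bid: float                    # pb_t
    ask_positions: Tuple[float, ...]   # ra_t
    bid_positions: Tuple[float, ...]   # rb_t

    def as_vector(self) -> Tuple[float, ...]:
        """Flatten the state into one tuple of floats, in the order of Z_t."""
        return (
            float(self.cash), float(self.inventory),
            *map(float, self.ask_levels), *map(float, self.bid_levels),
            *map(float, self.ask_ranks), *map(float, self.bid_ranks),
            float(self.best_ask), float(self.best_bid),
            *map(float, self.ask_positions), *map(float, self.bid_positions),
        )


# Made-up example with K = 2 levels.
state = MarketMakerState(
    cash=1000.0, inventory=5,
    ask_levels=(10, 20), bid_levels=(15, 25),
    ask_ranks=(3, 0), bid_ranks=(1, 0),
    best_ask=100.5, best_bid=100.0,
    ask_positions=(1, 0), bid_positions=(1, 0),
)
vector = state.as_vector()
print(len(vector))
```

Making the record frozen keeps it hashable, so the same object could also serve as a key in a tabular Q-function.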

# k-Nearest Neighbours regression

A simple non-parametric method for regression/classification:

![knn](img/knn.png)
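The regression variant can be sketched in a few lines. The version below assumes Euclidean distance and an unweighted average of the $k$ nearest targets, which is the most common textbook choice and may differ from the variant in the figure; the training points are made up.

```python
import math


def knn_regress(train_x, train_y, query, k):
    """Predict the target at `query` as the mean target of its k nearest points."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


# Made-up one-dimensional data.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]

prediction = knn_regress(train_x, train_y, query=(1.2,), k=2)
print(prediction)
```

Sorting all points costs $O(n \log n)$ per query, which is fine for a sketch; for large training sets a spatial index such as a k-d tree is the usual replacement.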

# Algorithm

![algorithm](img/algorithm.png)