# The Simple Probabilistic Model

The simple probabilistic model (SPM) is the first model used in our thesis. It is a discretization of the model presented in Chapter 10.2 of Cartea et al.'s book _Algorithmic and High-Frequency Trading_.

It can be summarized as follows (for the full definition, see our [report](deadlink)):
* The time _t_ can take values between 0 and _T_
* The midprice S<sub>t</sub> is a Brownian motion rounded to the closest tick
* The market maker can put the bid and ask depths at _d_ different levels, from 0 to _d_ - 1 ticks away from the midprice
* The cash process X<sub>t</sub> denotes the market maker's cash at time _t_
* The inventory process Q<sub>t</sub> denotes the market maker's inventory at time _t_
* At time _T_ the market maker is forced to liquidate its position

The _tick_ is the smallest price increment of the underlying, for instance $0.01 for AAPL.
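
To make the midprice bullet concrete, here is a minimal sketch of a Brownian motion rounded to the closest tick. It is not the code used in the thesis: the function name `simulate_midprice`, the starting price `s0`, the volatility `sigma` and the use of unit time steps are all assumptions made for illustration, while `T` and `dp` mirror the parameter names used later in this notebook.

```python
import numpy as np

def simulate_midprice(s0=100.0, sigma=0.01, T=20, dp=0.01, seed=0):
    """Simulate a midprice at t = 0, 1, ..., T as a Brownian motion rounded to the tick grid.

    s0, sigma and seed are hypothetical illustration values, not thesis parameters.
    """
    rng = np.random.default_rng(seed)
    # Brownian increments over unit time steps are N(0, sigma^2)
    increments = rng.normal(loc=0.0, scale=sigma, size=T)
    # Unrounded path, including the starting value at t = 0
    path = s0 + np.concatenate(([0.0], np.cumsum(increments)))
    # Round every value to the closest multiple of the tick size dp
    return np.round(path / dp) * dp

prices = simulate_midprice()
```

Rounding the continuous path, rather than rounding each increment, keeps the rounding error from accumulating over the episode; whether the thesis discretizes in exactly this way is not stated here.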

Based on this definition, an analytically optimal strategy can be derived, against which we want to compare strategies learnt with Q-learning. An example of the optimal bid depths for a specific set of model parameters is shown in the figure below. **Note** that these depths are _not_ discretized.

![OptimalBidDepths](images/ContinuousBid30.png)


# The Q-learning

After that short introduction, it's time for some reinforcement learning in the form of Q-learning.
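
As a reminder of what tabular Q-learning does, the sketch below shows the standard update rule together with epsilon-greedy action selection. It is a generic illustration, not the implementation in `simple_model_evaluation`: the function names, the integer state indexing and the default `gamma=1.0` are assumptions.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def q_update(Q, state, action, reward, next_state, alpha, gamma=1.0, terminal=False):
    """Apply one tabular Q-learning update in place and return the table."""
    # The bootstrap term is dropped when the episode ends at next_state
    target = reward if terminal else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
    return Q

# Hypothetical toy table with 3 states and 2 actions
Q = np.zeros((3, 2))
q_update(Q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5)
```

In the market-making setting the state would encode quantities such as time and inventory, and the action would be the chosen pair of bid and ask depths.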

We start by importing the needed files.

In [1]:
# import the Q-learning file for the simple probabilistic model
from simple_model_evaluation import *




Now we have to decide on the parameters we want to use for the environment and the hyperparameters we want to use for the Q-learning.

In [2]:
model_params = {
                "d": 4,         # the number of different depths that can be chosen from
                "T": 20,        # the length of the episode
                "dp": 0.01,     # the tick size
                "min_dp": 0,    # the minimum number of ticks from the midprice at which quotes may be placed
                "phi": 1e-4     # the running inventory penalty
}


The Q-learning parameters below use three suffixes:
* _\_start_ is the starting value of the parameter
* _\_end_ is the final value of the parameter
* _\_cutoff_ is the fraction of the training after which the final value is reached, e.g. 0.5 means after 50% of the training

In [3]:
Q_learning_params = {
        # epsilon-greedy values (linear decay)
        "epsilon_start": 1,
        "epsilon_end": 0.05,
        "epsilon_cutoff": 0.5,

        # learning-rate values (exponential decay)
        "alpha_start": 0.5,
        "alpha_end": 0.001,
        "alpha_cutoff": None,

        # exploring starts values (linear decay)
        "beta_start": 1,
        "beta_end": 0.05,
        "beta_cutoff": 0.5,
        "exploring_starts": True
}

hyperparams = {
        "n_train" : 1e5,
        "n_test" : 1e4,
        "n_runs" : 4
}
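
The decay schedules named in the comments above can be sketched as follows. This is one plausible reading, not the code in `simple_model_evaluation`: the function names are hypothetical, `progress` is assumed to be the fraction of training completed, and a cutoff of `None` is assumed to mean that the exponential decay runs over the whole training.

```python
def linear_decay(start, end, cutoff, progress):
    """Move linearly from start to end, reaching end once progress hits cutoff.

    progress and cutoff are fractions of the training, both in [0, 1].
    """
    if progress >= cutoff:
        return end
    return start + (end - start) * progress / cutoff

def exponential_decay(start, end, progress):
    """Interpolate geometrically from start (progress 0) to end (progress 1)."""
    return start * (end / start) ** progress

# Hypothetical usage with the values from Q_learning_params
epsilon_quarter = linear_decay(1, 0.05, 0.5, progress=0.25)
alpha_halfway = exponential_decay(0.5, 0.001, progress=0.5)
```

With these values epsilon is about 0.525 a quarter of the way through training and stays at 0.05 for the second half, while alpha shrinks by a constant factor per episode.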

Finally we decide where to save our results.

In [4]:
# naming the folder where the results will be saved
folder_mode = True
folder_name = "spm_example"
save_mode = True

We're now ready for the Q-learning!

In [5]:
Q_learning_comparison(
    **hyperparams,
    args=model_params,
    Q_learning_args=Q_learning_params,
    folder_mode = folder_mode,
    folder_name = folder_name,
    save_mode = save_mode
)

RUN 1 IN PROGRESS...
	Episode 20000 (20%), 0:02:15.900000 remaining of this run
	Episode 40000 (40%), 0:01:31.910000 remaining of this run
	Episode 60000 (60%), 0:00:55.660000 remaining of this run
	Episode 80000 (80%), 0:00:26.240000 remaining of this run
	Episode 100000 (100%), 0:00:00 remaining of this run
THE FOLDER spm_example ALREADY EXISTS
...FINISHED IN 0:02:06.430000
0:06:19.300000 REMAINING OF THE TRAINING
RUN 2 IN PROGRESS...
	Episode 20000 (20%), 0:02:24.510000 remaining of this run
	Episode 40000 (40%), 0:01:40.060000 remaining of this run
	Episode 60000 (60%), 0:01:01.650000 remaining of this run
	Episode 80000 (80%), 0:00:29.790000 remaining of this run
	Episode 100000 (100%), 0:00:00 remaining of this run
THE FOLDER spm_example ALREADY EXISTS
...FINISHED IN 0:02:26.520000
0:04:53.050000 REMAINING OF THE TRAINING
RUN 3 IN PROGRESS...
	Episode 20000 (20%), 0:02:18.700000 remaining of this run
	Episode 40000 (40%), 0:01:35.560000 remaining of this run
	Episode 60000 (60%),

# Evaluating the strategies

We can now have a look at the images that were saved when running _Q\_learning\_comparison_.

Let's first have a look at the reward and the state-value at (0,0) during training.

![TrainingGraph](results/simple_model/spm_example/results_graph.png)

In this image it looks like the Q-learning has converged. It has not, however; it has to be trained for much longer.

We can also have a look at the learnt strategies. The figure below shows the learnt bid depths.

![Q-learningBidDepths](results/simple_model/spm_example/opt_bid_strategy.png)

We can also compare the average rewards of the Q-learning strategies with those of the benchmark strategies. These are displayed in the boxplot below.

![BoxplotsRun](results/simple_model/spm_example/box_plot_benchmarking.png)

There are many more figures and tables to explore in the [spm_example](https://github.com/KodAgge/Reinforcement-Learning-for-Market-Making/tree/main/code/results/simple_model/spm_example) folder.
