# The Markov Chain Model

The Markov chain model is the second model used in our thesis. It is significantly more complex than the simple probabilistic model, but it is still quite rudimentary. The model was developed by Hult and Kiessling in their paper *[Algorithmic trading with Markov Chains](https://www.researchgate.net/publication/268032734_ALGORITHMIC_TRADING_WITH_MARKOV_CHAINS)*.

In this model, the limit order book (LOB) is modelled explicitly. There are six event types:

> 1. Buy limit order 
> 2. Sell limit order
> 3. Cancel buy order
> 4. Cancel sell order
> 5. Buy market order
> 6. Sell market order

The arrival of an event results in a state transition in the Markov chain. The transition rates are described in our [report](deadlink). An example of how the arrival of different events affects the LOB is shown in the image below.

<div>
    <img src="images/LOBDynamics.png" width=800/>
</div>
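
To make the effect of the six event types concrete, here is a minimal sketch of a toy order book that stores the resting volume at each integer price level (in ticks). The class `ToyLOB`, the event constants and the `apply` method are hypothetical names used only for illustration; this is not the implementation from the thesis code, and it ignores transition rates, order priority and limit orders that cross the spread.

```python
from collections import defaultdict

# Hypothetical labels for the six event types listed above.
BUY_LIMIT, SELL_LIMIT, CANCEL_BUY, CANCEL_SELL, BUY_MARKET, SELL_MARKET = range(1, 7)


class ToyLOB:
    """Toy limit order book: resting volume per integer price level (in ticks)."""

    def __init__(self):
        self.bids = defaultdict(int)  # price -> resting buy volume
        self.asks = defaultdict(int)  # price -> resting sell volume

    def best_bid(self):
        return max((p for p, v in self.bids.items() if v > 0), default=None)

    def best_ask(self):
        return min((p for p, v in self.asks.items() if v > 0), default=None)

    def apply(self, event, price=None, size=1):
        """Update the book for one event; market orders need no price."""
        if event == BUY_LIMIT:
            self.bids[price] += size
        elif event == SELL_LIMIT:
            self.asks[price] += size
        elif event == CANCEL_BUY:
            self.bids[price] = max(self.bids[price] - size, 0)
        elif event == CANCEL_SELL:
            self.asks[price] = max(self.asks[price] - size, 0)
        elif event == BUY_MARKET:
            self._consume(self.asks, self.best_ask, size)
        elif event == SELL_MARKET:
            self._consume(self.bids, self.best_bid, size)
        else:
            raise ValueError(f"unknown event type: {event}")

    @staticmethod
    def _consume(side, best_price, size):
        # Walk the book from the best price until the order is filled
        # or that side of the book is empty.
        while size > 0:
            price = best_price()
            if price is None:
                break
            filled = min(size, side[price])
            side[price] -= filled
            size -= filled


lob = ToyLOB()
lob.apply(BUY_LIMIT, price=99, size=5)
lob.apply(SELL_LIMIT, price=101, size=5)
lob.apply(SELL_LIMIT, price=102, size=3)
lob.apply(BUY_MARKET, size=6)  # takes 5 at 101, then 1 at 102
print(lob.best_bid(), lob.best_ask())  # 99 102
```

Each call to `apply` corresponds to one state transition of the chain: limit orders add volume, cancellations remove it, and market orders consume volume from the opposite side starting at the best price.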


As in the simple probabilistic model, we also have:

> * The time _t_ can take integer values between _0_ and _T_.
>
> * The market maker can put the bid and ask depths at *max\_quote\_depth* different levels, from _1_ to *max\_quote\_depth* ticks away from the best ask and best bid price respectively.
>
> * The cash process _X<sub>t</sub>_ denotes the market maker's cash at time _t_.
>
> * The inventory process _Q<sub>t</sub>_ denotes the market maker's inventory at time _t_.
>
> * The market maker can see the current time _t_, its inventory _Q<sub>t</sub>_ and the *full LOB* before taking an action.
>
> * At time _T_ the market maker is forced to liquidate its position.

The _tick_ is the smallest price increment of the underlying, for instance $0.01 for AAPL.

## The deep reinforcement learning

After that short introduction, it's time for some deep reinforcement learning in the form of DDQN.
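
As a reminder of what sets DDQN apart from plain DQN, the sketch below computes the learning target for a single transition, assuming DDQN here refers to double deep Q-learning: the online network picks the next action and the target network evaluates it. The function `double_dqn_target` and the discount factor `gamma` are illustrative assumptions and are not taken from the imported training file.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target for one transition.

    q_online_next / q_target_next are the action values of the next state
    under the online and the target network. gamma is an assumed discount.
    """
    if done:
        return reward
    # The online network selects the action ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... and the target network evaluates it.
    return reward + gamma * q_target_next[best_action]


target = double_dqn_target(
    reward=1.0,
    done=False,
    q_online_next=[0.25, 1.0, 0.5],   # online network prefers action 1
    q_target_next=[0.25, 0.5, 1.0],   # target network values action 1 at 0.5
    gamma=0.5,
)
print(target)  # 1.25
```

A plain DQN target would instead take the maximum of `q_target_next`, giving 1.5 in this example; decoupling selection from evaluation is what reduces the overestimation of action values.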

We start by importing the needed files.

In [1]:
# import the Q-learning file for the markov chain model
from mc_model_mm_deep_rl_batch import *




Now we have to decide on the parameters we want to use for the environment and the hyperparameters we want to use for the DDQN.

There are some additional parameters in the Markov chain model, see the code snippet below for an explanation of them.

In [2]:
model_params = {
                "dt": 1,                    # the length of the time steps
                "T": 100,                   # the length of the episode
                "num_levels": 10,           # how many depth levels are included in the LOB
                "default_order_size": 5,    # the size of the orders the MM places
                "max_quote_depth": 5,       # how deep the MM can put its quotes
                "reward_scale": 0.1,        # a factor all rewards are multiplied by
                "randomize_reset": True     # whether a random LOB state is chosen at the start of every episode
}

We now have to decide which hyperparameter values we want to use.

In [3]:
hyperparams = {
                "n_train": int(1e5),    # the number of steps the agents will be trained for
                "n_test": int(1e2),     # the number of episodes the agents will be evaluated for
                "n_runs": 4             # the number of agents that will be trained
}

DDQN_params = {
                # network params
                "hidden_size": 64,                                          # the hidden size of the network
                "buffer_size": hyperparams["n_train"] // 200,               # the size of the experience replay buffer (integer)
                "replay_start_size": hyperparams["n_train"] // 200,         # the number of steps before experience replay starts
                "target_update_interval": hyperparams["n_train"] // 100,    # how often the target network is updated
                "update_interval": 2,                                       # how often the online network is updated
                "minibatch_size": 16,                                       # the size of the minibatches used

                # epsilon greedy (linear decay)
                "exploration_initial_eps": 1,                               # the starting value of the exploration rate
                "exploration_final_eps": 0.05,                              # the final value of the exploration rate
                "exploration_fraction": 0.5,                                # when the final value is reached

                # learning rate
                "learning_rate_dqn": 1e-4,                                  # the learning rate used (Adam)
                
                # other params
                "num_envs": 10,                                             # how many parallelized environments
                "n_train": hyperparams["n_train"], 
                "n_runs": hyperparams["n_runs"],
                "reward_scale": model_params["reward_scale"],

                # logging params
                "log_interval": hyperparams["n_train"] // 100,              # the frequency of saving information
                "num_estimate": 10000,                                      # how many states that should be used for estimating q_values
                "n_states": 10                                              # the number of states heatmaps are averaged over
                
}
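
The three epsilon-greedy parameters describe a linear decay of the exploration rate. The sketch below shows one way to read them, assuming that `exploration_fraction` is the fraction of `n_train` over which epsilon falls from its initial to its final value; the helper `linear_epsilon` is hypothetical and not part of the training code.

```python
def linear_epsilon(step, n_train, initial_eps=1.0, final_eps=0.05, fraction=0.5):
    """Exploration rate at a given training step under linear decay."""
    decay_steps = fraction * n_train  # assumed: the decay ends after this many steps
    if step >= decay_steps:
        return final_eps
    return initial_eps + (final_eps - initial_eps) * step / decay_steps


n_train = int(1e5)
for step in (0, 25_000, 50_000, 100_000):
    print(step, round(linear_epsilon(step, n_train), 4))
```

Under this reading, the agent explores with epsilon 1 at the start, about 0.525 after 25,000 steps, and stays at 0.05 for the second half of training.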

For this model, simulating the environment is the bottleneck, so it runs faster on a CPU than on a GPU.

In [4]:
gpu = -1    # -1: train on the CPU instead of a GPU

Finally we decide where to save our results.

In [5]:
# naming the folder where the results will be saved
folder_name = "mc_deep_example"

outdir = "results/mc_model_deep/" + folder_name + "/"

We're now ready for the deep reinforcement learning!

In [6]:
train_multiple_agents_batch(
    DDQN_params, 
    model_params, 
    hyperparams["n_train"], 
    outdir, 
    hyperparams["n_runs"], 
    gpu=gpu
)

RUN 1 IN PROGRESS...
	Step 20000 (20%), 0:03:35.280000 remaining of this run
	Step 40000 (40%), 0:02:44.930000 remaining of this run
	Step 60000 (60%), 0:01:50.320000 remaining of this run
	Step 80000 (80%), 0:00:55.200000 remaining of this run
	Step 100000 (100%), 0:00:00 remaining of this run
...FINISHED IN 0:04:39.860000
0:13:59.580000 REMAINING OF THE TRAINING
RUN 2 IN PROGRESS...
	Step 20000 (20%), 0:03:29.910000 remaining of this run
	Step 40000 (40%), 0:02:44.750000 remaining of this run
	Step 60000 (60%), 0:01:50.650000 remaining of this run
	Step 80000 (80%), 0:00:55.420000 remaining of this run
	Step 100000 (100%), 0:00:00 remaining of this run
...FINISHED IN 0:04:38.600000
0:09:17.200000 REMAINING OF THE TRAINING
RUN 3 IN PROGRESS...
	Step 20000 (20%), 0:03:30.560000 remaining of this run
	Step 40000 (40%), 0:02:44.620000 remaining of this run
	Step 60000 (60%), 0:01:50.940000 remaining of this run
	Step 80000 (80%), 0:00:55.370000 remaining of this run
	Step 100000 (100%), 

## Evaluating the strategies

Now that the training is complete, we can continue with evaluating the agents.

In [7]:
evaluate_DDQN_batch(
    outdir, 
    n_test=hyperparams["n_test"],                  
    Q=10,       # how many depths that should be displayed in the heatmaps
    randomize_start=model_params["randomize_reset"]
)

PLOTTING TRAINING...
PLOTTING STRATEGIES...
EVALUATING AGENTS...
EVALUATING BENCHMARKS...
	best agent
	mean agent
	constant strategy
	random_strategy
VISUALIZING THE STRATEGIES...



We can now have a look at the images that were saved when running *evaluate\_DDQN\_batch*.

Let's first have a look at the reward, the estimated state-value at (0,0) and the loss during training.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/training_graph.png"/>
</div>

In this image, it looks like the algorithm hasn't converged. Indeed, it has to be trained for much longer. It probably also needs hyperparameter tuning.

We can also have a look at the learnt strategies. The figure below shows the learnt bid depths.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/bid_heat_randomized_10.png" width="500"/>
</div>

We can also compare the average rewards of the DDQN strategies with those of the benchmark strategies. These are displayed in the box plot below.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/box_plot_benchmarking.png"/>
</div>

We can also view these results in table form.


In [8]:
# print the saved benchmarking table; the with-block closes the file automatically
with open("results/mc_model_deep/mc_deep_example/image_folder/table_benchmarking") as f:
    print(f.read())

strategy           mean reward    std reward    reward per action    reward per second
---------------  -------------  ------------  -------------------  -------------------
constant (d=1)         -0.0022     0.165654             -2.2e-05             -2.2e-05
random                 -0.0054     0.0662634            -5.4e-05             -5.4e-05
DDQN (best run)         0.0166     0.0549767             0.000166             0.000166
DDQN (mean)             0.0055     0.070461              5.5e-05              5.5e-05
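
The last two columns appear to follow from the mean reward by simple division: with `T = 100` and `dt = 1`, an episode has 100 actions and lasts 100 time units, so both columns equal the mean reward divided by 100. This relation is inferred from the numbers in the table rather than from the evaluation code, and the helper `per_step_rewards` below is a hypothetical name.

```python
def per_step_rewards(mean_reward, T=100, dt=1):
    """Return (reward per action, reward per second) for one episode.

    Assumes one action per time step of length dt and an episode of length T,
    with time measured in seconds.
    """
    n_actions = T / dt
    return mean_reward / n_actions, mean_reward / T


mean_rewards = {
    "constant (d=1)": -0.0022,
    "random": -0.0054,
    "DDQN (best run)": 0.0166,
    "DDQN (mean)": 0.0055,
}
for name, reward in mean_rewards.items():
    per_action, per_second = per_step_rewards(reward)
    print(f"{name:<16} {per_action:.3g} {per_second:.3g}")
```

Because `dt = 1`, the two columns coincide; with a different step length they would differ by a factor of `dt`.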


We can also have a look at how the mean strategy and the individual strategies act. The figures below show the average inventory, cash and value processes of the different strategies.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/visualization_mean.png"/>
</div>

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/visualization_all.png"/>
</div>