<h1><center> Lab Session n°2: RL for stochastic control problems </center></h1>



<h2> 📌 Objectives: </h2>

In the **first part** of the **lab session**, you will solve two stochastic control problems in finance: the Markowitz portfolio optimization problem and the market impact problem. You will explore and implement the algorithms introduced in the lecture using Python and its scientific libraries.

The **second part** of the **lab session** will be devoted to mathematical questions on solving linear quadratic control problems.


<h2> 📚 Goal of the Lab: </h2>

By the end of this lab, you will be able to:

- Understand and implement some **reinforcement learning algorithms** to solve stochastic control problems arising in finance.

- Derive explicitly the **optimal policy** and **value function** of **linear-quadratic** control problems in continuous time.


<h2> 🗂️ Lab Structure and assignments: </h2>

This notebook is organized into the following sections:


**1. [On the Markowitz problem](#Markowitz-Applications)**  

&nbsp;&nbsp;&nbsp;&nbsp;1.1 [Problem formulation](#Markowitz-problemformulation)

&nbsp;&nbsp;&nbsp;&nbsp;1.2 [Optimal Q-function](#Markowitz-Q-function)



**2. [On the Market impact problem](#MarketImpact-Applications)**

&nbsp;&nbsp;&nbsp;&nbsp;2.1 [Problem formulation](#MarketImpact-problemformulation)

&nbsp;&nbsp;&nbsp;&nbsp;2.2 [Optimal Q-function](#MarketImpact-Q-function)


**3. [Mathematical questions](#Mathematical-Questions)**

&nbsp;&nbsp;&nbsp;&nbsp; [Linear quadratic case in continuous time](#Mathematical-Questions-LQ-case)




**4. [References](#references)**  


This lab includes **mathematics** and/or **coding** questions, indicated by ❓. **Your answers**, indicated by ✏️, will count toward your final grade for the course, with a weight relative to the project to be determined later. Note that the project will have a significantly higher weight in the final grade.


**Mathematics Questions**

- You can answer directly in the **Jupyter notebook** using LaTeX (compatible with Markdown).


**Coding Questions**

-  Complete the corresponding code sections **directly in the notebook**.
-  **Code readability**, **quality**, and **clarity of comments** will be taken into account in the **grading**.


If you choose this lab, you will have to send your work by e-mail to [samy.mekkaoui@polytechnique.edu](mailto:samy.mekkaoui@polytechnique.edu). The submission deadline will be announced later during the course.



<h2> ℹ️ Other information: </h2>




- **Key References**: If you want to go deeper on the use of RL methods for solving stochastic control problems in finance, you can look at the section [References](#references). <br> <br>



- **Contact**: If you find any mistakes in this notebook, or have any other feedback or questions, please feel free to e-mail me at [samy.mekkaoui@polytechnique.edu](mailto:samy.mekkaoui@polytechnique.edu).


<a id=Markowitz-Applications></a>

<h1> <center> 1: The Markowitz Problem  </center> </h1>



<a  id=Markowitz-problemformulation></a>

<h2> 1.1 Problem formulation : </h2>





In [None]:
# Import Packages

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn


<a id=Markowitz-Q-function></a>

<h2> 1.2 Optimal Q-function : </h2>

<a id=MarketImpact-Applications></a>

<h1> <center> 2: On the Market Impact Problem  </center> </h1>



<a  id=MarketImpact-problemformulation></a>

<h2> 2.1 Problem formulation : </h2>





<a id=MarketImpact-Q-function></a>

<h2> 2.2 Optimal Q-function : </h2>





<a id=Mathematical-Questions></a>
<h1> <center> 3. Mathematical Questions </center> </h1>

We recall that given a policy $\pi = (\pi_s)_{t \leq s \leq T}$, i.e., a $\mathcal{P}(A)$-valued map,  the value function $V^{\pi}$ of the stochastic control problem is defined as

$$
\begin{align}
V^{\pi}(t,x) &= \mathbb{E}_{\pi} \Big[ \int_t^T \big( f(X_s,\alpha_s) + \lambda \mathcal{E}(\pi(s,X_s)) \big) ds + g(X_T) \,\Big|\, X_t = x \Big] \\
&= \mathbb{E}_{\pi} \Big[ \int_t^T \big( f(X_s,\alpha_s) - \lambda \log \pi(s,X_s,\alpha_s) \big) ds + g(X_T) \,\Big|\, X_t = x \Big]
\end{align}
$$
where the controlled state process is given for $\alpha=(\alpha_s)_{t \leq s \leq T} \sim \pi$ by 
$$
\begin{align}
\begin{cases}
dX_s &= b(X_s, \alpha_s) ds + \sigma(X_s, \alpha_s) dW_s, \quad s \in [t,T], \notag \\
X_t &= x, \notag
\end{cases}
\end{align}
$$
and the optimal value function is given by

$$
\begin{align}
V(t,x) &= \sup_{\pi} V^{\pi}(t,x). \notag 
\end{align}
$$
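To make the definition of $V^{\pi}$ concrete, the following is a minimal sketch of a Monte Carlo estimate based on an Euler-Maruyama discretization of the state dynamics. Everything specific here is an assumption made for illustration and is not taken from the lecture: a scalar state, a linear drift, a constant volatility, concave quadratic rewards, and a Gaussian policy $\pi(\cdot \mid t, x) = \mathcal{N}(-kx, s^2)$ with hypothetical parameters `k` and `s`. Actions are resampled independently at every time step, which is only a discrete-time approximation of the randomized control.

```python
import numpy as np


def estimate_value(x0, t=0.0, T=1.0, n_steps=100, n_paths=20_000, lam=0.1, seed=0):
    """Monte Carlo estimate of V^pi(t, x0) for a toy 1-D problem.

    All coefficients below are hypothetical and chosen for illustration only.
    Returns the estimate and its Monte Carlo standard error.
    """
    rng = np.random.default_rng(seed)
    dt = (T - t) / n_steps

    # Hypothetical model: dX = (B X + C a) ds + vol dW
    B, C, vol = -0.5, 1.0, 0.3
    # Hypothetical rewards: f(x, a) = -(q x^2 + n a^2), g(x) = -p x^2
    q, n, p = 1.0, 0.5, 1.0
    # Hypothetical Gaussian policy: a ~ N(-k x, s^2)
    k, s = 0.8, 0.4

    x = np.full(n_paths, x0, dtype=float)
    total = np.zeros(n_paths)
    for _ in range(n_steps):
        mean = -k * x
        a = mean + s * rng.standard_normal(n_paths)
        # Log-density of the Gaussian policy evaluated at the sampled action
        log_pi = -0.5 * ((a - mean) / s) ** 2 - np.log(s * np.sqrt(2.0 * np.pi))
        # Running reward plus entropy bonus: (f - lam * log pi) ds
        total += (-(q * x**2 + n * a**2) - lam * log_pi) * dt
        # Euler-Maruyama step for the controlled state
        dW = np.sqrt(dt) * rng.standard_normal(n_paths)
        x = x + (B * x + C * a) * dt + vol * dW
    total += -p * x**2  # terminal reward g(X_T)
    return total.mean(), total.std(ddof=1) / np.sqrt(n_paths)


value, stderr = estimate_value(x0=1.0)
print(f"V^pi(0, 1.0) ~ {value:.4f} +/- {stderr:.4f}")
```

The estimator uses the second expression for $V^{\pi}$, with $-\lambda \log \pi$ evaluated at the sampled action, so it needs only the log-density of the policy and not its entropy in closed form.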



  ❓ **Question 3.1**:  Given a policy $\pi$, recall the Bellman relation for $V^{\pi}$ and the Bellman optimality principle for $V$.

  ❓ **Question 3.2**:  Apply Itô's formula to the process $V^{\pi}(s,X_s)$ between $t$ and $t+h$ for $h > 0$ and show that $V^{\pi}$ satisfies the following linear PDE:

$$
\begin{align}
\begin{cases}
\frac{\partial V^{\pi}}{\partial t} (t,x) + \int_{A} \big[ H(x,a, \nabla_x V^{\pi}(t,x), D_x^2 V^{\pi}(t,x)) - \lambda \text{log}(\pi(t,x,a)) \big] \pi(t,x,a)\nu(da) &= 0,\notag \\
V^{\pi}(T,x) &= g(x),
\end{cases}
\end{align}
$$
where the map $H$ is defined as
$$
\begin{align}
H(x,a,p,M) = b(x,a) \cdot p + \frac{1}{2} \text{tr}(\sigma \sigma^{\top}(x,a) M) + f(x,a), \notag
\end{align}
$$
for $x \in \mathbb{R}^d$, $a \in A$, $p \in \mathbb{R}^d$, and $M \in \mathbb{R}^{d \times d}$.


❓ **Question 3.3**:  Deduce the Bellman equation satisfied by the optimal value function $V$:

$$
\begin{align}
\begin{cases}
\frac{\partial V}{\partial t} (t,x) + \sup_{\pi \in \mathcal{P}(A)} \int_{A} \big[ H(x,a, \nabla_x V(t,x), D_x^2 V(t,x)) - \lambda \log(\pi(a)) \big] \pi(a)\nu(da) &= 0,\notag \\
V(T,x) &= g(x),
\end{cases}
\end{align}
$$

❓ **Question 3.4**:  Recall from the course the form of the optimal randomized policy $\pi^{\star}$ in terms of the value function $V$ and show that it leads to the following form of the Bellman equation for $V$:

$$
\begin{align}
\begin{cases}
\frac{\partial V}{\partial t} (t,x) + \lambda \log \bigg[ \int_{A} \exp \Big( \frac{1}{\lambda} H(x,a, \nabla_x V(t,x), D^2_x V(t,x)) \Big) \nu(da) \bigg] &= 0,\notag \\
V(T,x) &= g(x),
\end{cases}
\end{align}
$$
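As a numerical sanity check (not a proof) of the step from the supremum over densities to the log-integral form, the sketch below discretizes the action space on a grid and compares three quantities: the entropy-regularized objective evaluated at the Gibbs density proportional to $\exp(h/\lambda)$, the value $\lambda \log \int_A \exp(h(a)/\lambda)\,\nu(da)$, and the objective at a perturbed density. The function `h` stands in for $a \mapsto H(x,a,\nabla_x V, D_x^2 V)$ at a fixed $(t,x)$; its concave quadratic form, the grid, and the value of $\lambda$ are assumptions made for illustration.

```python
import numpy as np


def regularized_objective(pi, h, lam, w):
    """Discretized int [h(a) - lam * log pi(a)] pi(a) nu(da) with quadrature weights w."""
    return np.sum((h - lam * np.log(pi)) * pi * w)


def gibbs_density(h, lam, w):
    """Density proportional to exp(h / lam), normalized with respect to the weights w."""
    z = np.exp((h - h.max()) / lam)  # shift by the max for numerical stability
    return z / np.sum(z * w)


def log_partition(h, lam, w):
    """lam * log int exp(h(a) / lam) nu(da), computed in a numerically stable way."""
    m = h.max()
    return m + lam * np.log(np.sum(np.exp((h - m) / lam) * w))


# Hypothetical setup: actions on a uniform grid, h concave in a
a = np.linspace(-3.0, 3.0, 601)
w = np.full_like(a, a[1] - a[0])  # uniform quadrature weights
h = 1.0 + 0.7 * a - 1.5 * a**2
lam = 0.5

pi_star = gibbs_density(h, lam, w)
best = regularized_objective(pi_star, h, lam, w)
closed_form = log_partition(h, lam, w)

# A randomly perturbed, renormalized density should achieve a smaller objective
rng = np.random.default_rng(0)
other = pi_star * np.exp(0.3 * rng.standard_normal(a.size))
other /= np.sum(other * w)
worse = regularized_objective(other, h, lam, w)

print(best, closed_form, worse)
```

On the grid, the gap between the optimal objective and the objective at any other density equals $\lambda$ times their Kullback-Leibler divergence, which is why the perturbed density can only do worse.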



<a id=Mathematical-Questions-LQ-case></a>

<h2> Linear quadratic case in continuous time : </h2>



Suppose that the coefficients of the state dynamics and the reward functions are given by
$$\begin{align}
\begin{cases}
b(x,a) &= Bx + Ca, \notag \\
\sigma(x,a) &= Dx + Fa, \notag \\
f(x,a) &=  x^{\top} Q x + a^{\top} N a , \notag \\
g(x) &= x^{\top} P x, \notag
\end{cases}
\end{align}
$$
with $x \in \mathbb{R}^d$, $a \in \mathbb{R}^m$, and where $B$, $C$, $D$, $F$, $Q$, $N$, and $P$ are matrices of appropriate dimensions, with $Q$, $N$, and $P$ symmetric positive definite.
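For experimentation, the map $H$ can be evaluated numerically with these coefficients. The sketch below is only one possible reading of the dimensions: it assumes a scalar Brownian motion, so that $\sigma(x,a) = Dx + Fa$ is a vector in $\mathbb{R}^d$ with $D \in \mathbb{R}^{d \times d}$ and $F \in \mathbb{R}^{d \times m}$. The helper name `make_lq_hamiltonian` and all numerical values are hypothetical.

```python
import numpy as np


def make_lq_hamiltonian(B, C, D, F, Q, N):
    """Return H(x, a, p, M) for the linear-quadratic coefficients.

    Assumes a scalar Brownian motion, so sigma(x, a) is a vector of size d.
    """

    def b(x, a):
        return B @ x + C @ a

    def sigma(x, a):
        return D @ x + F @ a

    def f(x, a):
        return x @ Q @ x + a @ N @ a

    def H(x, a, p, M):
        s = sigma(x, a)
        # H = b . p + 1/2 tr(sigma sigma^T M) + f
        return b(x, a) @ p + 0.5 * np.trace(np.outer(s, s) @ M) + f(x, a)

    return H


# Hypothetical example with d = 2 and m = 1
rng = np.random.default_rng(0)
d, m = 2, 1
B, D = rng.standard_normal((d, d)), rng.standard_normal((d, d))
C, F = rng.standard_normal((d, m)), rng.standard_normal((d, m))
Q, N = np.eye(d), np.eye(m)

H = make_lq_hamiltonian(B, C, D, F, Q, N)
x, a = rng.standard_normal(d), rng.standard_normal(m)
p = rng.standard_normal(d)
M = rng.standard_normal((d, d))
M = 0.5 * (M + M.T)  # symmetric second-order argument
print(H(x, a, p, M))
```

Because $\sigma$ is a vector under this assumption, the trace term reduces to the quadratic form $\tfrac{1}{2}\sigma^{\top} M \sigma$, which can be used as a quick consistency check of the implementation.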


❓ **Question 3.5**:  Make the ansatz that the optimal value function is quadratic in the state variable, i.e., of the form

$$
\begin{align}
V(t,x) = x^{\top} K(t) x + R(t), \notag 
\end{align}
$$
for some deterministic functions $K : [0,T] \rightarrow \mathbb{S}_{+}^d$ and $R : [0,T] \rightarrow \mathbb{R}$. 

- Show that the map $H$ is given by the form in the course (Slide 59 of Lecture 1).
- Show using Question 3.4 that the Bellman equation for $V$ is satisfied if $K$ and $R$ satisfy the system of ODEs given in the course (Slide 60 of Lecture 1).
- Show that the optimal feedback control policy $\pi^{\star}$ is given by a Gaussian distribution with mean and covariance matrix given in the course (Slide 60 of Lecture 1). Discuss the impact of $\lambda$ on the optimal policy and compare it to the case $\lambda = 0$ (without entropy regularization).





<a id=references></a>
<h2> <center> 4. References   </center> </h2>





- R. Sutton and A. Barto: *Reinforcement Learning: An Introduction*, second edition, 2016, available [here](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf).

- Y. Jia and X.Y. Zhou: Policy gradient and actor-critic learning in continuous time and space: theory and algorithms, Journal of Machine Learning Research, 2022, available [here](https://arxiv.org/abs/2111.11232).

- Y. Jia and X.Y. Zhou: q-Learning in continuous time, Journal of Machine Learning Research, 2023, available [here](https://arxiv.org/abs/2207.00713).



