![DSME-logo](./img/DSME_logo.png)

#  Reinforcement Learning and Learning-based Control

<p style="font-size:12pt;">
<b> Prof. Dr. Sebastian Trimpe, Dr. Friedrich Solowjow </b><br>
<b> Institute for Data Science in Mechanical Engineering (DSME) </b><br>
<a href = "mailto:rllbc@dsme.rwth-aachen.de">rllbc@dsme.rwth-aachen.de</a><br>
</p>

---

# Demo for Exercise 1: Markov Decision Processes 

This Jupyter notebook accompanies Exercise 1 (Task 1) of the course Reinforcement Learning and Learning-based Control in SS 23.
Load it in Jupyter or JupyterLab to rerun and modify it yourself.

In [11]:
import numpy as np

We first define the initial state $S_{t=0}$ (as a probability vector over the three states) and the transposed transition matrix $P^T$:

In [12]:
S_0 = np.array([1, 0, 0])
P_transpose = np.array([[0.2, 0.5, 0.7], [0.6, 0, 0], [0.2, 0.5, 0.3]])
print('P_transpose:')
print(P_transpose)

P_transpose:
[[0.2 0.5 0.7]
 [0.6 0.  0. ]
 [0.2 0.5 0.3]]
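
As a quick sanity check (a sketch, not part of the original exercise): since we propagate with $S_{t+1} = P^T S_t$, column $j$ of $P^T$ holds the probabilities of leaving state $j$, so each column should sum to 1. The names `column_sums` and `is_stochastic` are introduced here only for illustration.

```python
import numpy as np

P_transpose = np.array([[0.2, 0.5, 0.7], [0.6, 0, 0], [0.2, 0.5, 0.3]])

# Column j of P_transpose contains the transition probabilities out of state j,
# so every column must sum to 1 and all entries must be non-negative.
column_sums = P_transpose.sum(axis=0)
is_stochastic = bool(np.allclose(column_sums, 1.0) and np.all(P_transpose >= 0))
print('column sums:', column_sums)
print('valid transition matrix:', is_stochastic)
```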


Now we compute the state distributions at time steps $t=1$ and $t=2$ using $S_{t+1} = P^T S_t$.

In [13]:
# compute S_1
S_1 = np.dot(P_transpose, S_0)
print("S_1: ", S_1)

# compute S_2
S_2 = np.dot(P_transpose, S_1)
print("S_2: ", S_2)

S_1:  [0.2 0.6 0.2]
S_2:  [0.48 0.12 0.4 ]
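
Because each step applies the same matrix, the recursion unrolls to $S_t = (P^T)^t S_0$. The following sketch, with the illustrative name `S_2_direct`, computes $S_2$ in one step from the squared matrix and should agree with the result above.

```python
import numpy as np

S_0 = np.array([1, 0, 0])
P_transpose = np.array([[0.2, 0.5, 0.7], [0.6, 0, 0], [0.2, 0.5, 0.3]])

# S_t = (P^T)^t S_0, so S_2 follows directly from the second matrix power.
S_2_direct = np.linalg.matrix_power(P_transpose, 2) @ S_0
print("S_2 (via matrix power): ", S_2_direct)
```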


We examine the state distribution after 500 time steps for two different initial states, $S_0$ and $\hat{S}_0$.

In [14]:
# State after 500 time steps
S_t = S_0
for t in range(500):
    S_t = np.dot(P_transpose, S_t)
print('state after 500 timesteps, starting in S_0: ', S_t)

# Now start in a different start state S_0_hat = [0, 0, 1]. Which state distribution do we reach after 500 timesteps?
S_0_hat = np.array([0, 0, 1])
S_t = S_0_hat
for t in range(500):
    S_t = np.dot(P_transpose, S_t)
print('state after 500 timesteps, starting in S_0_hat: ', S_t)

state after 500 timesteps, starting in S_0:  [0.43209877 0.25925926 0.30864198]
state after 500 timesteps, starting in S_0_hat:  [0.43209877 0.25925926 0.30864198]
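
One way to see why both starting states give the same result is to look at $(P^T)^{500}$ itself. In the sketch below (the names `P_500` and `columns_agree` are illustrative), all columns of the matrix power turn out to be numerically identical, so multiplying by any probability vector $S_0$ returns that same column.

```python
import numpy as np

P_transpose = np.array([[0.2, 0.5, 0.7], [0.6, 0, 0], [0.2, 0.5, 0.3]])

# 500 applications of P^T collapsed into a single matrix.
P_500 = np.linalg.matrix_power(P_transpose, 500)

# Compare every column with the first one (broadcast over columns).
columns_agree = bool(np.allclose(P_500, P_500[:, [0]]))
print('(P^T)^500:')
print(P_500)
print('all columns identical:', columns_agree)
```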


By iterating for 500 time steps, we saw that the system converges to a stationary state distribution $S_{\infty}$ that does not depend on the starting state. We can also obtain this distribution as the eigenvector $v$ of $P^T$ that solves $P^T v = \lambda v$ with eigenvalue $\lambda = 1$, normalized so that its entries sum to 1.

In [15]:
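eigval, eigvec = np.linalg.eig(P_transpose)
idx = np.argmin(np.abs(eigval - 1)) # index of the eigenvalue closest to 1 (its position is not guaranteed)
eigvec_norm = np.abs(np.real(eigvec[:, idx])) # eigenvector belonging to lambda = 1
eigvec_norm = eigvec_norm / np.sum(eigvec_norm) # normalize such that the entries sum to 1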
print('stationary state distribution: ', eigvec_norm)

stationary state distribution:  [0.43209877 0.25925926 0.30864198]
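
As an alternative to the eigendecomposition (a sketch using a different technique, not the method of the exercise), the stationary distribution can be found by solving the linear system $(P^T - I)\,v = 0$ together with the constraint $\sum_i v_i = 1$. The names `A`, `b`, and `v` are introduced here for illustration, and least squares is used only because the stacked system has more equations than unknowns.

```python
import numpy as np

P_transpose = np.array([[0.2, 0.5, 0.7], [0.6, 0, 0], [0.2, 0.5, 0.3]])
n = P_transpose.shape[0]

# Stack the fixed-point equations (P^T - I) v = 0 with the normalization sum(v) = 1.
A = np.vstack([P_transpose - np.eye(n), np.ones((1, n))])
b = np.concatenate([np.zeros(n), [1.0]])

# The system is consistent, so the least-squares solution satisfies it exactly
# (up to floating-point error).
v, *_ = np.linalg.lstsq(A, b, rcond=None)
print('stationary state distribution (linear solve): ', v)

# Check the fixed-point property P^T v = v.
print('P^T v == v:', np.allclose(P_transpose @ v, v))
```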
