# Optimal Maintenance Problem

<img src="https://www.theholidayspot.com/christmas/images/symbols/chimney-sweep.jpg"/>

States: healthy (0), faulty (1), $\mathcal{S}=\{0,1\}$

Actions: do nothing (0), repair (1), $\mathcal{A}=\{0,1\}$

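If the machine is repaired, it is healthy afterwards regardless of its current state, and the repair costs 10, i.e.

$p(r=-10,s'=0|s,a=1)=1 \quad \text{for all } s\in\mathcal{S}$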

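If nothing is done and the machine is faulty, it stays faulty, i.e.

$p(r=-1,s'=1|s=1,a=0)=1$

If nothing is done and the machine is healthy, it stays healthy with probability $\alpha$ and becomes faulty with probability $1-\alpha$: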

$
p(r=0,s'=0|s=0,a=0)=\alpha
$

$
p(r=0,s'=1|s=0,a=0)=1-\alpha
$

We consider a general discount factor $\gamma$. More on predictive maintenance can be found <a href="https://www2.humusoft.cz/www/papers/tcp11/019_berka.pdf">here</a>.

In [19]:
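p = {}  # transition model: p[(s, a)] = {(r, s_next): P(r, s_next | s, a)}
p[(0, 1)] = {(-10, 0): 1}  # repair a healthy machine: reward -10, stays healthy
p[(1, 1)] = {(-10, 0): 1}  # repair a faulty machine: reward -10, becomes healthy
p[(1, 0)] = {(-1, 1): 1}  # do nothing when faulty: reward -1, stays faulty
p[(0, 0)] = {(0, 0): 0.95, (0, 1): 0.05}  # do nothing when healthy: alpha = 0.95
gamma = 0.9  # discount factor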
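As a quick sanity check on the model above, the following minimal sketch verifies that every entry of `p` is a probability distribution and computes the expected immediate reward for a state-action pair. The helper names `check_model` and `expected_reward` are hypothetical and are not used elsewhere in this notebook.

```python
# Transition model from the cell above: p[(s, a)] = {(r, s_next): probability}
p = {
    (0, 1): {(-10, 0): 1},
    (1, 1): {(-10, 0): 1},
    (1, 0): {(-1, 1): 1},
    (0, 0): {(0, 0): 0.95, (0, 1): 0.05},
}


def check_model(p, tol=1e-12):
    """Return True if every p[(s, a)] has nonnegative entries that sum to 1."""
    for outcomes in p.values():
        probs = list(outcomes.values())
        if any(q < 0 for q in probs) or abs(sum(probs) - 1) > tol:
            return False
    return True


def expected_reward(p, s, a):
    """Expected immediate reward: sum over (r, s') of r * p(r, s' | s, a)."""
    return sum(prob * r for (r, _), prob in p[(s, a)].items())


print(check_model(p))
for (s, a) in sorted(p):
    print(s, a, expected_reward(p, s, a))
```

A check like this is cheap and catches typos in the probabilities before any value function is computed from the model.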

# Policies and Value Functions

The return is the long-term reward. We consider it in discounted form:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$$

Note that $G_t$ and $G_{t+1}$ are related recursively:

$$
G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1} = R_{t+1}+\sum_{k=1}^\infty \gamma^k R_{t+k+1}
= R_{t+1}+\sum_{k=0}^\infty \gamma^{k+1} R_{t+k+2} = R_{t+1}+\gamma\sum_{k=0}^\infty \gamma^k R_{t+k+2} = R_{t+1} + \gamma G_{t+1}
$$
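The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ can be checked numerically on a finite reward sequence. This is a minimal sketch: the reward list is a made-up sample, rewards after the end of the list are assumed to be zero, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """G_t for a finite list [R_{t+1}, R_{t+2}, ...]; later rewards count as 0."""
    g = 0.0
    # Work backwards, applying G_t = R_{t+1} + gamma * G_{t+1} at each step
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.9
rewards = [0, 0, -1, -1, -10]  # hypothetical sample, not generated by the model

g_t = discounted_return(rewards, gamma)
g_next = discounted_return(rewards[1:], gamma)

# Direct evaluation of the defining sum, for comparison
direct = sum(gamma**k * r for k, r in enumerate(rewards))

print(abs(g_t - direct) < 1e-12)
print(abs(g_t - (rewards[0] + gamma * g_next)) < 1e-12)
```

Computing the return backwards uses the recursion directly and avoids evaluating powers of $\gamma$.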
We are interested in the expected return $G_t$ given $S_t=s$ when following a policy $\pi$. Defined for all $s$, this is the **state-value function for policy $\pi$**:

$$
v_\pi(s) = \mathbb{E}_\pi[G_t|S_t=s]
$$

Similarly, we are interested in the expected return when starting in state $s$, taking action $a$, and following $\pi$ afterwards. This is the **action-value function for policy $\pi$**:
$$
q_\pi(s,a) = \mathbb{E}_\pi[G_t|S_t=s,A_t=a]
$$
There is an important recursive relation for $v_{\pi}(s)$ (called **Bellman equation**):
$$
\begin{align}
v_{\pi}(s) && = && \mathbb{E}_{\pi}[G_t|S_t=s] \\
&& = && \mathbb{E}_{\pi}[R_{t+1}+\gamma G_{t+1}|S_t=s] \\
&& = &&\sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)\left( r+\mathbb{E}_{\pi}[\gamma G_{t+1}|S_{t+1}=s']\right) \\
&& = &&\sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)\left( r+\gamma v_{\pi}(s') \right) 
\end{align}
$$
This relation can be represented by a so-called **backup diagram**:
<img src="http://www.incompleteideas.net/book/ebook/figtmp10.png"/>
Source <a href="http://www.incompleteideas.net/book/ebook/figtmp10.png">http://www.incompleteideas.net/book/ebook/figtmp10.png</a>
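For a fixed policy, the Bellman equation is a linear system in the unknowns $v_\pi(s)$, so it can be solved directly. The sketch below does this for the maintenance model with the "always repair" policy, chosen here only as an illustration. It assumes a deterministic policy given as a dictionary `{s: a}`; `solve_linear` and `evaluate_policy` are hypothetical helper names, and the small Gauss-Jordan solver is there only to keep the example free of dependencies.

```python
p = {
    (0, 1): {(-10, 0): 1},
    (1, 1): {(-10, 0): 1},
    (1, 0): {(-1, 1): 1},
    (0, 0): {(0, 0): 0.95, (0, 1): 0.05},
}


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b]
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(n):
            if i != col:
                factor = M[i][col] / M[col][col]
                M[i] = [x - factor * y for x, y in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(p, policy, states, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for a deterministic policy {s: a}."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for s in states:
        for (r, s_next), prob in p[(s, policy[s])].items():
            b[index[s]] += prob * r                      # expected reward
            A[index[s]][index[s_next]] -= gamma * prob   # -gamma * P_pi
    values = solve_linear(A, b)
    return {s: values[index[s]] for s in states}


always_repair = {0: 1, 1: 1}  # illustrative policy: repair in every state
v = evaluate_policy(p, always_repair, states=[0, 1], gamma=0.9)
print(v)
```

Under this policy every step costs 10, so both values should be close to $-10/(1-\gamma) = -100$, which gives a simple way to confirm the solver.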

**Question**

- Looking at part (b) of the diagram, what recursive relation holds for the action-value function $q_{\pi}(s,a)$?
- What is the value function for policy $\pi(0)=0$ and $\pi(1)=1$ in the optimal maintenance problem? Hint: Solve the Bellman equation!

## Optimal Policies and Value Functions
Considering the set of all policies, we can define a <a href="https://en.wikipedia.org/wiki/Partially_ordered_set">partial ordering</a> like this:
$\pi\geq \pi'$ if and only if $v_{\pi}(s)\geq v_{\pi'}(s)$ for all $s$. 
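The comparison can be written as a small function on value functions. In this minimal sketch the three value functions are made-up numbers chosen only to illustrate the definition, and `at_least_as_good` is a hypothetical helper name.

```python
def at_least_as_good(v_a, v_b):
    """True if v_a(s) >= v_b(s) for every state s."""
    return all(v_a[s] >= v_b[s] for s in v_a)


# Hypothetical value functions of three policies on states {0, 1}
v1 = {0: -3.0, 1: -10.0}
v2 = {0: -4.0, 1: -12.0}
v3 = {0: -2.0, 1: -11.0}

print(at_least_as_good(v1, v2))                             # True
print(at_least_as_good(v1, v3), at_least_as_good(v3, v1))   # False False
```

The second line of output shows a pair where neither policy dominates the other: one is better in state 0, the other in state 1.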

A policy is optimal if it is greater than or equal to every other policy. There may be several optimal policies, but all of them share the same value function:
$$
v^{*}(s)=\max_\pi v_\pi(s)
$$

**Questions**:

- Prove that the ordering is partial.
- Provide an example of an MDP in which more than one policy is optimal.

Similarly, we can define the optimal action-value function
$$
q^{*}(s,a)=\max_{\pi} q_{\pi}(s,a)
$$

**Question**:

- Does $q^{*}$ correspond to the same optimal policies as $v^{*}$?

Hint: $q^{*}(s,a)=\mathbb{E}[R_{t+1}+\gamma v^{*}(S_{t+1})|S_t=s,A_t=a]$

Bellman optimality equation:
$$
\begin{align}
v^{*}(s) && = && \max_a q_{\pi^{*}}(s,a)\\
&& = && \max_a \mathbb{E}_{\pi^{*}}[G_t | S_t=s,A_t=a]\\
&& = && \max_a \mathbb{E}_{\pi^{*}}[R_{t+1} + \gamma v^{*}(S_{t+1}) | S_t=s,A_t=a]\\
&& = && \max_a \sum_{s',r}p(s',r|s,a)[r+\gamma v^{*}(s')]\\
\end{align}
$$

Similarly:
$$
q^{*}(s,a) = \sum_{s',r}p(s',r|s,a)[r+\gamma\max_{a'}q^{*}(s',a')]
$$

In this case, we have backup diagrams like this:
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQMlEqT-T1HFB3THaiGWDCaIDkISr0dfp1GEzPVtgOmCaXno4wFWw"/>
Source <a href="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQMlEqT-T1HFB3THaiGWDCaIDkISr0dfp1GEzPVtgOmCaXno4wFWw">https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQMlEqT-T1HFB3THaiGWDCaIDkISr0dfp1GEzPVtgOmCaXno4wFWw</a>

These equations can be solved as a system of nonlinear equations.

**Question**:

- Why non-linear?
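One possible way to solve the Bellman optimality equations numerically is to apply the right-hand side repeatedly until the values stop changing. The sketch below does this for the maintenance model with $\gamma=0.9$. It is an illustration under stated assumptions, not the method prescribed by these notes: `q_from_v` and `value_iteration` are hypothetical helper names, and the tolerance and iteration cap are arbitrary choices.

```python
p = {
    (0, 1): {(-10, 0): 1},
    (1, 1): {(-10, 0): 1},
    (1, 0): {(-1, 1): 1},
    (0, 0): {(0, 0): 0.95, (0, 1): 0.05},
}
states, actions, gamma = [0, 1], [0, 1], 0.9


def q_from_v(p, v, s, a, gamma):
    """One-step lookahead: sum over (r, s') of p(r, s' | s, a) * (r + gamma * v(s'))."""
    return sum(prob * (r + gamma * v[s_next])
               for (r, s_next), prob in p[(s, a)].items())


def value_iteration(p, states, actions, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v(s) <- max_a q(s, a) until the largest change is below tol."""
    v = {s: 0.0 for s in states}
    for _ in range(max_iter):
        new_v = {s: max(q_from_v(p, v, s, a, gamma) for a in actions)
                 for s in states}
        delta = max(abs(new_v[s] - v[s]) for s in states)
        v = new_v
        if delta < tol:
            break
    return v


v_star = value_iteration(p, states, actions, gamma)

# A greedy policy with respect to v_star picks a maximizing action in each state
greedy = {s: max(actions, key=lambda a: q_from_v(p, v_star, s, a, gamma))
          for s in states}
print(v_star)
print(greedy)
```

The iteration converges for $\gamma<1$ because each update is a contraction with factor $\gamma$, so the stopping tolerance bounds the remaining error up to a factor of $\gamma/(1-\gamma)$.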



## Optimality and Approximations

How can we cope with large state and action spaces $\mathcal{S}$ and $\mathcal{A}$?

**Questions**:

- In terms of memory?
- In terms of states that are being updated?

# Homework

Obligatory:

- Define your own Markov decision process, specifying all elements of the definition.

Optional:

- Solve the Bellman optimality equations and determine that gamma analytically. Send the solution 24 hours before the lecture (LaTeX or scanned documents).

# Next Time - Dynamic Programming Methods

- How to evaluate $v_\pi$ iteratively.
- How to use that information to improve the policy.
- How to iterate these two steps.
- What are the other options.