# &#x1F4D1; &nbsp; <span style="color:red"> Reflections. Introduction To Reinforcement Learning. Lessons 9-10</span>

##   &#x1F916; &nbsp; <span style="color:red">Links & Libraries</span>

Reinforcement Learning: An Introduction http://incompleteideas.net/sutton/book/bookdraft2016aug.pdf

POMDP tutorials https://www.techfak.uni-bielefeld.de/~skopp/Lehre/STdKI_SS10/POMDP_tutorial.pdf

POMDP solution methods https://www.techfak.uni-bielefeld.de/~skopp/Lehre/STdKI_SS10/POMDP_solution.pdf

Decision Making under Uncertainty. MDPs and POMDPs http://web.stanford.edu/~mykel/pomdps.pdf

POMDPs https://people.eecs.berkeley.edu/~pabbeel/cs287-fa13/slides/pomdps.pdf

Exact Solutions of Interactive POMDPs Using Behavioral Equivalence
https://www.cs.uic.edu/~piotr/papers/aamas06.pdf

Partially observable Markov decision process https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process

A Survey of POMDP Solution Techniques https://www.cs.ubc.ca/~murphyk/Papers/pomdp.pdf

A Survey of POMDP Applications http://www.pomdp.org/papers/applications.pdf

In [2]:
from IPython.display import Image

##  &#x1F916; &nbsp;  <span style="color:red"> Lesson 9. Partially Observable MDPs</span>

A **partially observable Markov decision process (POMDP)** models an agent's decision process in which the system dynamics are governed by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on its observations, the observation probabilities, and the underlying MDP.

Formally, a POMDP is a 7-tuple ${\displaystyle (S,A,T,R,\Omega ,O,\gamma )}$, where

- ${\displaystyle S}$ is a set of states,
- ${\displaystyle A}$ is a set of actions,
- ${\displaystyle T}$ is a set of conditional transition probabilities between states,
- ${\displaystyle R:S\times A\to \mathbb {R} }$ is the reward function,
- ${\displaystyle \Omega }$  is a set of observations,
- ${\displaystyle O}$ is a set of conditional observation probabilities, and
- ${\displaystyle \gamma \in [0,1]}$ is the discount factor.

At each time period, the environment is in some state ${\displaystyle s\in S}$. 

The agent takes an action ${\displaystyle a\in A}$, which causes the environment to transition to state ${\displaystyle s'}$ with probability ${\displaystyle T(s'\mid s,a)}$. 

At the same time, the agent receives an observation ${\displaystyle o\in \Omega }$, which depends on the new state of the environment and on the action just taken, with probability ${\displaystyle O(o\mid s',a)}$. 

Finally, the agent receives a reward equal to ${\displaystyle R(s,a)}$. 

Then the process repeats. 

The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward: ${\displaystyle E\left[\sum _{t=0}^{\infty }\gamma ^{t}r_{t}\right]}$.

The discount factor ${\displaystyle \gamma }$ determines how much immediate rewards are favored over more distant rewards. When ${\displaystyle \gamma =0}$ the agent only cares about which action will yield the largest expected immediate reward; when ${\displaystyle \gamma =1}$ the agent cares about maximizing the expected sum of future rewards.
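
The interaction loop described above can be sketched in a few lines of plain Python. The two-state machine model below (states `good`/`bad`, actions `wait`/`repair`, observations `ok`/`alarm`) and all of its numbers are hypothetical, chosen only for illustration; the helper names `sample`, `step`, and `discounted_return` are likewise assumptions, not part of any library.

```python
import random

# Hypothetical toy model (not from the text): a machine that is "good" or "bad".
T = {  # T[(s, a)][s'] = T(s' | s, a)
    ("good", "wait"): {"good": 0.9, "bad": 0.1},
    ("bad", "wait"): {"bad": 1.0},
    ("good", "repair"): {"good": 1.0},
    ("bad", "repair"): {"good": 1.0},
}
O = {  # O[(s', a)][o] = O(o | s', a)
    ("good", "wait"): {"ok": 0.8, "alarm": 0.2},
    ("bad", "wait"): {"ok": 0.3, "alarm": 0.7},
    ("good", "repair"): {"ok": 0.8, "alarm": 0.2},
    ("bad", "repair"): {"ok": 0.3, "alarm": 0.7},
}
R = {  # R[(s, a)] = R(s, a)
    ("good", "wait"): 1.0,
    ("bad", "wait"): -1.0,
    ("good", "repair"): -0.5,
    ("bad", "repair"): -0.5,
}


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[k] for k in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def step(s, a, T, O, R, rng):
    """Simulate one time period and return (s', o, r)."""
    s_next = sample(T[(s, a)], rng)   # s' ~ T(. | s, a)
    o = sample(O[(s_next, a)], rng)   # o  ~ O(. | s', a)
    return s_next, o, R[(s, a)]       # reward is R(s, a)


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


rng = random.Random(0)
state, rewards = "good", []
for _ in range(5):
    # The agent never sees `state`; it only receives `obs` and `reward`.
    state, obs, reward = step(state, "wait", T, O, R, rng)
    rewards.append(reward)
print(rewards, discounted_return(rewards, gamma=0.9))
```

With `gamma=0` only the first reward contributes to `discounted_return`, and with `gamma=1` all rewards are weighted equally, matching the two limiting cases described above.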

#### POMDP ≡ Continuous-Space Belief MDP (the agent does not observe the state itself; it only gets sensory measurements)

- A belief state is a distribution over states; in belief state b, probability b(s) is assigned to being in s
- Policies in POMDPs are mappings from belief states to actions

**Computing belief states**

- Begin with some initial belief state b prior to any observations
- Compute new belief state b' based on current belief state b, action a, and observation o
- b'(s') = P(s'| o, a, b)
- b'(s') ∝ P(o | s', a, b)P(s'| a, b)
- b'(s') ∝ O(o | s', a)P(s'| a, b)
- b'(s') ∝ O(o | s', a) ${\sum_s}$ P(s'| a, b, s)P(s | a, b)
- b'(s') ∝ O(o | s', a) ${\sum_s}$ T(s'| s, a)b(s)
- Kalman filter: exact update of the belief state for linear dynamical systems
- Particle filter: approximate update for general systems
- Here we consider only discrete-state problems
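
For the discrete case, the final line of the derivation, b'(s') ∝ O(o | s', a) Σ_s T(s' | s, a) b(s), translates almost directly into code. The sketch below is a minimal illustration under stated assumptions: the function name `update_belief` is hypothetical, and the model is a tiger-style toy problem (two hidden states, a single `listen` action that is assumed to report the correct side with probability 0.85).

```python
def update_belief(b, a, o, T, O):
    """Discrete Bayes filter: b'(s') is proportional to
    O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    states = list(b)
    unnormalized = {}
    for s_next in states:
        # Prediction step: P(s' | a, b) = sum_s T(s' | s, a) b(s)
        predicted = sum(T[(s, a)].get(s_next, 0.0) * b[s] for s in states)
        # Correction step: weight by the observation likelihood O(o | s', a)
        unnormalized[s_next] = O[(s_next, a)].get(o, 0.0) * predicted
    total = sum(unnormalized.values())  # equals P(o | a, b)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return {s: p / total for s, p in unnormalized.items()}


# Assumed toy model: listening does not move the tiger.
T = {
    ("tiger-left", "listen"): {"tiger-left": 1.0},
    ("tiger-right", "listen"): {"tiger-right": 1.0},
}
# Assumed sensor model: the correct side is heard with probability 0.85.
O = {
    ("tiger-left", "listen"): {"hear-left": 0.85, "hear-right": 0.15},
    ("tiger-right", "listen"): {"hear-left": 0.15, "hear-right": 0.85},
}

b0 = {"tiger-left": 0.5, "tiger-right": 0.5}          # uniform initial belief
b1 = update_belief(b0, "listen", "hear-left", T, O)   # belief after one observation
b2 = update_belief(b1, "listen", "hear-left", T, O)   # belief after a second one
print(b1, b2)
```

The normalizing constant `total` is the probability of the observation given the belief and action, which is why an observation with zero probability is treated as an error rather than silently returning an invalid distribution.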

- **Exact solution algorithms**
  - computing an exact solution for a general POMDP is PSPACE-hard
  - value iteration
  - policy iteration: policy evaluation and policy improvement
  
- **Offline methods**
  - Offline POMDP methods involve doing all or most of the computing prior to execution
  - Practical methods generally find only approximate solutions
  - Some methods output policy in terms of alpha-vectors, others as finite-state controllers
  - Some methods involve leveraging a factored representation
  - Very active area of research
  - Examples:
    - QMDP value approximation
    - Fast informed bound (FIB) value approximation
    - Point-based value iteration methods
  
- **Online methods**
  - Online methods determine which action to take by planning from the current belief state at execution time
  - Many online methods use a depth-first tree-based search up to some horizon
  - The set of belief states reachable from the current belief state is typically small compared to the full belief space
  - Time complexity is exponential in the search depth
  - Heuristics and branch-and-bound techniques allow the search space to be pruned
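
As an illustration of the offline approximations listed above, here is a minimal sketch of the QMDP idea: solve the underlying fully observable MDP for Q(s, a), then score each action under a belief b as Σ_s b(s) Q(s, a). The function names `qmdp_q_values` and `qmdp_action`, the two-state machine model, and all numbers are assumptions made for this example. Because QMDP effectively assumes the state becomes fully observable after one step, it places no value on information-gathering actions.

```python
def qmdp_q_values(states, actions, T, R, gamma, iterations=200):
    """Value iteration on the underlying MDP:
    Q(s, a) = R(s, a) + gamma * sum_s' T(s' | s, a) * max_a' Q(s', a')."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iterations):
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
        Q = {
            (s, a): R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
            for s in states
            for a in actions
        }
    return Q


def qmdp_action(b, actions, Q):
    """Pick the action maximizing sum_s b(s) * Q(s, a)."""
    return max(actions, key=lambda a: sum(b[s] * Q[(s, a)] for s in b))


# Hypothetical toy model: a machine that degrades unless it is repaired.
states = ["good", "bad"]
actions = ["wait", "repair"]
T = {  # T[(s, a)][s'] = T(s' | s, a)
    ("good", "wait"): {"good": 0.9, "bad": 0.1},
    ("bad", "wait"): {"bad": 1.0},
    ("good", "repair"): {"good": 1.0},
    ("bad", "repair"): {"good": 1.0},
}
R = {  # R[(s, a)] = R(s, a)
    ("good", "wait"): 1.0,
    ("bad", "wait"): -1.0,
    ("good", "repair"): -0.5,
    ("bad", "repair"): -0.5,
}

Q = qmdp_q_values(states, actions, T, R, gamma=0.9)
print(qmdp_action({"good": 0.9, "bad": 0.1}, actions, Q))
print(qmdp_action({"good": 0.2, "bad": 0.8}, actions, Q))
```

The offline part is the call to `qmdp_q_values`, which is done once before execution; at run time only the cheap weighted sum in `qmdp_action` is evaluated for the current belief.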

- Industrial Applications
  - Machine Maintenance
  - Structural Inspection
  - Elevator Control Policies
  - Fishery Industry
- Scientific Applications
  - Autonomous Robots
    - interplanetary rovers
    - deep-space navigation
    - bomb disposal
    - land-mine clearing
    - toxic waste clean-up
    - radioactive material handling
    - deep-ocean exploration
    - sewage/drainage network inspection and repair
  - Behavioral Ecology
  - Machine Vision
- Business Applications
  - Distributed Database Queries
  - Marketing
  - Questionnaire Design
  - Corporate Policy
- Military Applications
  - Moving Target Search and Identification
  - Rescue
  - Weapon Allocation
- Social Applications
  - Development of Teaching Strategies
  - Health Care Policymaking
  - Advanced Medical Diagnosis

##  &#x1F916; &nbsp;  <span style="color:red"> Lesson 10. </span>