# &#x1F4D1; &nbsp; <span style="color:red"> Reflections. Introduction To Reinforcement Learning. Lessons 1-2</span>

##   &#x1F916; &nbsp; <span style="color:red">Links</span>

#### Artificial Intelligence: A Modern Approach 
http://aima.cs.berkeley.edu/
#### Reinforcement Learning
http://reinforcementlearning.ai-depot.com/
#### Markov Decision Processes
http://www.cs.rice.edu/~vardi/dag01/givan1.pdf
#### Tools for Decision Analysis:
https://home.ubalt.edu/ntsbarsh/opre640a/PartIX.htm#rFromDaKno

##  &#x1F916; &nbsp;  <span style="color:red"> Lesson 1.  Smoov & Curly's Bogus Journey</span>

    import math

    # Actions: U - up, R - right, L - left, D - down
    # The commanded sequence U-U-R-R-R reaches the goal in two ways:
    #   U-U-R-R-R: all 5 moves go "the right way" (p = 0.8 each)
    #   R-R-U-U-R: 4 moves go "the possible way" (p = 0.1 each), then 1 goes "the right way" (p = 0.8)
    p_r, p_p = 0.8, 0.1
    print(math.pow(p_r, 5) + math.pow(p_p, 4) * p_r)  # 0.3277600000000001

***Modeling for decision making*** involves two distinct parties: the decision-maker and the model-builder.

Systems are formed from parts put together in a particular manner in order to pursue an objective. The relationship between the parts determines what the system does and how it functions as a whole.

A system that does not change is a static (i.e., deterministic) system. Many of the systems we are part of are dynamic systems, that is, they change over time.

In ***deterministic models***, a good decision is judged by the outcome alone. However, in probabilistic models, the decision-maker is concerned not only with the outcome value but also with the amount of risk each decision carries.
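The difference can be sketched with two made-up decisions that have the same expected outcome but very different risk. The payoffs, the probabilities, and the choice of standard deviation as the risk measure are all assumptions for illustration only.

```python
import math

def expected_value(lottery):
    """Mean outcome of a list of (probability, payoff) pairs."""
    return sum(p * x for p, x in lottery)

def risk(lottery):
    """Standard deviation of the payoff, used here as a simple risk measure."""
    mean = expected_value(lottery)
    return math.sqrt(sum(p * (x - mean) ** 2 for p, x in lottery))

# Hypothetical decisions: same expected payoff, very different spread.
safe = [(1.0, 50.0)]
gamble = [(0.5, 0.0), (0.5, 100.0)]

for name, lottery in [("safe", safe), ("gamble", gamble)]:
    print(name, expected_value(lottery), risk(lottery))
```

Judged by expected outcome alone the two decisions look identical; only the risk term separates them.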

Strategy of the ***Decision-Making Process***:

### $\mathit {\color{red} {Identify \ the \ decision \ situation \ and \ understand \ objectives \mapsto Identify \ alternatives \mapsto Decompose \ and \ model \ the \ problem \mapsto Choose \ the \ best\ alternative \mapsto Sensitivity \ Analysis \mapsto (Is \ further \ analysis \ needed?) \mapsto (if \ NO) \ Implement \ the \ chosen \ alternative}}$

Probabilistic Modeling: from Data to Information, from Information to Facts, and finally, from Facts to Knowledge. 

***Sources of Errors*** in Decision Making: false assumptions, inaccurate estimates of the probabilities, relying on expectations, difficulties in measuring the utility function, and forecast errors.

The deficiencies in our knowledge of the future may be divided into three domains, each with rather murky boundaries:

***Risk:*** One might be able to enumerate the outcomes and figure the probabilities. However, one must look out for non-normal distributions, especially those with “fat tails”, as rare events in the stock market illustrate.

***Uncertainty:*** One might be able to enumerate the outcomes but the probabilities are murky. Most of the time, the best one can do is to give a rank order to possible outcomes and then be careful that one has not omitted one of significance.

Uncertainty is a fact of every process or business; probability is the guide toward "good" events in processes or business.

***Black Swans:*** The name comes from the black swans found in Australia, which overturned the old assumption that all swans are white. This is the domain of events which are either “extremely unlikely” or “inconceivable”, but when they happen, and they do happen, they have serious consequences, usually bad.

***Reinforcement Learning*** is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize their performance. Simple reward feedback is required for the agent to learn its behaviour; this is known as the reinforcement signal.

There are three fundamental problems that RL must tackle: the exploration-exploitation tradeoff, the problem of delayed reward (credit assignment), and the need to generalize. 

A ***Markov Decision Process (MDP)*** is just like a Markov Chain, except that the transition matrix depends on the action taken by the decision maker (agent) at each time step. The agent receives a reward, which depends on the action and the state. The goal is to find a function, called a policy, which specifies which action to take in each state, so as to maximize some function (e.g., the mean or expected discounted sum) of the sequence of rewards. One can formalize this in terms of Bellman's equation, which can be solved iteratively, for example by policy iteration. The unique fixed point of this equation is the optimal value function, and a policy that acts greedily with respect to it is optimal.

A Markov Decision Process (MDP) model contains:
    
- A set of possible world states S
- A set of possible actions A
- A real valued reward function R(s,a)
- A description T of each action’s effects in each state.

***Markov Property***: the effects of an action taken in a state depend only on that state and not on the prior history.

***Deterministic Actions:*** For each state and action we specify a new state.

***Stochastic Actions:*** For each state and action we specify a probability distribution over next states.

Let's define the transition matrix and reward functions. We are assuming states, actions and time are discrete. 
#### $T(s,a,s') = \Pr[S_{t+1}=s' \mid S_t=s, A_t=a]$
#### $R(s,a,s') = E[R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s']$
Continuous MDPs can also be defined, but are usually solved by discretization.
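As a minimal sketch, the four ingredients of a discrete MDP can be stored in plain Python containers. The two-state example below, including its state names, action names, probabilities and rewards, is hypothetical and only serves to show one deterministic and one stochastic action.

```python
# Hypothetical two-state MDP; every name and number here is made up.
STATES = ["closed", "open"]
ACTIONS = ["push", "wait"]

# T[s][a] maps each next state s' to Pr[S(t+1)=s' | S(t)=s, A(t)=a].
T = {
    "closed": {"push": {"open": 0.8, "closed": 0.2},   # stochastic action
               "wait": {"closed": 1.0}},               # deterministic action
    "open":   {"push": {"open": 1.0},
               "wait": {"open": 1.0}},
}

# R[(s, a, s')] is the expected reward of a transition; missing keys mean 0.
R = {("closed", "push", "open"): 1.0, ("closed", "push", "closed"): -0.1}

def reward(s, a, s2):
    """Look up R(s, a, s'), defaulting to 0 for unlisted transitions."""
    return R.get((s, a, s2), 0.0)

def is_valid_transition_model(T, states, actions, tol=1e-9):
    """Check that every T[s][a] is a probability distribution over states."""
    for s in states:
        for a in actions:
            dist = T[s][a]
            if any(s2 not in states or p < 0 for s2, p in dist.items()):
                return False
            if abs(sum(dist.values()) - 1.0) > tol:
                return False
    return True

def is_deterministic(T, s, a):
    """An action is deterministic in s if one next state has probability 1."""
    return max(T[s][a].values()) == 1.0

print(is_valid_transition_model(T, STATES, ACTIONS))
print(is_deterministic(T, "closed", "wait"), is_deterministic(T, "closed", "push"))
```

Because T depends only on the current state and action, this representation has the Markov property built in: no history is stored anywhere.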

We define the value of performing action a in state s as follows:

#### $Q(s,a) = \sum_{s'} T(s,a,s') \, [ R(s,a,s') + g \, V(s') ]$
where $0 < g \le 1$ is the amount by which we discount future rewards, and V(s) is the overall value of state s, given by
Bellman's equation:
#### $V(s) = \max_a Q(s,a) = \max_a \sum_{s'} T(s,a,s') \, [ R(s,a,s') + g \, V(s') ]$
In words, the value of a state is obtained by picking the action that maximizes the expected immediate reward plus the expected discounted value of the possible successor states s'. If we define
#### $R(s,a) = E[ R(s,a,s') ] = \sum_{s'} T(s,a,s') \, R(s,a,s')$
the above equation simplifies to the more common form
#### $V(s) = \max_a \left[ R(s,a) + g \sum_{s'} T(s,a,s') \, V(s') \right]$
which, for a fixed policy and a tabular (non-parametric) representation of the V/Q/T/R functions, can be rewritten in matrix-vector form as V = R + g T V. 

Solving these n simultaneous equations is called value determination (n is the number of states).
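For a fixed policy, V = R + g T V rearranges to (I - g T) V = R, which is an ordinary linear system. The sketch below solves it with a small hand-written Gaussian elimination so that it needs no external library. The helper names and the two-state example are assumptions, and g < 1 is assumed so that the system has a unique solution.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def value_determination(T_pi, R_pi, g):
    """Solve V = R + g T V, i.e. (I - g T) V = R, for a fixed policy."""
    n = len(R_pi)
    A = [[(1.0 if i == j else 0.0) - g * T_pi[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear_system(A, R_pi)

# Hypothetical 2-state chain under some fixed policy (numbers are made up):
# state 0 earns reward 1 and moves on half of the time, state 1 is absorbing.
T_pi = [[0.5, 0.5],
        [0.0, 1.0]]
R_pi = [1.0, 0.0]
V = value_determination(T_pi, R_pi, g=0.9)
print(V)
```

Here the absorbing state has value 0, and the other state satisfies V(0) = 1 + 0.45 V(0), so the solver should return V(0) = 1/0.55.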

If V/Q satisfies the Bellman equation, then the greedy policy
#### $p(s) = \arg\max_a Q(s,a)$
is optimal. If not, we can set p(s) to argmax_a Q(s,a), re-evaluate V (and hence Q), and repeat. This is called policy iteration, and it is guaranteed to converge to an optimal policy (the optimal value function is unique, although several policies may attain it).
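The evaluate-then-improve loop described above can be sketched as follows. This is one possible implementation, not a reference one: the evaluation step uses repeated Bellman backups rather than an exact linear solve, g < 1 is assumed so that those backups converge, and the tiny two-state MDP is hypothetical.

```python
def q_value(T, R, V, g, s, a):
    """Q(s,a) = sum_s' T(s,a,s') [ R(s,a,s') + g V(s') ]."""
    return sum(T[s][a][s2] * (R[s][a][s2] + g * V[s2]) for s2 in range(len(V)))

def evaluate_policy(T, R, policy, g, tol=1e-10):
    """Approximate V for a fixed policy by repeated Bellman backups."""
    V = [0.0] * len(T)
    while True:
        new_V = [q_value(T, R, V, g, s, policy[s]) for s in range(len(T))]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V

def policy_iteration(T, R, g):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = len(T), len(T[0])
    policy = [0] * n_states
    while True:
        V = evaluate_policy(T, R, policy, g)
        stable = True
        for s in range(n_states):
            best = max(range(n_actions), key=lambda a: q_value(T, R, V, g, s, a))
            # Switch only on a strict improvement, so ties cannot cause cycling.
            if q_value(T, R, V, g, s, best) > q_value(T, R, V, g, s, policy[s]) + 1e-9:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

# Hypothetical MDP: in state 0, action 1 reaches state 1 with probability 0.8
# and earns reward 1 on success; action 0 does nothing. State 1 is absorbing.
T = [[[1.0, 0.0], [0.2, 0.8]],
     [[0.0, 1.0], [0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
policy, V = policy_iteration(T, R, g=0.9)
print(policy, V)
```

In this example the loop stops after two evaluations: the first improvement switches state 0 to action 1, and the second finds nothing left to improve.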

The best theoretical upper bound on the number of iterations needed by policy iteration is exponential in n, but in practice the number of steps is O(n). Formulating the problem as a linear program shows that an optimal policy can be found in polynomial time.
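For reference, one common way to write that linear program, in the notation used above and assuming g < 1, is:

$$
\begin{aligned}
\min_{V} \quad & \sum_{s} V(s) \\
\text{subject to} \quad & V(s) \ge \sum_{s'} T(s,a,s') \, [ R(s,a,s') + g \, V(s') ] \quad \text{for all } s, a
\end{aligned}
$$

It has n variables and one constraint per state-action pair. Its solution is the optimal value function, from which a greedy policy can be read off.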

http://stats.stackexchange.com/questions/145122/real-life-examples-of-markov-decision-processes
#### MDP=⟨S,A,T,R,γ⟩
where S are the states, A the actions, T the transition probabilities (i.e. the probabilities Pr(s′|s,a) of going from one state to another given an action), R the rewards (given a certain state, and possibly action), and γ is a discount factor that is used to reduce the importance of future rewards.

So in order to use it, you need to have predefined:

States: these can be, for example, the cells of a grid map in robotics, or door open and door closed.

Actions: a fixed set of actions, such as going north, south, east, etc. for a robot, or opening and closing a door.

Transition probabilities: the probability of going from one state to another given an action. For example, what is the probability that the door ends up open if the action is open? In a perfect world this could be 1.0, but a robot might fail to handle the door knob correctly. Another example, for a moving robot, is the action north, which in most cases would bring it to the grid cell north of it, but in some cases it could move too far and reach the cell after that.

Rewards: these are used to guide the planning. In the case of the grid example we might want to go to a certain cell, and the reward will be higher if we get closer. In the case of the door example an open door might give a high reward.
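The door example can be turned into a small simulation that also shows what the discount factor γ does to a sequence of rewards. All numbers are hypothetical: the 0.9 success probability, the reward of 1 for an open door, and the action names `open_door` and `close_door` are assumptions made for this sketch.

```python
import random

STATES = ["closed", "open"]
ACTIONS = ["open_door", "close_door"]

# Hypothetical numbers: each action succeeds 90% of the time when it matters.
T = {
    ("closed", "open_door"):  {"open": 0.9, "closed": 0.1},
    ("closed", "close_door"): {"closed": 1.0},
    ("open", "open_door"):    {"open": 1.0},
    ("open", "close_door"):   {"closed": 0.9, "open": 0.1},
}

def reward(state, action, next_state):
    """Hypothetical reward: 1 whenever the door ends up open, else 0."""
    return 1.0 if next_state == "open" else 0.0

def step(state, action, rng):
    """Sample a next state from Pr(s' | s, a) and return it with its reward."""
    dist = T[(state, action)]
    next_state = rng.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, reward(state, action, next_state)

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rng = random.Random(0)  # fixed seed so the run is repeatable
state, rewards = "closed", []
for _ in range(5):
    state, r = step(state, "open_door", rng)  # fixed policy: always try to open
    rewards.append(r)
print(rewards, discounted_return(rewards, gamma=0.9))
```

With γ close to 0 only the first rewards matter, while with γ close to 1 later rewards count almost as much as early ones.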

#### Examples of Applications of MDPs

- Harvesting: how many members of a population have to be left for breeding.
- Agriculture: how much to plant based on weather and soil state.
- Water resources: keep the correct water level at reservoirs.
- Inspection, maintenance and repair: when to replace/inspect based on age, condition, etc.
- Purchase and production: how much to produce based on demand.
- Queues: reduce waiting time.
...

- Finance: deciding how much to invest in stock.
- Robotics:

     - A dialogue system to interact with people.
     - Robot bartender.
     - Robot exploration for navigation.
...

Dynamic Programming is a very general solution method for problems which have two properties:
    
- Optimal substructure
  - Principle of optimality applies
  - Optimal solution can be decomposed into subproblems
- Overlapping subproblems
  - Subproblems recur many times
  - Solutions can be cached and reused
  
Markov decision processes satisfy both properties:

- Bellman equation gives recursive decomposition
- Value function stores and reuses solutions

Dynamic programming is used for planning in an MDP:

- For prediction:
  - Input: MDP ⟨S, A, P, R, γ⟩ and policy π
  - or: MRP ⟨S, $P^\pi$, $R^\pi$, γ⟩
  - Output: value function $v_\pi$
- Or for control:
  - Input: MDP ⟨S, A, P, R, γ⟩
  - Output: optimal value function $v_*$
  - and: optimal policy $\pi_*$
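For the control case, value iteration is one dynamic programming method that fits this input/output description. The sketch below is a minimal version, assuming γ < 1 and a small finite MDP stored as nested lists. The three-state chain is hypothetical.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Return (v_star, pi_star) for a finite MDP given as nested lists.

    P[s][a][s2] is the transition probability, R[s][a] the expected reward.
    """
    n_states, n_actions = len(P), len(P[0])

    def backup(v, s, a):
        # One-step lookahead: immediate reward plus discounted successor value.
        return R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(n_states))

    v = [0.0] * n_states
    while True:
        new_v = [max(backup(v, s, a) for a in range(n_actions))
                 for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v  # cached values are reused by the next sweep
        if delta < tol:
            break
    pi = [max(range(n_actions), key=lambda a: backup(v, s, a))
          for s in range(n_states)]
    return v, pi

# Hypothetical deterministic chain 0 -> 1 -> 2 with actions 0 = stay, 1 = advance.
# Advancing from state 1 into the absorbing state 2 pays 1; all else pays 0.
P = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
     [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
     [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]
R = [[0.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
v_star, pi_star = value_iteration(P, R, gamma=0.9)
print(v_star, pi_star)
```

Each sweep reuses the values cached from the previous one, which is the "overlapping subproblems" property at work. In this chain the reward of 1 is discounted once before it reaches state 0, so state 0 is worth 0.9.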