# Reinforcement Learning

This notebook serves as the supporting material for the chapter **Reinforcement Learning**. It illustrates the use of the [reinforcement](https://github.com/aimacode/aima-java/tree/AIMA3e/aima-core/src/main/java/aima/core/learning/reinforcement) package of the code repository. Here we'll examine how an agent can learn what to do from rewards and punishments, in the absence of labeled examples.

In [1]:
%classpath add jar ../out/artifacts/aima_core_jar/aima-core.jar

Reinforcement learning refers to goal-oriented algorithms that learn how to attain a complex objective (goal) or to maximize along a particular dimension over many steps; for example, to maximize the points won in a game over many moves. They can start from a blank slate, and under the right conditions they can achieve superhuman performance.

Consider the problem of learning to play chess. A supervised learning agent needs to be told the correct move for each position it encounters, but such feedback is seldom available. In its absence, the agent needs to know that something good has happened when it (accidentally) checkmates its opponent, and that something bad has happened when it gets checkmated. This kind of feedback is called a **reward** or **reinforcement**. Reinforcement learning differs from supervised learning in that supervised training data carries the correct answer as a label, whereas in reinforcement learning there is no such label: the agent must decide for itself what to do to perform the given task. Without a training dataset, it has to learn from its own experience.

Usually, in game playing, it is very hard for a human to provide accurate and consistent evaluations of a large number of positions. Therefore, the program is told when it has won or lost, and the agent uses this information to learn a reasonably accurate evaluation function.

## Some Definitions

Let's have a look at some important concepts before proceeding further: 

* **Reward** ($R$): A reward is the feedback by which we measure the success or failure of an agent’s actions. From any given state, an agent sends output in the form of actions to the environment, and the environment returns the agent’s new state (which resulted from acting on the previous state) as well as rewards, if there are any. They effectively evaluate the agent’s action.
* **Policy** ($\pi$): The policy is the strategy that the agent employs to determine the next action based on the current state. It maps states to actions, ideally the actions that promise the highest reward. The policy that yields the highest expected utility is known as an **optimal policy**. We use $\pi^*$ to denote an optimal policy.
* **Discount factor** ($\gamma$): The discount factor is multiplied by future rewards as discovered by the agent in order to dampen these rewards’ effect on the agent’s choice of action. It is designed to make future rewards worth less than immediate rewards. If $\gamma$ is 0.8 and there’s a reward of 10 points after 3 time steps, the present value of that reward is $0.8^3 \times 10 = 5.12$. A discount factor of 1 would make future rewards worth just as much as immediate rewards.
* **Transition model**: The transition model describes the outcome of each action in each state. If the outcomes are stochastic, we write $P(s'|s,a)$ to denote the probability of reaching state $s'$ if the action $a$ is done in state $s$. We'll assume the transitions are **Markovian** i.e. the probability of reaching $s'$ from $s$ depends only on $s$ and not on the history of earlier states. 
* **Utility** ($U(s)$): The utility of a state $s$ is defined to be the expected sum of discounted rewards if the policy $\pi$ is followed from $s$ onward.
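
As a quick check of the arithmetic in the discount-factor definition, the following minimal sketch computes the present value of a single delayed reward and the discounted sum of a reward sequence. The class and method names are illustrative only and are not part of the aima-java repository.

```java
class DiscountedReturn {
    // Present value of one reward received `steps` time steps from now.
    static double presentValue(double reward, double gamma, int steps) {
        return Math.pow(gamma, steps) * reward;
    }

    // Discounted sum r_0 + gamma * r_1 + gamma^2 * r_2 + ... of a reward sequence.
    static double discountedSum(double[] rewards, double gamma) {
        double total = 0.0;
        double weight = 1.0; // holds gamma^t for the current time step t
        for (double r : rewards) {
            total += weight * r;
            weight *= gamma;
        }
        return total;
    }

    public static void main(String[] args) {
        // The example from the text: gamma = 0.8, reward 10 after 3 steps.
        System.out.println(presentValue(10.0, 0.8, 3));
        // A hypothetical reward sequence, with and without discounting.
        double[] rewards = {-0.04, -0.04, 1.0};
        System.out.println(discountedSum(rewards, 0.8));
        System.out.println(discountedSum(rewards, 1.0));
    }
}
```

With $\gamma = 0.8$ the delayed reward of 10 is worth about 5.12 now; with $\gamma = 1$ the discounted sum is just the plain sum of the rewards.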

## Passive Reinforcement Learning

In passive learning, the agent's policy $\pi$ is fixed: in state $s$, it always executes the action $\pi(s)$. Its goal is to learn how good the policy is, that is, to learn the utility function $U^{\pi}(s)$. Note that the passive learning agent does not know the transition model $P(s'|s,a)$, which specifies the probability of reaching state $s'$ from state $s$ after doing action $a$; nor does it know the reward function $R(s)$, which specifies the reward for each state. The agent executes a set of trials in the environment using its policy $\pi$. In each trial, the agent begins in the start position and experiences a sequence of state transitions until it reaches one of the terminal states. Its percepts supply both the current state and the reward received in that state. The objective is to use the information about the rewards to learn the expected utility $U^{\pi}(s)$ associated with each non-terminal state $s$.

The utility values obey the Bellman equation for a fixed policy $\pi$, i.e. _the utility of each state equals its own reward plus the expected discounted utility of its successor states_:

$U^{\pi}(s) = R(s) + \gamma\sum_{s'}P(s' | s,\pi(s))U^\pi(s')$
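
To make the fixed-policy Bellman equation concrete, here is a minimal sketch that solves it by repeated substitution on a hypothetical three-state chain. The chain, its probabilities, and the class name `FixedPolicyEvaluation` are assumptions made for illustration; this is not the repository's `ModifiedPolicyEvaluation`.

```java
class FixedPolicyEvaluation {
    // Repeatedly applies U(s) <- R(s) + gamma * sum_{s'} P(s'|s,pi(s)) * U(s').
    // p[s][t] holds P(t | s, pi(s)); a terminal state has an all-zero row,
    // so its utility is just its own reward.
    static double[] evaluate(double[][] p, double[] r, double gamma, int sweeps) {
        int n = r.length;
        double[] u = new double[n]; // utilities start at zero
        for (int k = 0; k < sweeps; k++) {
            double[] next = new double[n];
            for (int s = 0; s < n; s++) {
                double expected = 0.0;
                for (int t = 0; t < n; t++) {
                    expected += p[s][t] * u[t];
                }
                next[s] = r[s] + gamma * expected;
            }
            u = next;
        }
        return u;
    }

    public static void main(String[] args) {
        // Hypothetical chain 0 -> 1 -> 2 (terminal): the action succeeds with
        // probability 0.8 and leaves the agent in place with probability 0.2.
        double[][] p = {
            {0.2, 0.8, 0.0},
            {0.0, 0.2, 0.8},
            {0.0, 0.0, 0.0}
        };
        double[] r = {-0.04, -0.04, 1.0};
        double[] u = evaluate(p, r, 1.0, 100);
        for (int s = 0; s < u.length; s++) {
            System.out.println("U(" + s + ") = " + u[s]);
        }
    }
}
```

For this chain the equations can also be solved by hand: $U(2) = 1$, then $U(1) = -0.04 + 0.8 \cdot 1 + 0.2\,U(1)$ gives $U(1) = 0.95$, and likewise $U(0) = 0.9$. The iteration converges to the same values.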

### Adaptive Dynamic Programming

An adaptive dynamic programming agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using a dynamic programming method. For a passive learning agent, this means plugging a learned transition model $P(s'|s,\pi(s))$ and the observed reward $R(s)$ into the Bellman equation to calculate the utilities of states.

Let's have a look at the pseudocode of the Passive ADP agent:

In [17]:
%%python
from notebookUtils import *
pseudocode('Passive ADP Agent')

### AIMA3e
__function__ Passive-ADP-Agent(_percept_) __returns__ an action  
&emsp;__inputs__: _percept_, a percept indicating the current state _s'_ and reward signal _r'_  
&emsp;__persistent__: _&pi;_, a fixed policy  
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_mdp_, an MDP with model _P_, rewards _R_, discount &gamma;  
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_U_, a table of utilities, initially empty  
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>sa</sub>_, a table of frequencies for state-action pairs, initially zero  
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_N<sub>s'|sa</sub>_, a table of outcome frequencies given state-action pairs, initially zero  
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;_s_, _a_, the previous state and action, initially null  
&emsp;__if__ _s'_ is new __then__ _U_[_s'_] &larr; _r'_; _R_[_s'_] &larr; _r'_  
&emsp;__if__ _s_ is not null __then__  
&emsp;&emsp;&emsp;increment _N<sub>sa</sub>_[_s_, _a_] and _N<sub>s'|sa</sub>_[_s'_, _s_, _a_]  
&emsp;&emsp;&emsp;__for each__ _t_ such that _N<sub>s'|sa</sub>_[_t_, _s_, _a_] is nonzero __do__  
&emsp;&emsp;&emsp;&emsp;&emsp;_P_(_t_ | _s_, _a_) &larr; _N<sub>s'|sa</sub>_[_t_, _s_, _a_] / _N<sub>sa</sub>_[_s_, _a_]  
&emsp;_U_ &larr; Policy-Evaluation(_&pi;_, _U_, _mdp_)  
&emsp;__if__ _s'_.Terminal? __then__ _s_, _a_ &larr; null __else__ _s_, _a_ &larr; _s'_, _&pi;_[_s'_]  

---
__Figure ??__ A passive reinforcement learning agent based on adaptive dynamic programming. The Policy-Evaluation function solves the fixed-policy Bellman equations, as described on page ??.
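
The model-learning step of the pseudocode (incrementing the two frequency tables and dividing one by the other) can be sketched in isolation as follows. This is a simplified illustration with string-keyed tables; the class `TransitionCounts` and its methods are hypothetical and are not the data structures used by `PassiveADPAgent`.

```java
import java.util.HashMap;
import java.util.Map;

class TransitionCounts {
    // N_sa: how often action a was taken in state s.
    private final Map<String, Integer> nsa = new HashMap<>();
    // N_s'|sa: how often taking a in s led to state s'.
    private final Map<String, Integer> nsas = new HashMap<>();

    // Record one observed transition (s, a) -> next.
    void observe(String s, String a, String next) {
        nsa.merge(s + "|" + a, 1, Integer::sum);
        nsas.merge(s + "|" + a + "|" + next, 1, Integer::sum);
    }

    // Maximum-likelihood estimate of P(next | s, a); 0 if (s, a) was never tried.
    double probability(String next, String s, String a) {
        int total = nsa.getOrDefault(s + "|" + a, 0);
        if (total == 0) {
            return 0.0;
        }
        return nsas.getOrDefault(s + "|" + a + "|" + next, 0) / (double) total;
    }

    public static void main(String[] args) {
        TransitionCounts counts = new TransitionCounts();
        // Hypothetical experience: Up from [1,1] worked 3 times and slipped once.
        counts.observe("[1,1]", "Up", "[1,2]");
        counts.observe("[1,1]", "Up", "[1,2]");
        counts.observe("[1,1]", "Up", "[2,1]");
        counts.observe("[1,1]", "Up", "[1,2]");
        System.out.println(counts.probability("[1,2]", "[1,1]", "Up"));
        System.out.println(counts.probability("[2,1]", "[1,1]", "Up"));
    }
}
```

As more transitions are observed, these relative frequencies approach the true transition probabilities, which is what lets the agent plug the learned model into the Bellman equations.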

Let's see our Passive ADP agent in action! Consider a $4 \times 3$ cell world with $[1,1]$ as the starting position. The policy $\pi$ for the $4 \times 3$ world is shown in the figure below. This policy happens to be optimal for rewards of $R(s)=-0.04$ in the non-terminal states and no discounting.

[![Optimal Policy][1]][1]

[1]: assets/optimal-policy.png

In [22]:
import aima.core.environment.cellworld.*;
import aima.core.learning.reinforcement.agent.PassiveADPAgent;
import aima.core.learning.reinforcement.example.CellWorldEnvironment;
import aima.core.probability.example.MDPFactory;
import aima.core.probability.mdp.impl.ModifiedPolicyEvaluation;
import aima.core.util.JavaRandomizer;

import java.util.*;

CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();
CellWorldEnvironment cwe = new CellWorldEnvironment(
            cw.getCellAt(1, 1),
            cw.getCells(),
            MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),
            new JavaRandomizer());
Map<Cell<Double>, CellWorldAction> fixedPolicy = new HashMap<Cell<Double>, CellWorldAction>();
fixedPolicy.put(cw.getCellAt(1, 1), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(2, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(2, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(3, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(3, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(3, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(4, 1), CellWorldAction.Left);
PassiveADPAgent<Cell<Double>, CellWorldAction> padpa = new PassiveADPAgent<Cell<Double>, CellWorldAction>(
                                                                fixedPolicy,
                                                                cw.getCells(), 
                                                                cw.getCellAt(1, 1), 
                                                                MDPFactory.createActionsFunctionForFigure17_1(cw),
                                                                new ModifiedPolicyEvaluation<Cell<Double>, CellWorldAction>(10,1.0));
cwe.addAgent(padpa);
padpa.reset();
cwe.executeTrials(2000);

Map<Cell<Double>, Double> U = padpa.getUtility();
for(int i = 1; i<=4; i++){
    for(int j = 1; j<=3; j++){
        if(i==2 && j==2) continue; //Ignore wall
        System.out.println("[" + i + "," + j + "]"  + " \t:\t" + U.get(cw.getCellAt(i,j)));
    }
}

[1,1] 	:	0.7128593117885544
[1,2] 	:	0.7680398391451688
[1,3] 	:	0.8178806550835265
[2,1] 	:	0.6628583416987663
[2,3] 	:	0.8746799974574001
[3,1] 	:	null
[3,2] 	:	0.6938189410949245
[3,3] 	:	0.9241799994408929
[4,1] 	:	null
[4,2] 	:	-1.0
[4,3] 	:	1.0


null

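Note that the utilities of cells $[3,1]$ and $[4,1]$ are `null` because these cells are not reachable when starting at $[1,1]$ and following the given policy under the default transition model, i.e. an 80% chance of moving in the intended direction and a 10% chance of moving at each right angle to it.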
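
The transition model mentioned above can be written down as a small function from the intended direction to a distribution over actual directions. The sketch below is only an illustration; the `StochasticMoves` class and its `Dir` enum are hypothetical and independent of the repository's `CellWorldAction`.

```java
import java.util.EnumMap;
import java.util.Map;

class StochasticMoves {
    // Directions listed clockwise so that neighbours in the enum are at right angles.
    enum Dir { Up, Right, Down, Left }

    // 80% intended direction, 10% for each direction at a right angle to it.
    static Map<Dir, Double> outcomes(Dir intended) {
        Dir[] all = Dir.values();
        Dir clockwise = all[(intended.ordinal() + 1) % 4];
        Dir counterClockwise = all[(intended.ordinal() + 3) % 4];
        Map<Dir, Double> dist = new EnumMap<>(Dir.class);
        dist.put(intended, 0.8);
        dist.put(clockwise, 0.1);
        dist.put(counterClockwise, 0.1);
        return dist;
    }

    public static void main(String[] args) {
        // Intending to go Up: mostly Up, sometimes Left or Right, never Down.
        System.out.println(outcomes(Dir.Up));
    }
}
```

Under this model, an agent that intends to move `Left` can only move left, up, or down, which is why a policy that always heads left along the bottom row never carries the agent further right.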

The learning curves of the Passive ADP agent for the $4 \times 3$ world (given the optimal policy) are shown below.

In [25]:
import aima.core.environment.cellworld.*;
import aima.core.learning.reinforcement.agent.PassiveADPAgent;
import aima.core.learning.reinforcement.example.CellWorldEnvironment;
import aima.core.probability.example.MDPFactory;
import aima.core.probability.mdp.impl.ModifiedPolicyEvaluation;
import aima.core.util.JavaRandomizer;

import java.util.*;

int numRuns = 20;
int numTrialsPerRun = 100;
int rmseTrialsToReport = 100;
int reportEveryN = 1;

CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();
CellWorldEnvironment cwe = new CellWorldEnvironment(
            cw.getCellAt(1, 1),
            cw.getCells(),
            MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),
            new JavaRandomizer());
Map<Cell<Double>, CellWorldAction> fixedPolicy = new HashMap<Cell<Double>, CellWorldAction>();
fixedPolicy.put(cw.getCellAt(1, 1), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(2, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(2, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(3, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(3, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(3, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(4, 1), CellWorldAction.Left);
PassiveADPAgent<Cell<Double>, CellWorldAction> padpa = new PassiveADPAgent<Cell<Double>, CellWorldAction>(
                                                                fixedPolicy,
                                                                cw.getCells(), 
                                                                cw.getCellAt(1, 1), 
                                                                MDPFactory.createActionsFunctionForFigure17_1(cw),
                                                                new ModifiedPolicyEvaluation<Cell<Double>, CellWorldAction>(10,1.0));
cwe.addAgent(padpa);
Map<Integer, List<Map<Cell<Double>, Double>>> runs = new HashMap<Integer, List<Map<Cell<Double>, Double>>>();
for (int r = 0; r < numRuns; r++) {
    padpa.reset();
    List<Map<Cell<Double>, Double>> trials = new ArrayList<Map<Cell<Double>, Double>>();
    for (int t = 0; t < numTrialsPerRun; t++) {
        cwe.executeTrial();
        if (0 == t % reportEveryN) {
            Map<Cell<Double>, Double> u = padpa.getUtility();
            trials.add(u);
        }
    }
    runs.put(r, trials);
}

null
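
The cell above only stores the utility estimates after each trial. One common way to summarize such data as a learning curve is the root-mean-square (RMS) error of the estimates against reference utilities. The sketch below shows that computation under the assumption that reference values are available from elsewhere; the class name, the string state keys, and the numbers in `main` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class UtilityRmse {
    // RMS error over the states of `reference` that have an estimate.
    // States without an estimate (missing or null, e.g. never visited) are skipped.
    static double rmse(Map<String, Double> estimate, Map<String, Double> reference) {
        double sumSquares = 0.0;
        int count = 0;
        for (Map.Entry<String, Double> entry : reference.entrySet()) {
            Double est = estimate.get(entry.getKey());
            if (est == null) {
                continue;
            }
            double diff = est - entry.getValue();
            sumSquares += diff * diff;
            count++;
        }
        return count == 0 ? Double.NaN : Math.sqrt(sumSquares / count);
    }

    public static void main(String[] args) {
        // Hypothetical numbers, chosen only to exercise the method.
        Map<String, Double> reference = new HashMap<>();
        reference.put("A", 0.5);
        reference.put("B", 1.0);
        reference.put("C", 0.0);
        Map<String, Double> estimate = new HashMap<>();
        estimate.put("A", 0.7);
        estimate.put("B", 1.0);
        estimate.put("C", null); // state never visited
        System.out.println(rmse(estimate, reference));
    }
}
```

Applying such a function to each stored utility map, and averaging over runs, would give one point of the curve per reported trial.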