HRR - Reinforcement learning
===========

Distal reward problem solution using HRRs
----------------


In [11]:
from core import HRR

Imagine an experiment in which a mouse starts in the middle of a corridor and observes a light that is either red or green. It then chooses to go left or right to the end of the corridor, where there may or may not be cheese. The light indicates where the cheese is: a red light means the cheese is on the left, and a green light means it is on the right.

Let's create symbolic representations for the sensory inputs, actions and reward signals:

In [23]:
HRR.reset_kernel()
HRR.default_size = 1000
HRR.verbose = False
red_light, green_light = HRR("red_light"), HRR("green_light")
left_turn, right_turn = HRR("left_turn"), HRR("right_turn")
reward, punishment = HRR("reward"), HRR("punishment")

In the first run the mouse sees a red light, chooses to go left and receives a reward, which is memorized like this:

In [24]:
run1 = red_light * left_turn * reward
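
As a rough illustration of what such an expression might compute, here is a minimal sketch in plain Python. It assumes, as in Plate's holographic reduced representations, that binding is circular convolution and that probing is circular correlation. The helper names (`random_vector`, `bind`, `unbind`, `cosine`) are hypothetical and are not the API of `core.HRR`, whose internals may differ.

```python
import math
import random


def random_vector(n, rng):
    """Random vector with i.i.d. N(0, 1/n) elements (norm close to 1)."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    """Circular convolution: c[i] = sum_j a[j] * b[(i - j) mod n]."""
    n = len(a)
    # Python's negative indexing makes b[i - j] wrap around the end of b.
    return [sum(a[j] * b[i - j] for j in range(n)) for i in range(n)]


def unbind(trace, cue):
    """Circular correlation: bind the trace with the involution of the cue."""
    approx_inverse = cue[:1] + cue[:0:-1]  # cue*[i] = cue[-i mod n]
    return bind(trace, approx_inverse)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 1024                # assumed dimensionality, chosen for this sketch only
red_light, left_turn, right_turn, reward = (random_vector(n, rng) for _ in range(4))

# Same structure as run1: stimulus bound with action bound with outcome.
run1 = bind(bind(red_light, left_turn), reward)

# Probe: remove the reward, then the stimulus; what remains resembles the action.
decoded = unbind(unbind(run1, reward), red_light)
sim_left = cosine(decoded, left_turn)
sim_right = cosine(decoded, right_turn)
print("similarity to left_turn:  {:.2f}".format(sim_left))
print("similarity to right_turn: {:.2f}".format(sim_right))
```

The decoded vector is only a noisy copy of `left_turn`, so its similarity to `left_turn` stays well below 1, while its similarity to an unrelated symbol such as `right_turn` should stay close to 0.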

Using just this experience, we can check what policy the mouse has learned from a single trial:

In [25]:
print("Seeing red_light mouse would choose action: {}".format((run1 % reward) / red_light))
print("Seeing green_light mouse would choose action: {}".format((run1 % reward) / green_light))

Seeing red_light mouse would choose action: left_turn
Seeing green_light mouse would choose action: red_light


As we can see, the second answer doesn't make sense, since the mouse hasn't explored that case yet.
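
One way to see why an unexplored cue still produces an answer is that a clean-up step which always returns the nearest known symbol will return something even when nothing matches well. The sketch below is an assumption about how such a clean-up could be guarded, not a description of `core.HRR`: the helpers, the `cleanup` function and the threshold value of 0.2 are all hypothetical.

```python
import math
import random


def random_vector(n, rng):
    """Random vector with i.i.d. N(0, 1/n) elements (norm close to 1)."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    """Circular convolution; b[i - j] wraps via Python's negative indexing."""
    n = len(a)
    return [sum(a[j] * b[i - j] for j in range(n)) for i in range(n)]


def unbind(trace, cue):
    """Circular correlation: bind the trace with the involution of the cue."""
    return bind(trace, cue[:1] + cue[:0:-1])


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def cleanup(vector, vocabulary, threshold=0.2):
    """Return the best-matching symbol name, or None if nothing is similar enough."""
    best_name, best_sim = None, threshold
    for name, item in vocabulary.items():
        sim = cosine(vector, item)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 1024                # assumed dimensionality for this sketch
names = ["red_light", "green_light", "left_turn", "right_turn", "reward", "punishment"]
vocab = {name: random_vector(n, rng) for name in names}

# The single remembered trial: red light, left turn, reward.
run1 = bind(bind(vocab["red_light"], vocab["left_turn"]), vocab["reward"])

without_reward = unbind(run1, vocab["reward"])
seen = cleanup(unbind(without_reward, vocab["red_light"]), vocab)
unseen = cleanup(unbind(without_reward, vocab["green_light"]), vocab)
print("red_light ->", seen)
print("green_light ->", unseen)
```

With a threshold in place, the probe for the explored cue is expected to decode to `left_turn`, while the probe for the unexplored cue is expected to be rejected instead of being mapped to an arbitrary symbol. The threshold is a tuning choice and depends on the vector size and on how many items are stored.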

Now suppose the mouse had four chances to perform the experiment, (ideally) covering the whole state-action space:

In [26]:
run2 = red_light * right_turn * punishment
run3 = green_light * left_turn * punishment
run4 = green_light * right_turn * reward

all_runs = run1 + run2 + run3 + run4

We repeat the query from the previous step, asking the complete memory all_runs for the optimal action for each sensory input in order to reach the reward:

In [27]:
print("Seeing red_light mouse would choose action: {}".format(all_runs / (reward * red_light)))
print("Seeing green_light mouse would choose action: {}".format(all_runs / (reward * green_light)))

Seeing red_light mouse would choose action: left_turn
Seeing green_light mouse would choose action: right_turn


As we can see, the mouse knows how to navigate the maze optimally, and the sensor->action policy is stored in the all_runs HRR.
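
The same policy read-out can be sketched with plain Python lists. This is a hedged illustration under the assumption that binding is circular convolution, superposition is element-wise addition and probing is circular correlation; all helper names are hypothetical and do not come from `core.HRR`. Unlike the query above, the sketch compares the decoded vector only against the two action symbols, which is a simplification made here.

```python
import math
import random


def random_vector(n, rng):
    """Random vector with i.i.d. N(0, 1/n) elements (norm close to 1)."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    """Circular convolution; b[i - j] wraps via Python's negative indexing."""
    n = len(a)
    return [sum(a[j] * b[i - j] for j in range(n)) for i in range(n)]


def unbind(trace, cue):
    """Circular correlation: bind the trace with the involution of the cue."""
    return bind(trace, cue[:1] + cue[:0:-1])


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


rng = random.Random(2)  # fixed seed so the sketch is repeatable
n = 1024                # assumed dimensionality for this sketch
names = ["red_light", "green_light", "left_turn", "right_turn", "reward", "punishment"]
vocab = {name: random_vector(n, rng) for name in names}

# The four trials from the text as (stimulus, action, outcome) triples.
runs = [
    ("red_light", "left_turn", "reward"),
    ("red_light", "right_turn", "punishment"),
    ("green_light", "left_turn", "punishment"),
    ("green_light", "right_turn", "reward"),
]
traces = [bind(bind(vocab[s], vocab[a]), vocab[o]) for s, a, o in runs]

# Superposition: element-wise sum of all trial traces.
all_runs = [sum(values) for values in zip(*traces)]


def best_action(light):
    """Probe the memory with reward * light and pick the closer action symbol."""
    decoded = unbind(all_runs, bind(vocab["reward"], vocab[light]))
    return max(("left_turn", "right_turn"), key=lambda a: cosine(decoded, vocab[a]))


policy = {light: best_action(light) for light in ("red_light", "green_light")}
print(policy)
```

Each additional trace stored in `all_runs` adds crosstalk noise to every read-out, so the vector size has to grow with the number of remembered trials for the decoding to stay reliable.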

Let's assume that there were two signal lights at the beginning: one with the same colors and function as in the original experiment, and another that is a completely irrelevant distractor. The second light can be yellow_light or blue_light, which is selected randomly at each run, and the cheese position doesn't depend on it.

In this case, the mouse will memorize both perceived stimuli (red/green + yellow/blue), so the memory after a couple of runs might look like this:

In [28]:
yellow_light, blue_light = HRR("yellow_light"), HRR("blue_light")

run1 = (red_light + yellow_light) * left_turn * reward 
run2 = (red_light + blue_light) * left_turn * reward
run3 = (red_light + yellow_light) * right_turn * punishment
run4 = (green_light + yellow_light) * right_turn * reward
#run5 = green_light * blue_light * right_turn * reward
all_runs = run1 + run2 + run3 + run4  # + run5

HRR.verbose = True

In [29]:
print("Common sensory stimuli was: {}".format(all_runs / (reward * (green_light + blue_light))))

Distance from green_light is 0.0549922588493
Distance from right_turn is 0.131932534883
Distance from blue_light is -0.0802718462334
Distance from red_light is 0.0955334244916
Distance from yellow_light is 0.00319471517103
Distance from left_turn is 0.11997377468
Distance from reward is 0.0637256176354
Distance from punishment is -0.00664726946128
Common sensory stimuli was: {'left_turn': 0.11997377467958976, 'right_turn': 0.13193253488347595}


HRR - Action selection
====

Let's define a set of symbolic sensory input values for the **tactile** sense (touch_left, touch_right, no_touch) and the **eyes** (cheese_left, cheese_right, no_cheese), and a set of **motor actions** for wheel rotation (forward, backward). A "program" for approaching the cheese and moving away from a wall would look as follows:

    if (eyes == cheese_left):
        left_wheel = forward
        right_wheel = backward

    if (eyes == cheese_right):
        left_wheel = backward
        right_wheel = forward

    if (tactile == touch_left):
        left_wheel = backward
        right_wheel = forward

    if (tactile == touch_right):
        left_wheel = forward
        right_wheel = backward

    



In [30]:
from core import HRR

We will declare all of our symbols first:

In [31]:
HRR.reset_kernel()
HRR.default_size = 100
HRR.verbose = False

forward = HRR("forward")
backward = HRR("backward")
no_motion = HRR("no_motion")
touch_left, touch_right, no_touch = HRR("touch_left"), HRR("touch_right"), HRR("no_touch")
cheese_left, cheese_right, no_cheese = HRR("cheese_left"), HRR("cheese_right"), HRR("no_cheese")

left_wheel = HRR("left_wheel")
right_wheel = HRR("right_wheel")

Now we will write programs that handle the touch_left and touch_right events from the tactile sensors separately.

\begin{equation}
\mathit{left\_touch\_program} = T_L \otimes (A_F \otimes W_L + A_B \otimes W_R)
\end{equation}

For the touch_left event we want to set left_wheel to forward and right_wheel to backward, so that we move away from the obstacle. We do this using the **binding operation**. For touch_right we want the opposite, which again results in motion that drives the agent away from the obstacle.
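
Written in the same notation as the equation for left_touch_program, with $\otimes$ denoting binding, the opposite rule and the combined program can be sketched as follows. The symbol $T_R$ for touch_right is a label introduced here by analogy with $T_L$; it does not appear elsewhere in this notebook.

\begin{equation}
\mathit{right\_touch\_program} = T_R \otimes (A_B \otimes W_L + A_F \otimes W_R)
\end{equation}

\begin{equation}
\mathit{avoidance\_program} = \mathit{left\_touch\_program} + \mathit{right\_touch\_program}
\end{equation}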

Combining these two cases into a single program is done with **superposition**:  

In [43]:
left_touch_program = touch_left * (left_wheel * forward + right_wheel * backward)

right_touch_program = touch_right * (left_wheel * backward + right_wheel * forward)

avoidance_program = left_touch_program + right_touch_program

print(avoidance_program.memory)

[ 0.44537222  0.35882483 -0.23232629  0.09366966 -0.28087172 -0.48398365
 -0.4333629   0.27101847  0.28315966  0.32739683  0.11767    -0.05710428
  0.02700785 -0.399639   -0.23841391  0.06135454  0.31311104  0.14712888
  0.24306917  0.27911278 -0.42853367 -0.3276419   0.03617653  0.16088609
  0.21274095  0.20611933  0.32175012 -0.07516708 -0.06277486 -0.4443288
 -0.15842378 -0.18774508  0.34443934  0.26499677  0.32492067  0.13058903
 -0.42433628 -0.31470214 -0.41283554 -0.27447538  0.05107525  0.45240108
  0.10323443  0.08975298  0.02096106 -0.0761948  -0.47203608 -0.44010774
  0.07624263 -0.10761523  0.25971956  0.03903721 -0.27842012 -0.24407275
 -0.52718159  0.03637453  0.03172395  0.48716245  0.06852792  0.31758278
 -0.20993405 -0.31811356 -0.38908562 -0.55858502 -0.11570875  0.19025514
  0.13819692  0.47863232 -0.10351411  0.27072206 -0.5578898  -0.22527813
 -0.12253516  0.12879428  0.14031171  0.13819048  0.34170957 -0.03874256
 -0.22116381 -0.40607057  0.10114132 -0.43566521  0.

Now we can test the avoidance program. Let's assume we have detected the symbolic input "touch_right" from our tactile sensor. To read all of the actions from the avoidance program for this sensory input, we probe it:

In [44]:
left_wheel_action = avoidance_program / (touch_right * left_wheel)
print('Left wheel: {}'.format(left_wheel_action))

right_wheel_action = avoidance_program / (touch_right * right_wheel)
print('Right wheel: {}'.format(right_wheel_action))

Left wheel: backward
Right wheel: forward
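
The whole sense-then-act step can also be sketched without the HRR class. The following is a minimal illustration assuming circular convolution for binding, element-wise addition for superposition and circular correlation for probing. The helpers `rule` and `select_actions` are hypothetical names introduced for this sketch, and decoding is restricted to the two candidate actions.

```python
import math
import random


def random_vector(n, rng):
    """Random vector with i.i.d. N(0, 1/n) elements (norm close to 1)."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    """Circular convolution; b[i - j] wraps via Python's negative indexing."""
    n = len(a)
    return [sum(a[j] * b[i - j] for j in range(n)) for i in range(n)]


def unbind(trace, cue):
    """Circular correlation: bind the trace with the involution of the cue."""
    return bind(trace, cue[:1] + cue[:0:-1])


def add(a, b):
    """Superposition: element-wise sum."""
    return [x + y for x, y in zip(a, b)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


rng = random.Random(3)  # fixed seed so the sketch is repeatable
n = 1024                # assumed dimensionality for this sketch
names = ["touch_left", "touch_right", "left_wheel", "right_wheel", "forward", "backward"]
v = {name: random_vector(n, rng) for name in names}


def rule(sensor, left_action, right_action):
    """sensor * (left_wheel * left_action + right_wheel * right_action)."""
    wheels = add(bind(v["left_wheel"], v[left_action]),
                 bind(v["right_wheel"], v[right_action]))
    return bind(v[sensor], wheels)


# Same two rules as in the notebook cell above, superposed into one vector.
avoidance_program = add(rule("touch_left", "forward", "backward"),
                        rule("touch_right", "backward", "forward"))


def select_actions(sensor):
    """Probe the program once per wheel and pick the closer action symbol."""
    actions = {}
    for wheel in ("left_wheel", "right_wheel"):
        decoded = unbind(avoidance_program, bind(v[sensor], v[wheel]))
        actions[wheel] = max(("forward", "backward"),
                             key=lambda name: cosine(decoded, v[name]))
    return actions


print(select_actions("touch_right"))
```

Calling `select_actions("touch_left")` reads the other rule out of the same vector, which is the point of storing the program as a superposition.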


We can also perform the query as a two-step process, where we first obtain a joint representation of all actions that need to be performed when touch_right has been sensed:

In [40]:
all_touch_right_actions = avoidance_program % touch_right
print(all_touch_right_actions.memory)

[-0.09627867  0.23448294  0.50859683  0.60035041  0.33385294  0.10352229
 -0.439971   -0.57979944 -0.3372838  -0.03248078  0.26646285  0.39193102
  0.15599485  0.00673835  0.00685249 -0.43764906 -0.34562794 -0.1651218
  0.61955286  0.25049943  0.4647431  -0.32108515 -0.03686391 -0.43372397
 -0.66189257 -0.11845402  0.43034372  0.38254897  0.15680907  0.37838338
  0.0410214  -0.73979956 -0.45604494 -0.10594229  0.18118695  0.14780548
  0.50614592  0.34397667 -0.3482567  -0.34230276 -0.67846352 -0.02635783
 -0.05126454  0.12835354  0.28635477  0.48716178 -0.01463486 -0.36855184
 -0.50570439 -0.42900249  0.0996987   0.11855436  0.21533274  0.25632106
 -0.18005986 -0.22606901 -0.46614787 -0.16953232 -0.01407183 -0.1034648
  0.33864231  0.57970582  0.08234457 -0.48757132 -0.36689228 -0.59486836
 -0.44901753 -0.17832053  0.27589703  0.30626424  0.18001431 -0.01878266
 -0.194415   -0.30740008 -0.44504998 -0.373239    0.3217952   0.29504135
 -0.04777484  0.10611516 -0.25404163 -0.38611481 -0.4

In order to determine the individual actions for the wheels, we probe the result of the previous operation with the respective symbolic representations:

In [41]:
left_wheel_action = all_touch_right_actions / left_wheel
print('Action for left wheel: {}'.format(left_wheel_action))
right_wheel_action = all_touch_right_actions / right_wheel
print('Action for right wheel: {}'.format(right_wheel_action))

Action for left wheel: backward
Action for right wheel: forward
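
The one-step and two-step queries give the same answers here. If probing is implemented as circular correlation (an assumption about the internals, not something confirmed by `core.HRR`), this is expected, because the involution of a bound pair equals the binding of the two involutions. The small numerical check below uses hypothetical helpers and arbitrary random vectors standing in for the program, the sensor symbol and the wheel symbol.

```python
import math
import random


def random_vector(n, rng):
    """Random vector with i.i.d. N(0, 1/n) elements."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    """Circular convolution; b[i - j] wraps via Python's negative indexing."""
    n = len(a)
    return [sum(a[j] * b[i - j] for j in range(n)) for i in range(n)]


def unbind(trace, cue):
    """Circular correlation: bind the trace with the involution of the cue."""
    return bind(trace, cue[:1] + cue[:0:-1])


rng = random.Random(4)  # fixed seed so the sketch is repeatable
n = 64                  # small size is enough: the identity holds for any n
program, sensor, wheel = (random_vector(n, rng) for _ in range(3))

# Two-step query: remove the sensor first, then the wheel.
two_step = unbind(unbind(program, sensor), wheel)

# One-step query: remove the bound pair sensor * wheel at once.
one_step = unbind(program, bind(sensor, wheel))

max_difference = max(abs(x - y) for x, y in zip(two_step, one_step))
print("largest element-wise difference: {:.1e}".format(max_difference))
```

Under these assumptions the two results differ only by floating-point rounding, so the two-step form is mainly useful when the intermediate joint representation is needed for several follow-up probes.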
