# Reinforcement Learning
<hr>

Reinforcement learning is about learning the optimal behavior in an environment in order to obtain the maximum reward. Given a set of rewards or punishments, the agent learns which actions to take in the future.

![reinforcement](./images/reinforcement.PNG)

**Agent**
- The environment gives the agent a state
- The agent performs an action
- The environment gives back a new state and a reward (or punishment)

**N.B.:** This is, for example, how robots can be taught to walk.
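
The agent/environment interaction described above can be sketched as a simple loop. This is only an illustration: `CounterEnv`, `step`, and `run_episode` are hypothetical names invented for this sketch, and the reward rule is an arbitrary assumption.

```python
class CounterEnv:
    """Hypothetical toy environment: the state is an integer counter."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is +1 or -1; assumed rule: +1 reward for moving up, -1 otherwise
        self.state += action
        reward = 1 if action == 1 else -1
        return self.state, reward


def run_episode(env, policy, n_steps):
    """The agent acts, the environment answers with a new state and a reward."""
    total_reward = 0
    state = env.state
    for _ in range(n_steps):
        action = policy(state)            # the agent performs an action
        state, reward = env.step(action)  # the environment returns state and reward
        total_reward += reward
    return state, total_reward


final_state, total = run_episode(CounterEnv(), lambda s: 1, n_steps=5)
print(final_state, total)  # 5 5
```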

## Markov Decision Process
- A model for decision-making that represents states, actions, and their rewards.
- Set of states $S$
- Set of actions $Actions(s)$
- Transition model $P(s' | s, a)$
- Reward function $R(s, a, s')$
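
As a rough illustration, the four components can be written down for a tiny, made-up three-cell corridor. The corridor, its rewards, and the function names `actions`, `transition`, and `reward` are assumptions for this sketch, not part of the task defined later.

```python
LEFT, RIGHT = 0, 1

# Hypothetical corridor: cell 0 is a losing cell, cell 2 is a winning cell.
states = [0, 1, 2]


def actions(s):
    """Actions available in state s (terminal cells have none)."""
    return [] if s in (0, 2) else [LEFT, RIGHT]


def transition(s, a):
    """P(s' | s, a) as a dict {s': probability}; deterministic in this sketch."""
    s_next = s - 1 if a == LEFT else s + 1
    return {s_next: 1.0}


def reward(s, a, s_next):
    """R(s, a, s'): -1 for reaching the losing cell, +1 for the winning cell."""
    return {0: -1, 2: 1}.get(s_next, 0)


for a in actions(1):
    for s_next, p in transition(1, a).items():
        print(a, s_next, p, reward(1, a, s_next))
```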

## Q-learning (one model)
A method for learning a function $Q(s, a)$, an estimate of the value of performing action $a$ in state $s$.

- Start with $Q(s, a) = 0$ for all $s, a$
- Update $Q$ when we take an action
- $Q(s, a) \leftarrow Q(s, a) + \alpha(\text{reward} + \gamma \max_{a'} Q(s', a') - Q(s, a)) = (1 - \alpha)Q(s, a) + \alpha(\text{reward} + \gamma \max_{a'} Q(s', a'))$

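$\alpha$ - learning rate, $\gamma$ - discount factor for future rewards, $s'$ - the state reached after taking action $a$ in state $s$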
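
To make the update rule concrete, here is a small sketch that applies it to single numbers. The helper name `q_update` and the sample values are assumptions chosen for illustration; `max_q_next` stands for the largest Q-value available in the next state.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-learning update for a single (state, action) pair."""
    return (1 - alpha) * q_sa + alpha * (reward + gamma * max_q_next)


# Start from Q(s, a) = 0 and receive reward 1 with no future value.
q = q_update(0.0, reward=1, max_q_next=0.0, alpha=0.5, gamma=0.5)
print(q)  # 0.5

# Repeating the same experience moves the estimate closer to the target of 1.
q = q_update(q, reward=1, max_q_next=0.0, alpha=0.5, gamma=0.5)
print(q)  # 0.75
```

With $\alpha = 0.5$, each update moves the estimate halfway toward the target, which is why repeated experience is needed before the value settles.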

## $\epsilon$-Greedy Decision Making
**Explore vs Exploit**
- With probability $\epsilon$, take a random move
- Otherwise, take action $a$ with maximum $Q(s, a)$
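
The two rules above can be sketched as a small helper. The function name `epsilon_greedy` and its parameters are assumptions for this example; restricting the greedy choice to the allowed actions is a design choice made here, not something stated in the notes.

```python
import random


def epsilon_greedy(q_row, actions, epsilon):
    """Pick an action from `actions` given the Q-values of the current state."""
    if random.random() < epsilon:
        # Explore: random move among the allowed actions
        return random.choice(actions)
    # Exploit: allowed action with the highest Q-value
    return max(actions, key=lambda a: q_row[a])


print(epsilon_greedy([0.1, 0.9], [0, 1], epsilon=0.0))  # 1 (always greedy)
print(epsilon_greedy([0.1, 0.9], [0], epsilon=0.0))     # 0 (only allowed action)
```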

### Simple Task

![field](./images/field.PNG)

- Start at a random point
- Move left or right
- Avoid the red box (you lose the game)
- Find the green box (you win the game)

![field](./images/field2.PNG)

In [1]:
import random
import numpy as np

In [2]:
class Field:
    def __init__(self):
        # define the negative, neutral, and positive state.
        self.states = [-1,0,0,0,0,0,0,1,0,0,0]
        self.state = random.randrange(0,len(self.states))

In [4]:
field = Field()
field.state

7

In [5]:
class Field:
    def __init__(self):
        # define the negative, neutral, and positive state.
        self.states = [-1,0,0,0,0,0,0,1,0,0,0]
        self.state = random.randrange(0,len(self.states))
        
    def done(self):
        if self.states[self.state] != 0:
            return True
        else:
            return False

In [14]:
field = Field()
field.state, field.done()

(7, True)

In [18]:
class Field:
    def __init__(self):
        # define the negative, neutral, and positive state.
        self.states = [-1,0,0,0,0,0,0,1,0,0,0]
        self.state = random.randrange(0,len(self.states))
        
    def done(self):
        if self.states[self.state] != 0:
            return True
        else:
            return False
        
    def get_possible_action(self):
        # action 0 -> left, action 1 -> right
        actions = [0,1]
        if self.state == 0:
            actions.remove(0)
        elif self.state == len(self.states) -1:
            actions.remove(1)
        return actions

In [25]:
field = Field()
field.state, field.done(), field.get_possible_action()

(10, False, [0])

In [26]:
class Field:
    def __init__(self):
        # define the negative, neutral, and positive state.
        self.states = [-1,0,0,0,0,0,0,1,0,0,0]
        self.state = random.randrange(0,len(self.states))
        
    def done(self):
        if self.states[self.state] != 0:
            return True
        else:
            return False
        
    def get_possible_action(self):
        # action 0 -> left, action 1 -> right
        actions = [0,1]
        if self.state == 0:
            actions.remove(0)
        elif self.state == len(self.states) -1:
            actions.remove(1)
        return actions
    
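    def update_next_state(self, action):
        # Apply the action and return (new state, reward).
        if action == 0:
            if self.state == 0:
                # Moving left off the field: stay in place, penalty of -10
                return self.state, -10
            self.state -= 1
        elif action == 1:
            if self.state == len(self.states) - 1:
                # Moving right off the field: stay in place, penalty of -10
                return self.state, -10
            self.state += 1

        reward = self.states[self.state]
        return self.state, reward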

In [48]:
field = Field()
field.state, field.done(),field.get_possible_action()

(10, False, [0])

In [49]:
field.update_next_state(1)

(10, -10)

In [53]:
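field = Field()
# One row per state, one column per action (0 -> left, 1 -> right)
q_table = np.zeros((len(field.states), 2))

alpha = .5    # learning rate
epsilon = .5  # probability of taking a random (exploratory) action
gamma = .5    # discount factor for future rewards

for _ in range(10000):
    # Each episode starts from a fresh random position
    field = Field()
    while not field.done():
        actions = field.get_possible_action()
        if random.uniform(0, 1) < epsilon:
            # Explore: pick a random valid action
            action = random.choice(actions)
        else:
            # Exploit: pick the action with the highest Q-value
            action = np.argmax(q_table[field.state])

        cur_state = field.state
        next_state, reward = field.update_next_state(action)

        # Q-learning update
        q_table[cur_state, action] = (1 - alpha) * q_table[cur_state, action] + alpha * (reward + gamma * np.max(q_table[next_state]))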

In [54]:
q_table

array([[ 0.      ,  0.      ],
       [-1.      ,  0.03125 ],
       [ 0.015625,  0.0625  ],
       [ 0.03125 ,  0.125   ],
       [ 0.0625  ,  0.25    ],
       [ 0.125   ,  0.5     ],
       [ 0.25    ,  1.      ],
       [ 0.      ,  0.      ],
       [ 1.      ,  0.25    ],
       [ 0.5     ,  0.125   ],
       [ 0.25    ,  0.      ]])
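
One way to read this table is to take the best action per state. The sketch below copies the values printed above and applies `np.argmax` row by row; this greedy readout is an addition for illustration, not a step from the original notebook.

```python
import numpy as np

# Values copied from the q_table output above
q_table = np.array([
    [0.0, 0.0],
    [-1.0, 0.03125],
    [0.015625, 0.0625],
    [0.03125, 0.125],
    [0.0625, 0.25],
    [0.125, 0.5],
    [0.25, 1.0],
    [0.0, 0.0],
    [1.0, 0.25],
    [0.5, 0.125],
    [0.25, 0.0],
])

# Best action per state: 0 -> left, 1 -> right
policy = np.argmax(q_table, axis=1)
print(policy.tolist())  # [0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

States 1 to 6 prefer moving right and states 8 to 10 prefer moving left, so every non-terminal state heads toward the green box at index 7. Rows 0 and 7 are terminal and never updated, so their entry in `policy` carries no meaning. The best value in each row halves with every extra step away from the goal because `gamma` is 0.5.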