## Reinforcement Learning

<img align="left" src="https://drive.google.com/uc?export=view&id=1sXQGiTLfPJi0S0rwPjSQazChNbAp38Ek"     style=" width:1000px; padding: 10px; " >

- Reinforcement learning is a subfield of machine learning that focuses on developing algorithms that enable an agent to learn how to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties, and its goal is to learn how to take actions that maximize its cumulative rewards over time.


- In reinforcement learning, the agent interacts with the environment in discrete time steps. At each time step, the agent observes the current state of the environment and selects an action to take based on a policy, which is a function that maps states to actions. The environment then transitions to a new state and provides the agent with a reward signal, which is a scalar value that indicates how well the agent is doing at the current time step. The agent then updates its policy based on the observed state, action, and reward, and continues to interact with the environment.


- The goal of the agent is to learn a policy that maximizes its expected cumulative reward over time. This requires balancing exploitation (choosing actions already known to yield high reward) with exploration (trying new actions that may lead to higher long-term reward). Algorithms such as Q-learning, policy gradient methods, and actor-critic methods learn a policy while managing this trade-off.


- Reinforcement learning has many practical applications, including robotics, game playing, recommendation systems, and autonomous vehicles.

- Agent: The learner and decision maker; it learns by trial and error
- Environment: The world the agent moves in and interacts with
- Action: A step the agent can take; the action set contains all possible steps
- State: The current condition of the environment
- Reward: An appraisal of the last action
- Policy: The rule the agent uses to choose the next action from the current state
- Value: Expected long-term return from a state, as opposed to the short-term reward
- Action-value: Like value, but it also takes the current action as a parameter
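
The interaction loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names that do not appear elsewhere in this notebook, and the policy is fixed (no learning update is performed).

```python
import random

class CorridorEnv:
    """Hypothetical environment: states 0..4, reward 1.0 on reaching state 4."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(4, max(0, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([-1, 1])

def run_episode(env, policy, rng, max_steps=100):
    """Observe state, act, receive reward, repeat; return the cumulative reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

rng = random.Random(0)
episode_return = run_episode(CorridorEnv(), random_policy, rng)
print(episode_return)
```

A learning agent would additionally adjust its policy after each `step` call, using the observed state, action, and reward.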

## Markov Decision Process

<img align="left" src="https://drive.google.com/uc?export=view&id=1gLgtvw0PN5t-PMbcDqrz7goftBWU08KY"     style=" width:1000px; padding: 10px; " >

- The mathematical framework used to formulate a reinforcement learning problem is the Markov decision process (MDP)
- An MDP solution is described with the following parameters:
    - Set of actions, A
    - Set of states, S
    - Reward, R
    - Policy, pi
    - Value, V
 
- Goal: Find the path from A to D with the minimum total cost
    - States are denoted by nodes
    - Actions are denoted by edges
    - Reward is the cost of an edge
    - Policy is the path taken to reach the destination
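
As a rough sketch of this mapping, the graph can be stored as a dictionary and solved by repeatedly applying the rule "cost-to-go of a state = cheapest edge cost plus the cost-to-go of the next state". The edge costs below are hypothetical (the figure's actual weights are not reproduced here), and `min_cost_policy` and `follow` are illustrative names. The sketch also assumes every state can reach the goal.

```python
# States are nodes, actions are outgoing edges, and each edge carries a cost.
# Hypothetical costs, chosen only for illustration.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 6},
    "C": {"D": 1},
    "D": {},
}

def min_cost_policy(graph, goal):
    """Return (cost_to_go, policy) where policy[state] is the best next node."""
    cost_to_go = {s: (0.0 if s == goal else float("inf")) for s in graph}
    # len(graph) - 1 sweeps are enough when there are no negative cycles
    for _ in range(len(graph) - 1):
        for state, edges in graph.items():
            for nxt, cost in edges.items():
                cost_to_go[state] = min(cost_to_go[state], cost + cost_to_go[nxt])
    policy = {}
    for state, edges in graph.items():
        if state != goal and edges:
            policy[state] = min(edges, key=lambda n: edges[n] + cost_to_go[n])
    return cost_to_go, policy

def follow(policy, start, goal):
    """Walk the policy from start until the goal is reached."""
    path = [start]
    while path[-1] != goal:
        path.append(policy[path[-1]])
    return path

cost_to_go, policy = min_cost_policy(graph, "D")
print(follow(policy, "A", "D"), cost_to_go["A"])
```

With these assumed costs the cheapest route is A, B, C, D with a total cost of 4, even though it uses more edges than A, C, D.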


## Q-Learning

<img align="right" src="https://drive.google.com/uc?export=view&id=1xEyQSD5XBtECR6arVjporNzX_OPM7M06"     style=" width:900px; padding: 10px; " >
<img align="right" src="https://drive.google.com/uc?export=view&id=16kukmANov90PTpKJUMwwMc4o3WcfKgFs"     style=" width:900px; padding: 10px; " >

- Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
- Gamma is the discount factor, with 0 <= Gamma < 1
  - If Gamma is close to zero, the agent considers mostly immediate rewards
  - If Gamma is close to one, the agent gives more weight to future rewards
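
How Gamma weights immediate versus later rewards can be seen by discounting a fixed reward sequence. The sketch below is illustrative only: `discounted_return` and the reward sequence are assumptions, not part of the notebook's code.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence (t starts at 0)."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Hypothetical sequence: the only payoff arrives two steps in the future
rewards = [0, 0, 100]
for gamma in (0.1, 0.5, 0.9):
    print(gamma, discounted_return(rewards, gamma))
```

The delayed reward of 100 is worth roughly 1 when Gamma is 0.1, 25 when Gamma is 0.5, and 81 when Gamma is 0.9, so a larger Gamma makes distant rewards count for more.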

## Q-Learning Example
- Set gamma = 0.8
- Initialize the Q matrix to all zeros
- From node 1, the agent can go to either node 3 or node 5; let's select 5
- At node 5, find the maximum Q value over the actions available from that next state
- Update Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)]
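
The first update in this example can be worked through by hand or with a few lines of plain Python. The sketch below assumes the reward matrix `R` defined later in this notebook, copied here as nested lists so the snippet stands alone. `GAMMA`, `valid_next`, and `best_next` are illustrative names.

```python
GAMMA = 0.8

# Reward matrix from the notebook: -1 = no edge, 0 = valid move, 100 = reaches goal
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
Q = [[0.0] * 6 for _ in range(6)]  # Q starts at zero

state, action = 1, 5  # move from node 1 to node 5

# Actions available from the next state (node 5): every column with R >= 0
valid_next = [a for a, r in enumerate(R[action]) if r >= 0]

# Best Q value reachable from node 5; still 0 because Q is untrained
best_next = max(Q[action][a] for a in valid_next)

Q[state][action] = R[state][action] + GAMMA * best_next
print(valid_next, Q[state][action])
```

Because every entry of Q is still zero, the update reduces to Q(1, 5) = 100 + 0.8 * 0 = 100.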

<img align="left" src="https://drive.google.com/uc?export=view&id=1zXSq-VIviv2qAbwI5HMef3nrT7krINFJ"     style=" width:1000px; padding: 10px; " >


### R Matrix: State diagram and instant reward matrix
<img align="left" src="https://drive.google.com/uc?export=view&id=1xEyQSD5XBtECR6arVjporNzX_OPM7M06"     style=" width:1000px; padding: 10px; " >

In [1]:
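import numpy as np

# R[s, a] is the instant reward for moving from state s to state a:
# -1 means there is no edge, 0 is a valid move, 100 is a move into the goal (state 5)
R = np.matrix([
    [-1, -1, -1, -1, 0, -1],
    [-1, -1, -1, 0, -1, 100],
    [-1, -1, -1, 0, -1, -1],
    [-1, 0, 0, -1, 0, -1],
    [ 0, -1, -1, 0, -1, 100],
    [-1, 0, -1, -1, 0, 100]
])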

In [2]:
Q = np.matrix(np.zeros([6,6]))
gamma = 0.8
initial_state = 1
Q

matrix([[0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0.]])

In [3]:
def available_actions(state):
    current_state_row = R[state,]
    av_act = np.where(current_state_row >= 0)[1]
    return av_act

In [4]:
available_act = available_actions(initial_state)
available_act

array([3, 5])

In [9]:
def sample_next_action(available_actions_range):
    # Pick one of the available actions uniformly at random
    next_action = int(np.random.choice(available_actions_range))
    return next_action

action = sample_next_action(available_act)
action

3
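
The notebook samples the next action uniformly at random, which is pure exploration. One common alternative, not used in this notebook, is an epsilon-greedy rule that exploits the current Q values most of the time. The sketch below is a hedged illustration: `epsilon_greedy_action` and its parameters are hypothetical names, and it works on a plain list rather than the NumPy matrix used above.

```python
import random

def epsilon_greedy_action(q_row, available, epsilon, rng=random):
    """Pick a random available action with probability epsilon, else a best one.

    q_row     -- Q values for the current state, indexed by action
    available -- actions that are valid in the current state
    """
    available = list(available)
    if rng.random() < epsilon:
        return rng.choice(available)  # explore
    best = max(q_row[a] for a in available)
    # exploit; ties between equally good actions are broken at random
    return rng.choice([a for a in available if q_row[a] == best])

q_row = [0, 0, 0, 320, 0, 500]  # example Q values for one state
print(epsilon_greedy_action(q_row, [3, 5], epsilon=0.0))
```

With `epsilon=0.0` the rule is fully greedy, and with `epsilon=1.0` it matches the uniform sampling used in `sample_next_action`.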

In [7]:
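def update(current_state, action, gamma):
    # Taking `action` makes it the next state; find its best Q value
    max_index = np.where(Q[action,] == np.max(Q[action,]))[1]

    # Break ties between equally good next actions at random
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])

    max_value = Q[action, max_index]

    # Q-learning update: instant reward plus discounted best future value
    Q[current_state, action] = R[current_state, action] + gamma * max_value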

In [None]:
for i in range(10000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_act = available_actions(current_state)
    action = sample_next_action(available_act)
    update(current_state, action, gamma)

print('Trained Q Matrix')
print(Q)
print("\n\n")

print(Q/np.max(Q)*100) # normalized

Trained Q Matrix
[[  0.   0.   0.   0. 400.   0.]
 [  0.   0.   0. 320.   0. 500.]
 [  0.   0.   0. 320.   0.   0.]
 [  0. 400. 256.   0. 400.   0.]
 [320.   0.   0. 320.   0. 500.]
 [  0. 400.   0.   0. 400. 500.]]



[[  0.    0.    0.    0.   80.    0. ]
 [  0.    0.    0.   64.    0.  100. ]
 [  0.    0.    0.   64.    0.    0. ]
 [  0.   80.   51.2   0.   80.    0. ]
 [ 64.    0.    0.   64.    0.  100. ]
 [  0.   80.    0.    0.   80.  100. ]]
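
One way to sanity-check a trained Q matrix is to confirm that it satisfies the update rule at every valid state-action pair. The sketch below copies the printed `R` and trained `Q` values into plain lists so it runs on its own. `bellman_residual` is an illustrative helper, not part of the notebook.

```python
GAMMA = 0.8

R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
# Trained Q values as printed above
Q = [
    [  0,   0,   0,   0, 400,   0],
    [  0,   0,   0, 320,   0, 500],
    [  0,   0,   0, 320,   0,   0],
    [  0, 400, 256,   0, 400,   0],
    [320,   0,   0, 320,   0, 500],
    [  0, 400,   0,   0, 400, 500],
]

def bellman_residual(R, Q, gamma):
    """Largest gap between Q[s][a] and R[s][a] + gamma * max(Q[a]) over valid moves."""
    worst = 0.0
    for s in range(len(R)):
        for a in range(len(R[s])):
            if R[s][a] < 0:  # no edge from s to a
                continue
            target = R[s][a] + gamma * max(Q[a])
            worst = max(worst, abs(Q[s][a] - target))
    return worst

residual = bellman_residual(R, Q, GAMMA)
print(residual)
```

A residual close to zero indicates that further updates would no longer change the table, which is the case for the values printed above.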


In [None]:
R

matrix([[ -1,  -1,  -1,  -1,   0,  -1],
        [ -1,  -1,  -1,   0,  -1, 100],
        [ -1,  -1,  -1,   0,  -1,  -1],
        [ -1,   0,   0,  -1,   0,  -1],
        [  0,  -1,  -1,   0,  -1, 100],
        [ -1,   0,  -1,  -1,   0, 100]])

In [None]:
current_state = 0
steps = [current_state]
print(steps)
while current_state != 5:
    # Greedy step: move to the action with the highest Q value
    next_step_index = np.where(Q[current_state,] == np.max(Q[current_state,]))[1]
    if next_step_index.shape[0] > 1:
        # Break ties at random
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])

    steps.append(next_step_index)
    current_state = next_step_index

print('Selected path: ')
print(steps)

[0]
Selected path: 
[0, 4, 5]
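
The same greedy walk can be wrapped in a function so that paths from other start states are easy to inspect. This is a sketch under assumptions: `greedy_path` is a hypothetical helper, it uses the trained Q values printed earlier copied into plain lists, and it breaks ties by taking the lowest action index rather than at random. The `max_steps` guard is an addition to avoid looping forever on an untrained table.

```python
# Trained Q values as printed earlier in the notebook
Q = [
    [  0,   0,   0,   0, 400,   0],
    [  0,   0,   0, 320,   0, 500],
    [  0,   0,   0, 320,   0,   0],
    [  0, 400, 256,   0, 400,   0],
    [320,   0,   0, 320,   0, 500],
    [  0, 400,   0,   0, 400, 500],
]

def greedy_path(Q, start, goal, max_steps=50):
    """Follow the highest Q value from start until goal (lowest index wins ties)."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        row = Q[state]
        state = row.index(max(row))  # first index holding the maximum
        path.append(state)
    return path

print(greedy_path(Q, 0, 5))
print(greedy_path(Q, 2, 5))
```

From state 0 this reproduces the path `[0, 4, 5]` shown above. From state 2 it goes through state 3, where two actions tie at 400, and the lowest-index tie-break takes state 1 before reaching the goal.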
