# Introduction to RL

This notebook introduces core reinforcement learning (RL) concepts. RL problems are commonly formalized as a Markov Decision Process (MDP), a framework for problems in which an agent makes sequential decisions and wants to choose the ones that maximize a goal. In this notebook we implement a simple MDP and use the Value Iteration algorithm to compute the optimal value function, from which the best action for our agent in each state follows.
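For reference, the update that Value Iteration applies can be written as the Bellman optimality equation. The notation below is an assumption chosen to match the code in this notebook: $P(s' \mid s, a)$ corresponds to `transition_probabilities`, $R(s, a)$ to `rewards`, and $\gamma$ to `gamma`.

```latex
% Fixed point satisfied by the optimal value function
V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a) + \gamma\, V^{*}(s') \bigr]

% Update applied repeatedly to every state until the largest change falls below epsilon
V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a) + \gamma\, V(s') \bigr]
```

The implementation further down updates `V` in place, so later states in a sweep already see the new values of earlier states; this still converges to the same fixed point.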

In [None]:
import numpy as np

# Define the MDP

In [19]:

# Define the MDP
states = [0, 1, 2]  # States
actions = [0, 1]  # Actions: 0 - left, 1 - right
transition_probabilities = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},  # From state 0: action 0 -> state 0, action 1 -> state 1
    1: {0: [(1.0, 0)], 1: [(1.0, 2)]},  # From state 1: action 0 -> state 0, action 1 -> state 2
    2: {0: [(1.0, 2)], 1: [(1.0, 2)]}   # From state 2: action 0 -> state 2, action 1 -> state 2
}
rewards = {
    0: {0: -1, 1: 0},  # From state 0: action 0 gives -1, action 1 gives 0
    1: {0: 0, 1: 1},   # From state 1: action 0 gives 0, action 1 gives 1
    2: {0: 0, 1: 0}    # From state 2: both actions give 0
}
gamma = 0.9  # Discount factor
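Because the MDP is written out by hand as nested dictionaries, a quick consistency check can catch typos before solving it. The helper below is a minimal sketch and not part of the original notebook; the name `validate_mdp` is hypothetical, and it assumes the same dictionary layout as the cell above.

```python
import math


def validate_mdp(states, actions, transition_probabilities, rewards):
    """Raise ValueError if the hand-written MDP tables are inconsistent.

    Hypothetical helper: assumes the nested-dict layout used in this notebook.
    """
    for state in states:
        for action in actions:
            outcomes = transition_probabilities[state][action]
            # Outgoing probabilities for each (state, action) pair must sum to 1.
            total = sum(p for p, _ in outcomes)
            if not math.isclose(total, 1.0):
                raise ValueError(
                    f"probabilities for state {state}, action {action} sum to {total}"
                )
            # Every successor must be a known state.
            for _, next_state in outcomes:
                if next_state not in states:
                    raise ValueError(f"unknown next state {next_state}")
            # Every (state, action) pair needs a reward entry.
            if action not in rewards[state]:
                raise ValueError(f"missing reward for state {state}, action {action}")
```

Calling `validate_mdp(states, actions, transition_probabilities, rewards)` on the definitions above should return without raising.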


# Solve MDP (using Value Iteration)

In [20]:
# Value Iteration to find the optimal value function
def value_iteration(states, actions, transition_probabilities, rewards, gamma, epsilon=1e-6):
    V = np.zeros(len(states))  # Initialize value function
    while True:
        delta = 0  # To track the change in value function
        for state in states:
            v = V[state]
            max_value = float('-inf')
            for action in actions:
                value = sum(p * (rewards[state][action] + gamma * V[next_state]) 
                            for p, next_state in transition_probabilities[state][action])
                max_value = max(max_value, value)
            V[state] = max_value
            delta = max(delta, abs(v - V[state]))
        if delta < epsilon:
            break
    return V

In [23]:
# Compute the value function
V_optimal = value_iteration(states, actions, transition_probabilities, rewards, gamma)
print("Optimal Value Function:", V_optimal)


Optimal Value Function: [0.9 1.  0. ]
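The value function alone does not say which action to take. A greedy policy can be read off from it by picking, in each state, the action with the highest one-step lookahead value. The function below is a sketch of that step; `extract_policy` is a hypothetical name that does not appear in the original notebook, and ties are broken in favor of the first action listed.

```python
def extract_policy(states, actions, transition_probabilities, rewards, gamma, V):
    """Return a dict mapping each state to its greedy action under V."""
    policy = {}
    for state in states:
        # One-step lookahead value of every action in this state.
        q_values = {
            action: sum(
                p * (rewards[state][action] + gamma * V[next_state])
                for p, next_state in transition_probabilities[state][action]
            )
            for action in actions
        }
        # max() returns the first maximal key, so ties go to the earlier action.
        policy[state] = max(q_values, key=q_values.get)
    return policy
```

With the MDP defined above, `extract_policy(states, actions, transition_probabilities, rewards, gamma, V_optimal)` should select action 1 (right) in states 0 and 1. In state 2 both actions are equivalent, so the tie-break returns action 0.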


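# Note: The value function, defined by the Bellman equation, tells us how good it is to be in a particular state. States closer to the highest reward (usually the goal) have greater value: here state 1 (value 1.0) is one step from the rewarded transition and state 0 (value 0.9) is two steps away, while state 2 has value 0 because no further reward can be collected from it.
# Displaying the value of each state side by side makes this clear.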
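One simple way to display the values per state, as the note above suggests, is a small text table. This is only an illustrative sketch: `format_value_table` is a hypothetical helper, and the values passed to it are the ones printed by the notebook above.

```python
def format_value_table(values):
    """Format one value per state as an aligned two-column text table."""
    lines = ["state  value"]
    for state, value in enumerate(values):
        # Right-align the state index and show the value with two decimals.
        lines.append(f"{state:>5}  {value:>5.2f}")
    return "\n".join(lines)


# Values taken from the "Optimal Value Function" output above.
print(format_value_table([0.9, 1.0, 0.0]))
```

Passing `V_optimal` directly works as well, since the function only iterates over the values and formats them as floats.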