
**Markov Decision Process (MDP) Class**

In [1]:
import numpy as np

class MDP:
    def __init__(self, states, actions, transition_probs, rewards, discount_factor):
        self.states = states
        self.actions = actions
        self.transition_probs = transition_probs  # transition_probs[s][a]: probabilities P(s'|s,a) over next states
        self.rewards = rewards                    # rewards[s][a]: immediate reward R(s,a)
        self.discount_factor = discount_factor    # Discount factor (gamma)

    def step(self, state, action):
        """Samples the next state from P(.|state, action) and returns it with the reward R(state, action)."""
        next_state = np.random.choice(self.states, p=self.transition_probs[state][action])
        reward = self.rewards[state][action]
        return next_state, reward

    def simulate(self, start_state, policy, steps=100):
        """Runs the MDP for `steps` steps under a policy and returns the undiscounted total reward."""
        state = start_state
        total_reward = 0
        for step in range(steps):
            action = policy[state]
            next_state, reward = self.step(state, action)
            total_reward += reward
            print(f"Step {step}: State = {state}, Action = {action}, Reward = {reward}, Next State = {next_state}")
            state = next_state
        return total_reward


**Value Iteration Algorithm**

In [2]:
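def value_iteration(mdp, threshold=1e-6):
    """Runs value iteration until the largest change in V over a sweep is below threshold.

    Assumes states and actions are the integers 0..n-1, since they index NumPy arrays.
    """
    V = np.zeros(len(mdp.states))  # Initialize value function to 0
    policy = np.zeros(len(mdp.states), dtype=int)  # Initialize policy to action 0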

    while True:
        delta = 0
        for state in mdp.states:
            q_values = np.zeros(len(mdp.actions))
            for action in mdp.actions:
                q_values[action] = sum(
                    mdp.transition_probs[state][action][next_state] *
                    (mdp.rewards[state][action] + mdp.discount_factor * V[next_state])
                    for next_state in mdp.states
                )
            best_action_value = np.max(q_values)
            delta = max(delta, np.abs(best_action_value - V[state]))
            V[state] = best_action_value
            policy[state] = np.argmax(q_values)  # Update policy

        # If the value function change is smaller than threshold, stop
        if delta < threshold:
            break

    return V, policy


**Example Usage**

In [3]:
if __name__ == "__main__":
    # Define MDP components: states, actions, transition probabilities, rewards, discount factor
    states = [0, 1, 2]
    actions = [0, 1]  # Action 0 or 1
    transition_probs = {
        0: {0: [0.7, 0.2, 0.1], 1: [0.1, 0.8, 0.1]},
        1: {0: [0.4, 0.5, 0.1], 1: [0.2, 0.7, 0.1]},
        2: {0: [0.1, 0.3, 0.6], 1: [0.05, 0.25, 0.7]}
    }
    rewards = {
        0: {0: 5, 1: 10},
        1: {0: -1, 1: 2},
        2: {0: 0, 1: 3}
    }
    discount_factor = 0.9

    # Initialize MDP and run value iteration
    mdp = MDP(states, actions, transition_probs, rewards, discount_factor)
    value_function, optimal_policy = value_iteration(mdp)

    print("Optimal Value Function:", value_function)
    print("Optimal Policy:", optimal_policy)

    # Simulate the MDP starting from state 0 using the optimal policy
    total_reward = mdp.simulate(0, optimal_policy, steps=10)
    print(f"Total reward after simulation: {total_reward}")

Optimal Value Function: [40.56840187 33.22895248 33.2488969 ]
Optimal Policy: [1 1 1]
Step 0: State = 0, Action = 1, Reward = 10, Next State = 1
Step 1: State = 1, Action = 1, Reward = 2, Next State = 1
Step 2: State = 1, Action = 1, Reward = 2, Next State = 1
Step 3: State = 1, Action = 1, Reward = 2, Next State = 1
Step 4: State = 1, Action = 1, Reward = 2, Next State = 0
Step 5: State = 0, Action = 1, Reward = 10, Next State = 1
Step 6: State = 1, Action = 1, Reward = 2, Next State = 1
Step 7: State = 1, Action = 1, Reward = 2, Next State = 1
Step 8: State = 1, Action = 1, Reward = 2, Next State = 2
Step 9: State = 2, Action = 1, Reward = 3, Next State = 2
Total reward after simulation: 37


**MDP Class:**

Stores the components of the MDP: the states, the actions, the transition probabilities P(s'|s,a), the rewards R(s,a), and the discount factor gamma.

The step function takes a state and an action, samples the next state from the transition probabilities for that pair, and returns the next state together with the reward R(s,a).
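
Because step passes each probability vector straight to np.random.choice, a row that is not a valid distribution only causes an error when that state-action pair is first sampled. A minimal sketch of an up-front check follows; the helper name check_transition_probs and the tolerance are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def check_transition_probs(states, actions, transition_probs, tol=1e-8):
    """Hypothetical helper: return the (state, action) pairs whose rows are invalid."""
    bad_pairs = []
    for s in states:
        for a in actions:
            row = np.asarray(transition_probs[s][a], dtype=float)
            # A valid row has one entry per state, no negative entries, and sums to 1.
            valid = (
                row.shape == (len(states),)
                and np.all(row >= 0)
                and abs(row.sum() - 1.0) <= tol
            )
            if not valid:
                bad_pairs.append((s, a))
    return bad_pairs

# Transition probabilities from the example MDP in this notebook.
transition_probs = {
    0: {0: [0.7, 0.2, 0.1], 1: [0.1, 0.8, 0.1]},
    1: {0: [0.4, 0.5, 0.1], 1: [0.2, 0.7, 0.1]},
    2: {0: [0.1, 0.3, 0.6], 1: [0.05, 0.25, 0.7]},
}
print(check_transition_probs([0, 1, 2], [0, 1], transition_probs))
```

An empty list means every row passed the check.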

The simulate function runs the MDP for a fixed number of steps under a given policy, prints each transition, and returns the undiscounted sum of the rewards collected.
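
The total that simulate returns is a plain sum, whereas the value function weights the reward at step t by gamma**t, so the two numbers are not directly comparable. The sketch below computes the discounted counterpart for the rewards logged in the 10-step run; the helper name discounted_return is an assumption and does not appear in the notebook.

```python
def discounted_return(rewards, gamma):
    """Hypothetical helper: sum of gamma**t * r_t over a reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Rewards copied from the 10-step simulation output in this notebook.
logged_rewards = [10, 2, 2, 2, 2, 10, 2, 2, 2, 3]
print("Undiscounted total:", sum(logged_rewards))
print("Discounted return:", discounted_return(logged_rewards, 0.9))
```

A single 10-step rollout is one random sample, so its discounted return need not be close to V(0), which is an expectation over infinitely many steps.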

**Value Iteration Algorithm:**

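Repeatedly sweeps over the states. For each state-action pair it computes the expected immediate reward plus the discounted value of the next state, then sets V(s) to the maximum of these values over the actions.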
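
In symbols, the update applied to each state during a sweep is the Bellman optimality backup, and the policy records the maximizing action. The notation below maps P to transition_probs, R to rewards, and gamma to discount_factor:

```latex
V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a) + \gamma\, V(s') \bigr],
\qquad
\pi(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a) + \gamma\, V(s') \bigr]
```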

Once the largest change in V during a sweep falls below a small threshold (1e-6 by default), it stops and returns the converged value function, which approximates the optimal one, together with the corresponding greedy policy.
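
One way to sanity-check the returned values is to evaluate the returned policy exactly: for a fixed policy, V solves the linear system (I - gamma * P_pi) V = R_pi. The sketch below does this for the example MDP and the policy [1, 1, 1] reported in the output. The helper name evaluate_policy is an assumption, not part of the notebook, and it assumes states are the integers 0..n-1.

```python
import numpy as np

def evaluate_policy(transition_probs, rewards, policy, gamma):
    """Hypothetical helper: solve (I - gamma * P_pi) V = R_pi for a fixed policy."""
    n = len(policy)
    # Row s of P_pi is the next-state distribution under the action the policy picks in s.
    P_pi = np.array([transition_probs[s][policy[s]] for s in range(n)], dtype=float)
    # Entry s of R_pi is the immediate reward for that same action.
    R_pi = np.array([rewards[s][policy[s]] for s in range(n)], dtype=float)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Example MDP from this notebook.
transition_probs = {
    0: {0: [0.7, 0.2, 0.1], 1: [0.1, 0.8, 0.1]},
    1: {0: [0.4, 0.5, 0.1], 1: [0.2, 0.7, 0.1]},
    2: {0: [0.1, 0.3, 0.6], 1: [0.05, 0.25, 0.7]},
}
rewards = {0: {0: 5, 1: 10}, 1: {0: -1, 1: 2}, 2: {0: 0, 1: 3}}

V_exact = evaluate_policy(transition_probs, rewards, policy=[1, 1, 1], gamma=0.9)
print("Exact values for policy [1, 1, 1]:", V_exact)
```

If value iteration has converged, these values should agree with the printed Optimal Value Function to within roughly the stopping threshold.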