<a href="https://colab.research.google.com/github/SanjayS2348553/Reinforcement-Learning/blob/main/2348553_SANJAY_S_RL_Lab_08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import random

# Environment setup
class GridWorld:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 1  # Placeholder; reset() chooses the actual start state
        self.terminal_states = [1, n_states]

    def reset(self):
        self.state = random.randint(2, self.n_states - 1)  # Start from a random non-terminal state
        return self.state

    def step(self, action):
        """
        Action: -1 (left), +1 (right)
        Returns: next_state, reward, done
        """
        next_state = self.state + action
        if next_state <= 1:
            next_state = 1
            reward = -1  # Penalty for hitting left terminal state
            done = True
        elif next_state >= self.n_states:
            next_state = self.n_states
            reward = 1  # Reward for hitting right terminal state
            done = True
        else:
            reward = 0  # No reward for intermediate states
            done = False

        self.state = next_state
        return next_state, reward, done


# TD(0) Algorithm
def td_zero(env, gamma=0.9, alpha=0.1, episodes=100):
    V = np.zeros(env.n_states + 1)  # State values; index 0 is unused so states map to indices 1..n_states
    for episode in range(episodes):
        state = env.reset()
        done = False

        while not done:
            action = random.choice([-1, 1])  # Random policy
            next_state, reward, done = env.step(action)
            # TD(0) update rule
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state

    return V


# Main
if __name__ == "__main__":
    env = GridWorld(n_states=5)
    episodes = 500
    alpha = 0.1
    gamma = 0.9

    value_function = td_zero(env, gamma=gamma, alpha=alpha, episodes=episodes)

    print("State Value Function:")
    for state in range(1, env.n_states + 1):
        print(f"State {state}: {value_function[state]:.2f}")


State Value Function:
State 1: 0.00
State 2: -0.72
State 3: -0.25
State 4: 0.42
State 5: 0.00


Environment:

The agent operates in a 1D grid world with 5 states numbered from 1 to 5.
States 1 and 5 are terminal states with rewards -1 and +1, respectively.
Actions move the agent left (-1) or right (+1). A move that lands on a non-terminal state (2 to 4) yields a reward of 0, and each episode starts from a randomly chosen non-terminal state.
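
The transition rules described above can be sketched as a standalone function. This is only an illustration: `transition` is a hypothetical helper that is not part of the notebook, and it assumes the same 5-state layout and rewards as the `GridWorld` class.

```python
def transition(state, action, n_states=5):
    """Hypothetical helper: apply one left/right move and return
    (next_state, reward, done) under the grid-world rules."""
    next_state = state + action
    if next_state <= 1:
        # Reached the left terminal state
        return 1, -1, True
    if next_state >= n_states:
        # Reached the right terminal state
        return n_states, 1, True
    # Landed on a non-terminal state
    return next_state, 0, False


print(transition(2, -1))  # (1, -1, True)
print(transition(3, +1))  # (4, 0, False)
print(transition(4, +1))  # (5, 1, True)
```

Only the two moves that enter a terminal state carry a non-zero reward, so every episode ends with a single -1 or +1.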
TD(0) Update Rule:

The state-value function $V(s)$ is updated as follows:
$$V(s) \leftarrow V(s) + \alpha \left[ R + \gamma V(s') - V(s) \right]$$
Where:
$\alpha$: Learning rate
$\gamma$: Discount factor
$R$: Reward for the transition
$s'$: Next state
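
As a worked instance of the TD(0) update, the sketch below applies the rule once to a freshly initialised value table. The function name `td0_update` is an assumption made for illustration; the defaults `alpha=0.1` and `gamma=0.9` match the values used in the script.

```python
def td0_update(v_s, v_next, reward, alpha=0.1, gamma=0.9):
    """Hypothetical helper: return the updated V(s) after one TD(0) step."""
    td_target = reward + gamma * v_next  # one-step bootstrapped estimate of the return
    td_error = td_target - v_s           # how far the current estimate is from the target
    return v_s + alpha * td_error


# State 4 moves right into terminal state 5 (reward +1); all values start at 0.
print(td0_update(v_s=0.0, v_next=0.0, reward=1.0))   # 0.1

# State 2 moves left into terminal state 1 (reward -1); all values start at 0.
print(td0_update(v_s=0.0, v_next=0.0, reward=-1.0))  # -0.1
```

Each update moves the estimate a fraction `alpha` of the way toward the target, so a single transition shifts V(s) by at most 10% of the TD error with these settings.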
Random Policy:

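The agent picks left or right with equal probability at every step. TD(0) is therefore used here for policy evaluation: it estimates the value of each state under this fixed random policy rather than learning a better policy.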
Results:

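After training, the script prints the estimated value of each state. The terminal states 1 and 5 stay at 0 because they are never updated, and the estimates increase from state 2 to state 4 as the state gets closer to the +1 terminal state. No random seed is set, so the exact numbers vary from run to run.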
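
For comparison, the values that TD(0) is estimating can be computed exactly by solving the Bellman expectation equations for the equiprobable random policy. The sketch below is not part of the notebook: `exact_values` is a hypothetical helper, and it assumes the same 5-state layout, rewards, and `gamma=0.9` as the script.

```python
import numpy as np


def exact_values(n_states=5, gamma=0.9):
    """Hypothetical helper: solve V = r + gamma * P V for the non-terminal
    states under the 50/50 left/right policy (terminal values are 0)."""
    inner = list(range(2, n_states))            # non-terminal states 2..n_states-1
    idx = {s: i for i, s in enumerate(inner)}   # state -> row/column index
    P = np.zeros((len(inner), len(inner)))      # transitions between non-terminal states
    r = np.zeros(len(inner))                    # expected immediate reward per state
    for s in inner:
        for action in (-1, 1):
            nxt = s + action
            if nxt <= 1:
                r[idx[s]] += 0.5 * -1.0         # half the time: left terminal, reward -1
            elif nxt >= n_states:
                r[idx[s]] += 0.5 * 1.0          # half the time: right terminal, reward +1
            else:
                P[idx[s], idx[nxt]] += 0.5
    v = np.linalg.solve(np.eye(len(inner)) - gamma * P, r)
    return dict(zip(inner, v))


for state, value in exact_values().items():
    print(f"State {state}: {value:.2f}")
```

By symmetry the exact solution is V(2) = -0.5, V(3) = 0, and V(4) = 0.5. The estimates printed above (-0.72, -0.25, 0.42) have the right ordering but are offset from these values, which is expected with a constant step size of 0.1 because the estimates keep fluctuating with the most recent episodes.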