## **Thompson Sampling Exploration Algorithm**

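Thompson Sampling is a probabilistic approach to exploration. It maintains a posterior distribution over the expected reward of each action (typically a Beta distribution when rewards are binary). At each step, the algorithm draws one sample from every action's posterior and selects the action whose sample is highest. After the outcome is observed, the chosen action's posterior is updated, so uncertain actions are still tried occasionally while clearly better actions are chosen more and more often.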
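To make the sample-then-select loop concrete, here is a minimal sketch on a three-armed Bernoulli bandit, which is simpler than the Gym environment used in this notebook. The function name `run_bernoulli_thompson`, the arm probabilities, and the seed are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def run_bernoulli_thompson(true_probs, n_rounds=2000, seed=0):
    """Thompson Sampling on a Bernoulli bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    alpha_post = np.ones(n_arms)  # Beta(1, 1) prior: one pseudo-success per arm
    beta_post = np.ones(n_arms)   # ... and one pseudo-failure per arm
    pulls = np.zeros(n_arms, dtype=int)

    for _ in range(n_rounds):
        # Draw one sample per arm from its current Beta posterior
        samples = rng.beta(alpha_post, beta_post)
        arm = int(np.argmax(samples))

        # Observe a binary reward and update only the chosen arm
        if rng.random() < true_probs[arm]:
            alpha_post[arm] += 1
        else:
            beta_post[arm] += 1
        pulls[arm] += 1

    return pulls, alpha_post, beta_post

# Hypothetical arm success probabilities, chosen only for the demo
pulls, alpha_post, beta_post = run_bernoulli_thompson([0.2, 0.5, 0.8])
print("pulls per arm:", pulls)
print("posterior means:", alpha_post / (alpha_post + beta_post))
```

With these made-up probabilities, most pulls should concentrate on the last arm as its posterior sharpens, while the weaker arms are sampled only occasionally.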


**Imports**

In [3]:
import numpy as np
import gym

**Data Loading**

In [None]:
# Environment setup
env = gym.make('CartPole-v1')

# Hyperparameters
alpha = 1  # Prior pseudo-count of successes (Beta(1, 1) is a uniform prior)
beta = 1  # Prior pseudo-count of failures
gamma = 0.99  # Discount factor (defined here, but not used by the update loop in this notebook)
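
The `alpha` and `beta` values act as pseudo-counts: with a Beta prior and binary outcomes, the posterior is again a Beta distribution whose parameters are the prior values plus the observed success and failure counts. A minimal sketch of that closed-form update follows; the helper name `beta_posterior` and the example counts are hypothetical.

```python
def beta_posterior(prior_alpha, prior_beta, n_success, n_failure):
    """Return the Beta posterior parameters with their mean and variance."""
    a = prior_alpha + n_success
    b = prior_beta + n_failure
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, variance

# Uniform Beta(1, 1) prior, then 3 successes and 1 failure (made-up counts)
a, b, mean, variance = beta_posterior(1, 1, 3, 1)
print(a, b, round(mean, 3), round(variance, 4))
```

Larger prior values make the posterior slower to move away from the prior mean, which reduces early exploration noise at the cost of slower adaptation.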

**Model Building**

In [None]:
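# Beta posterior parameters for every (discretized state, action) pair.
# CartPole observations are 4 continuous values, so each dimension is binned
# before it is used as a table index. The bin count and clipping ranges below
# are illustrative assumptions, not tuned values.
n_bins = 6
state_low = np.array([-2.4, -3.0, -0.21, -3.0])
state_high = np.array([2.4, 3.0, 0.21, 3.0])
bin_edges = [np.linspace(lo, hi, n_bins + 1)[1:-1] for lo, hi in zip(state_low, state_high)]

table_shape = (n_bins,) * env.observation_space.shape[0] + (env.action_space.n,)
successes = np.ones(table_shape) * alpha
failures = np.ones(table_shape) * beta

def discretize(observation):
    # Map a continuous observation to a tuple of bin indices (out-of-range values fall in the end bins)
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(observation, bin_edges))

def thompson_sampling_policy(state):
    # Draw one sample per action from its Beta posterior and pick the largest sample
    samples = np.random.beta(successes[state], failures[state])
    return int(np.argmax(samples))

def thompson_sampling_q_learning(env, n_episodes=1000):
    # Note: despite the name, only the Beta posteriors are updated; no Q-values are learned
    for episode in range(n_episodes):
        reset_result = env.reset()
        # gym >= 0.26 returns (observation, info); older versions return only the observation
        observation = reset_result[0] if isinstance(reset_result, tuple) else reset_result
        state = discretize(observation)
        done = False
        while not done:
            action = thompson_sampling_policy(state)  # Select action based on Thompson Sampling
            step_result = env.step(action)
            next_observation, reward, terminated = step_result[0], step_result[1], step_result[2]
            # gym >= 0.26 also returns a separate `truncated` flag; older versions return a single done flag
            done = bool(terminated or (len(step_result) == 5 and step_result[3]))

            # CartPole returns +1 on every step, so `reward > 0` alone never records a failure.
            # Here a step that ends the episode is counted as a failure instead.
            if reward > 0 and not terminated:
                successes[state + (action,)] += 1
            else:
                failures[state + (action,)] += 1

            state = discretize(next_observation)

thompson_sampling_q_learning(env)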
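
After training, the success and failure tables can be summarized to see what was learned. The sketch below is self-contained and runs on small made-up tables. The helper name `summarize_posteriors` is hypothetical, and the default `prior_total=2.0` assumes the `alpha = 1`, `beta = 1` prior used in this notebook.

```python
import numpy as np

def summarize_posteriors(success_counts, failure_counts, prior_total=2.0):
    """Posterior means, greedy actions, and a mask of states visited at least once.

    The last axis of both arrays is assumed to index actions.
    """
    mean = success_counts / (success_counts + failure_counts)
    greedy_actions = np.argmax(mean, axis=-1)
    # Subtract the prior pseudo-counts to recover the number of real updates
    visits = success_counts + failure_counts - prior_total
    visited_states = visits.sum(axis=-1) > 0
    return mean, greedy_actions, visited_states

# Toy tables for 3 states and 2 actions (made-up counts, Beta(1, 1) prior)
demo_successes = np.array([[1.0, 1.0], [5.0, 2.0], [1.0, 4.0]])
demo_failures = np.array([[1.0, 1.0], [2.0, 6.0], [3.0, 1.0]])
mean, greedy, visited = summarize_posteriors(demo_successes, demo_failures)
print(mean)
print(greedy, visited)
```

Passing the notebook's own `successes` and `failures` arrays in place of the toy tables should work the same way, since the function only relies on the last axis being the action axis.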