## 2.1

_Given is a six-armed bandit, as introduced in the lecture._

_The first arm shall sample its reward uniformly from the interval [1, 3)._

_The second arm shall sample its reward uniformly from [-3, 8)._

_The third arm shall sample its reward uniformly from the interval [2, 5)._

_The fourth arm shall sample its reward uniformly from [-2, 6)._

_The fifth arm shall sample its reward uniformly from [3, 4)._

_The sixth arm shall sample its reward uniformly from [-2, 2)._

_What is the expected reward when actions are chosen uniformly?_


**Answer:**
When each of the six arms is chosen with probability 1/6, the expected reward is the average of the six arms' expected rewards.

A uniform distribution on [a, b) has mean (a + b) / 2, so the expected reward of each arm is:

- Arm 1: E1 = (1 + 3) / 2 = 2
- Arm 2: E2 = (-3 + 8) / 2 = 2.5
- Arm 3: E3 = (2 + 5) / 2 = 3.5
- Arm 4: E4 = (-2 + 6) / 2 = 2
- Arm 5: E5 = (3 + 4) / 2 = 3.5
- Arm 6: E6 = (-2 + 2) / 2 = 0

Now, we calculate the expected reward when actions are chosen uniformly:

Expected Reward = (E1 + E2 + E3 + E4 + E5 + E6) / 6

Substituting the values:

Expected Reward = (2 + 2.5 + 3.5 + 2 + 3.5 + 0) / 6
                = 13.5 / 6
                = 2.25

So the expected reward when actions are chosen uniformly from the six arms of the bandit is exactly 2.25.
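
As a quick cross-check of the arithmetic above, the same calculation can be written as a short script. This is only a sketch: the names `ARM_BOUNDS` and `expected_uniform_policy_reward` are introduced here for illustration and do not appear elsewhere in the notebook.

```python
# Bounds [low, high) of the six arms, copied from the exercise text.
ARM_BOUNDS = [(1, 3), (-3, 8), (2, 5), (-2, 6), (3, 4), (-2, 2)]


def expected_uniform_policy_reward(bounds):
    """Average of the per-arm expectations (low + high) / 2."""
    arm_means = [(low + high) / 2 for low, high in bounds]
    return sum(arm_means) / len(arm_means)


print(expected_uniform_policy_reward(ARM_BOUNDS))  # 2.25
```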

## 2.2
_Implement the six-armed bandit from 2.1) and compute the sample average reward for 10 uniformly chosen actions!_

_Compare this to your expectation from 2.1)!_

In [22]:
import random

class Bandit:
    def __init__(self):
        self.arms = [
            {'min': 1, 'max': 3},  # Arm 1
            {'min': -3, 'max': 8},  # Arm 2
            {'min': 2, 'max': 5},  # Arm 3
            {'min': -2, 'max': 6},  # Arm 4
            {'min': 3, 'max': 4},  # Arm 5
            {'min': -2, 'max': 2}   # Arm 6
        ]

    def pull_arm(self, arm_index):
        arm = self.arms[arm_index]
        return random.uniform(arm['min'], arm['max'])

# Function to perform n uniformly chosen actions and compute the sample average reward
def sample_average_bandit(n):
    bandit = Bandit()
    total_reward = 0

    for _ in range(n):
        arm_index = random.randrange(len(bandit.arms))  # Uniformly choose an arm index (0 to 5)
        reward = bandit.pull_arm(arm_index)
        total_reward += reward

    sample_avg_reward = total_reward / n
    return sample_avg_reward

# Compute the sample average reward for 10 uniformly chosen actions
sample_avg_reward = sample_average_bandit(10)
print("Sample average reward for 10 uniformly chosen actions:", sample_avg_reward)
expected_reward = 2.25  # Analytical result from 2.1
print(f"Difference from the expected reward of {expected_reward}: {sample_avg_reward - expected_reward:+.2f}")

Sample average reward for 10 uniformly chosen actions: 2.6118713044264634
Difference from the expected reward of 2.25: +0.36
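
With only 10 samples, a gap between the sample average and 2.25 is expected. The sketch below estimates how large that gap typically is by computing the standard deviation of a single reward via the law of total variance, and then shows the sample average for growing n. The helper names and the fixed seed are assumptions made for this illustration, not part of the original solution.

```python
import math
import random

ARM_BOUNDS = [(1, 3), (-3, 8), (2, 5), (-2, 6), (3, 4), (-2, 2)]


def reward_std(bounds):
    """Standard deviation of one reward when the arm is chosen uniformly."""
    means = [(lo + hi) / 2 for lo, hi in bounds]
    variances = [(hi - lo) ** 2 / 12 for lo, hi in bounds]
    overall_mean = sum(means) / len(means)
    # Law of total variance: mean within-arm variance + variance of arm means.
    within = sum(variances) / len(variances)
    between = sum((m - overall_mean) ** 2 for m in means) / len(means)
    return math.sqrt(within + between)


def sample_average(n, rng):
    """Average reward of n pulls with uniformly chosen arms."""
    total = 0.0
    for _ in range(n):
        lo, hi = rng.choice(ARM_BOUNDS)
        total += rng.uniform(lo, hi)
    return total / n


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
sigma = reward_std(ARM_BOUNDS)
for n in (10, 1_000, 100_000):
    print(f"n={n:>6}: sample average {sample_average(n, rng):.3f}, "
          f"standard error {sigma / math.sqrt(n):.3f}")
```

The standard deviation of a single reward comes out at about 2.09, so the standard error for n = 10 is roughly 0.66. A result of 2.61 is therefore well within one standard error of 2.25, and a rerun could just as easily land below the expectation.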


## 2.3

_Initialize Q(ai)=0 and choose 4000 actions according to an ε-greedy selection strategy (ε=0.1)!_

_Update your action values by computing the sample average reward of each action recursively!_

_For every 100 actions show the percentage of choosing arm 1, arm 2, arm 3, arm 4, arm 5, and arm 6 as well as the resulting average reward!_

In [19]:
import random

class Bandit:
    def __init__(self):
        self.arms = [
            {'min': 1, 'max': 3},  # Arm 1
            {'min': -3, 'max': 8},  # Arm 2
            {'min': 2, 'max': 5},  # Arm 3
            {'min': -2, 'max': 6},  # Arm 4
            {'min': 3, 'max': 4},  # Arm 5
            {'min': -2, 'max': 2}   # Arm 6
        ]
        self.Q = [0] * len(self.arms)  # Initialize action values to 0
        self.N = [0] * len(self.arms)  # Initialize action counts to 0
        self.total_reward = 0

    def pull_arm(self, arm_index):
        arm = self.arms[arm_index]
        return random.uniform(arm['min'], arm['max'])

    def update_action_value(self, arm_index, reward):
        self.N[arm_index] += 1
        self.total_reward += reward
        self.Q[arm_index] += (reward - self.Q[arm_index]) / self.N[arm_index]

# Function to perform n actions using ε-greedy selection strategy
def epsilon_greedy_bandit(bandit, n, epsilon):
    arm_percentages = [0] * len(bandit.arms)
    avg_reward_history = []

    for i in range(1, n + 1):
        if random.random() < epsilon:  # Exploration: Choose a random arm
            arm_index = random.randint(0, len(bandit.arms) - 1)
        else:  # Exploitation: Choose the arm with the highest action value
            arm_index = bandit.Q.index(max(bandit.Q))

        reward = bandit.pull_arm(arm_index)
        bandit.update_action_value(arm_index, reward)

        # Update arm percentages
        total_pulls = sum(bandit.N)
        arm_percentages = [count / total_pulls for count in bandit.N]

        # Record average reward every 100 actions
        if i % 100 == 0:
            avg_reward = bandit.total_reward / i
            avg_reward_history.append(avg_reward)
            print(f"After {i} actions:")
            for j, percentage in enumerate(arm_percentages):
                print(f"Arm {j + 1}: {percentage:.2%}")
            print(f"Average Reward: {avg_reward:.2f}")
            print()

    return avg_reward_history

# Initialize the bandit
bandit = Bandit()

# Perform 4000 actions using ε-greedy strategy with ε=0.1
epsilon = 0.1
total_actions = 4000
avg_reward_history = epsilon_greedy_bandit(bandit, total_actions, epsilon)

After 100 actions:
Arm 1: 11.00%
Arm 2: 0.00%
Arm 3: 80.00%
Arm 4: 6.00%
Arm 5: 0.00%
Arm 6: 3.00%
Average Reward: 2.95

After 200 actions:
Arm 1: 7.00%
Arm 2: 0.50%
Arm 3: 76.00%
Arm 4: 4.00%
Arm 5: 10.00%
Arm 6: 2.50%
Average Reward: 3.17

After 300 actions:
Arm 1: 5.33%
Arm 2: 1.33%
Arm 3: 51.33%
Arm 4: 3.00%
Arm 5: 37.00%
Arm 6: 2.00%
Average Reward: 3.25

After 400 actions:
Arm 1: 4.25%
Arm 2: 1.25%
Arm 3: 39.25%
Arm 4: 2.75%
Arm 5: 50.75%
Arm 6: 1.75%
Average Reward: 3.30

After 500 actions:
Arm 1: 3.60%
Arm 2: 1.40%
Arm 3: 31.60%
Arm 4: 2.40%
Arm 5: 59.20%
Arm 6: 1.80%
Average Reward: 3.30

After 600 actions:
Arm 1: 3.33%
Arm 2: 1.50%
Arm 3: 26.67%
Arm 4: 2.17%
Arm 5: 64.50%
Arm 6: 1.83%
Average Reward: 3.32

After 700 actions:
Arm 1: 2.86%
Arm 2: 1.29%
Arm 3: 23.14%
Arm 4: 2.43%
Arm 5: 68.71%
Arm 6: 1.57%
Average Reward: 3.34

After 800 actions:
Arm 1: 2.88%
Arm 2: 1.75%
Arm 3: 20.50%
Arm 4: 2.38%
Arm 5: 70.88%
Arm 6: 1.62%
Average Reward: 3.33

After 900 actions:
Arm 1: 2.78%
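
The recursive update `Q += (reward - Q) / N` used in `update_action_value` should reproduce the ordinary sample mean without storing past rewards. A minimal sketch of that equivalence follows; `incremental_mean` is a hypothetical helper, and the reward range of arm 3 and the seed are arbitrary choices for the demonstration.

```python
import random


def incremental_mean(rewards):
    """Apply Q <- Q + (r - Q) / N once per reward, starting from Q = 0."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n
    return q


rng = random.Random(0)  # arbitrary seed
rewards = [rng.uniform(2, 5) for _ in range(1000)]  # arm 3's reward range
batch_mean = sum(rewards) / len(rewards)
print(abs(incremental_mean(rewards) - batch_mean) < 1e-9)  # True
```

Because the first update divides by N = 1, the initial value Q = 0 is overwritten by the first reward and does not bias the estimate.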


## 2.4
_Redo the experiment, but after 2000 steps sample the rewards of the fourth arm uniformly from [5, 7)!_

_Compare updating action values by computing the sample average reward of each action recursively (as done in 2.3) with using a constant learning rate α=0.01!_

_For every 100 actions show the percentage of choosing arm 1, arm 2, arm 3, arm 4, arm 5, and arm 6 as well as the resulting average reward!_

In [20]:
import random

class Bandit:
    def __init__(self, change_arm_reward=False):
        self.arms = [
            {'min': 1, 'max': 3},  # Arm 1
            {'min': -3, 'max': 8},  # Arm 2
            {'min': 2, 'max': 5},  # Arm 3
            {'min': -2, 'max': 6},  # Arm 4 (initially)
            {'min': 3, 'max': 4},  # Arm 5
            {'min': -2, 'max': 2}   # Arm 6
        ]
        self.Q = [0] * len(self.arms)  # Initialize action values to 0
        self.N = [0] * len(self.arms)  # Initialize action counts to 0
        self.total_reward = 0
        self.change_arm_reward = change_arm_reward

    def pull_arm(self, arm_index):
        if self.change_arm_reward and sum(self.N) >= 2000 and arm_index == 3:
            # After 2000 steps, change rewards of arm 4
            return random.uniform(5, 7)
        else:
            arm = self.arms[arm_index]
            return random.uniform(arm['min'], arm['max'])

    def update_action_value_recursive(self, arm_index, reward):
        self.N[arm_index] += 1
        self.total_reward += reward
        self.Q[arm_index] += (reward - self.Q[arm_index]) / self.N[arm_index]

    def update_action_value_constant_lr(self, arm_index, reward, alpha=0.01):
        self.N[arm_index] += 1
        self.total_reward += reward
        self.Q[arm_index] += alpha * (reward - self.Q[arm_index])

# Function to perform n actions using ε-greedy selection strategy
def epsilon_greedy_bandit(bandit, n, epsilon, update_method='recursive'):
    arm_percentages = [0] * len(bandit.arms)
    avg_reward_history = []

    for i in range(1, n + 1):
        if random.random() < epsilon:  # Exploration: Choose a random arm
            arm_index = random.randint(0, len(bandit.arms) - 1)
        else:  # Exploitation: Choose the arm with the highest action value
            arm_index = bandit.Q.index(max(bandit.Q))

        reward = bandit.pull_arm(arm_index)
        
        if update_method == 'recursive':
            bandit.update_action_value_recursive(arm_index, reward)
        elif update_method == 'constant_lr':
            bandit.update_action_value_constant_lr(arm_index, reward)

        # Update arm percentages
        total_pulls = sum(bandit.N)
        arm_percentages = [count / total_pulls for count in bandit.N]

        # Record average reward every 100 actions
        if i % 100 == 0:
            avg_reward = bandit.total_reward / i
            avg_reward_history.append(avg_reward)
            print(f"After {i} actions:")
            for j, percentage in enumerate(arm_percentages):
                print(f"Arm {j + 1}: {percentage:.2%}")
            print(f"Average Reward: {avg_reward:.2f}")
            print()

    return avg_reward_history

# Initialize the bandit
bandit_recursive = Bandit(change_arm_reward=True)
bandit_constant_lr = Bandit(change_arm_reward=True)

# Perform 4000 actions using ε-greedy strategy with ε=0.1 and update action values recursively
epsilon = 0.1
total_actions = 4000
print("Using recursive update method:")
avg_reward_history_recursive = epsilon_greedy_bandit(bandit_recursive, total_actions, epsilon, update_method='recursive')

# Perform 4000 actions using ε-greedy strategy with ε=0.1 and update action values with constant learning rate
print("Using constant learning rate update method:")
avg_reward_history_constant_lr = epsilon_greedy_bandit(bandit_constant_lr, total_actions, epsilon, update_method='constant_lr')


Using recursive update method:
After 100 actions:
Arm 1: 10.00%
Arm 2: 5.00%
Arm 3: 83.00%
Arm 4: 1.00%
Arm 5: 1.00%
Arm 6: 0.00%
Average Reward: 3.34

After 200 actions:
Arm 1: 5.50%
Arm 2: 3.00%
Arm 3: 87.50%
Arm 4: 2.00%
Arm 5: 1.00%
Arm 6: 1.00%
Average Reward: 3.26

After 300 actions:
Arm 1: 4.67%
Arm 2: 2.33%
Arm 3: 89.33%
Arm 4: 2.00%
Arm 5: 0.67%
Arm 6: 1.00%
Average Reward: 3.29

After 400 actions:
Arm 1: 4.00%
Arm 2: 2.00%
Arm 3: 87.25%
Arm 4: 1.50%
Arm 5: 4.25%
Arm 6: 1.00%
Average Reward: 3.30

After 500 actions:
Arm 1: 3.20%
Arm 2: 1.80%
Arm 3: 88.80%
Arm 4: 1.40%
Arm 5: 3.80%
Arm 6: 1.00%
Average Reward: 3.32

After 600 actions:
Arm 1: 2.83%
Arm 2: 1.83%
Arm 3: 82.17%
Arm 4: 1.17%
Arm 5: 10.50%
Arm 6: 1.50%
Average Reward: 3.30

After 700 actions:
Arm 1: 2.86%
Arm 2: 2.29%
Arm 3: 75.57%
Arm 4: 1.57%
Arm 5: 16.29%
Arm 6: 1.43%
Average Reward: 3.30

After 800 actions:
Arm 1: 2.75%
Arm 2: 2.25%
Arm 3: 77.00%
Arm 4: 1.62%
Arm 5: 14.88%
Arm 6: 1.50%
Average Reward: 3.31

After
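
To see how the two update rules react to the change of arm 4, whose expected reward moves from (-2 + 6) / 2 = 2 to (5 + 7) / 2 = 6, it can help to look at the expected estimate after k pulls following the change. The sketch below uses closed-form expectations rather than a simulation. It assumes, for simplicity, that both estimates equal the old mean at the moment of the change, and the pull counts `n_old` are hypothetical: 33 is roughly 2000 · 0.1 / 6, the number of purely exploratory pulls of one arm in 2000 steps.

```python
def expected_q_sample_average(old_mean, new_mean, n_old, k):
    """Expected sample average after n_old pulls at old_mean and k at new_mean."""
    return (n_old * old_mean + k * new_mean) / (n_old + k)


def expected_q_constant_alpha(q_start, new_mean, alpha, k):
    """Expected Q after k updates Q <- Q + alpha * (r - Q) with E[r] = new_mean."""
    return new_mean + (q_start - new_mean) * (1 - alpha) ** k


OLD_MEAN, NEW_MEAN = 2.0, 6.0  # arm 4 before and after step 2000
for n_old in (33, 1000):       # hypothetical pull counts before the change
    for k in (10, 50, 200):    # pulls of arm 4 after the change
        avg = expected_q_sample_average(OLD_MEAN, NEW_MEAN, n_old, k)
        const = expected_q_constant_alpha(OLD_MEAN, NEW_MEAN, 0.01, k)
        print(f"n_old={n_old:>4}, k={k:>3}: "
              f"sample average {avg:.2f}, constant alpha {const:.2f}")
```

Under these assumptions the constant learning rate forgets old rewards at a fixed rate regardless of history, while the sample average adapts more slowly the more pulls it has already accumulated. With few earlier pulls the sample average can actually move faster than α = 0.01, so which rule wins in a given run depends on how often arm 4 was tried before the change.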

## 2.5
_Modify your implementation by using an optimistic initialization Q(ai)=5 and a greedy action selection strategy, still using a constant learning rate α=0.01!_

_For every 100 actions show the percentage of choosing arm 1, arm 2, arm 3, arm 4, arm 5, and arm 6 as well as the resulting average reward!_

_Compare this to your result from 2.4)!_

In [21]:
import random

class Bandit:
    def __init__(self):
        self.arms = [
            {'min': 1, 'max': 3},  # Arm 1
            {'min': -3, 'max': 8},  # Arm 2
            {'min': 2, 'max': 5},  # Arm 3
            {'min': -2, 'max': 6},  # Arm 4
            {'min': 3, 'max': 4},  # Arm 5
            {'min': -2, 'max': 2}   # Arm 6
        ]
        self.Q = [5] * len(self.arms)  # Optimistic initialization: Set initial action values to 5
        self.N = [0] * len(self.arms)  # Initialize action counts to 0
        self.total_reward = 0

    def pull_arm(self, arm_index):
        arm = self.arms[arm_index]
        return random.uniform(arm['min'], arm['max'])

    def update_action_value_constant_lr(self, arm_index, reward, alpha=0.01):
        self.N[arm_index] += 1
        self.total_reward += reward
        self.Q[arm_index] += alpha * (reward - self.Q[arm_index])

# Function to perform n actions using greedy action selection strategy and constant learning rate
def greedy_bandit(bandit, n, alpha=0.01):
    arm_percentages = [0] * len(bandit.arms)
    avg_reward_history = []

    for i in range(1, n + 1):
        arm_index = bandit.Q.index(max(bandit.Q))  # Greedy action selection: Choose the arm with the highest action value
        reward = bandit.pull_arm(arm_index)
        bandit.update_action_value_constant_lr(arm_index, reward, alpha)  # Pass alpha through instead of relying on the default

        # Update arm percentages
        total_pulls = sum(bandit.N)
        arm_percentages = [count / total_pulls for count in bandit.N]

        # Record average reward every 100 actions
        if i % 100 == 0:
            avg_reward = bandit.total_reward / i
            avg_reward_history.append(avg_reward)
            print(f"After {i} actions:")
            for j, percentage in enumerate(arm_percentages):
                print(f"Arm {j + 1}: {percentage:.2%}")
            print(f"Average Reward: {avg_reward:.2f}")
            print()

    return avg_reward_history

# Initialize the bandit
bandit = Bandit()

# Perform 4000 actions using greedy action selection strategy with constant learning rate α=0.01
total_actions = 4000
print("Using greedy action selection strategy:")
avg_reward_history_greedy = greedy_bandit(bandit, total_actions)


Using greedy action selection strategy:
After 100 actions:
Arm 1: 12.00%
Arm 2: 19.00%
Arm 3: 25.00%
Arm 4: 11.00%
Arm 5: 26.00%
Arm 6: 7.00%
Average Reward: 2.81

After 200 actions:
Arm 1: 12.00%
Arm 2: 14.50%
Arm 3: 27.00%
Arm 4: 12.50%
Arm 5: 27.00%
Arm 6: 7.00%
Average Reward: 2.83

After 300 actions:
Arm 1: 11.00%
Arm 2: 13.67%
Arm 3: 30.33%
Arm 4: 11.00%
Arm 5: 27.67%
Arm 6: 6.33%
Average Reward: 2.91

After 400 actions:
Arm 1: 10.00%
Arm 2: 15.50%
Arm 3: 29.25%
Arm 4: 10.75%
Arm 5: 28.50%
Arm 6: 6.00%
Average Reward: 2.92

After 500 actions:
Arm 1: 9.60%
Arm 2: 17.00%
Arm 3: 29.00%
Arm 4: 9.60%
Arm 5: 29.40%
Arm 6: 5.40%
Average Reward: 2.93

After 600 actions:
Arm 1: 9.17%
Arm 2: 15.00%
Arm 3: 30.67%
Arm 4: 8.67%
Arm 5: 31.50%
Arm 6: 5.00%
Average Reward: 2.94

After 700 actions:
Arm 1: 8.29%
Arm 2: 13.29%
Arm 3: 32.43%
Arm 4: 8.00%
Arm 5: 33.43%
Arm 6: 4.57%
Average Reward: 2.99

After 800 actions:
Arm 1: 7.62%
Arm 2: 11.75%
Arm 3: 36.62%
Arm 4: 7.12%
Arm 5: 32.75%
Arm 6: 4.12
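
The long phase in which all arms are still being pulled can be made plausible with a rough calculation. With a constant learning rate, the expected estimate of an arm with mean μ after k pulls is μ + (5 − μ)(1 − α)^k, so with α = 0.01 the optimistic value 5 decays slowly. The sketch below computes how many pulls it takes, in expectation, until an arm's estimate drops below a threshold. The helper `pulls_until_below` and the threshold of 3.5 (the expected reward of arms 3 and 5) are assumptions made for this illustration.

```python
import math


def pulls_until_below(q_init, arm_mean, threshold, alpha):
    """Smallest k with arm_mean + (q_init - arm_mean) * (1 - alpha)**k < threshold.

    Assumes q_init > threshold. Returns None if the expected estimate
    never drops below the threshold (arm_mean >= threshold).
    """
    if arm_mean >= threshold:
        return None
    ratio = (threshold - arm_mean) / (q_init - arm_mean)
    return math.floor(math.log(ratio) / math.log(1 - alpha)) + 1


ARM_MEANS = [2.0, 2.5, 3.5, 2.0, 3.5, 0.0]  # expected rewards from 2.1
for arm, mean in enumerate(ARM_MEANS, start=1):
    print(f"Arm {arm}: {pulls_until_below(5.0, mean, 3.5, 0.01)}")
```

This gives about 36 pulls for arm 6, 69 for arms 1 and 4, and 92 for arm 2, which is in the same range as the pull counts implied by the recorded percentages at 800 actions (for example, 7.62% of 800 is about 61 pulls of arm 1). It is only an expectation-level estimate; individual runs fluctuate around it.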