<center>

# Matthew Pronyshyn 1002365978

</center>


In [7]:
import numpy as np
import scipy.stats as stats

# Number of bandits
num_bandits = 3
bandit_names = ['Red', 'Blue', 'Other']  # Names for the bandits

# Fixed probabilities of success for each bandit
true_probabilities = {
    'Red': 0.25,  # 25% chance of success
    'Blue': 0.5,  # 50% chance of success
    'Other': 0.75  # 75% chance of success
}

# Prior parameters for the Beta distribution (uninformative uniform prior)
alpha = np.ones(num_bandits)
beta = np.ones(num_bandits)

# Update function for posterior
def update_posterior(bandit_index, result):
    if result: # If the result was a success
        alpha[bandit_index] += 1
    else: # If the result was a failure
        beta[bandit_index] += 1

# Choose bandit function
def choose_bandit():
    sampled_theta = [stats.beta.rvs(a, b) for a, b in zip(alpha, beta)]
    return np.argmax(sampled_theta)

# Perform trial function
def perform_trial(bandit_index):
    bandit = bandit_names[bandit_index]
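    # Simulate a single Bernoulli trial using the true probability of success.
    # Return a plain bool rather than a length-1 array so the `if result:` check
    # in update_posterior is unambiguous.
    return bool(stats.bernoulli(p=true_probabilities[bandit]).rvs())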

# Main experiment loop
number_of_trials = 100  # Define the number of trials
for t in range(number_of_trials):
    chosen_bandit_index = choose_bandit()
    result = perform_trial(chosen_bandit_index)
    update_posterior(chosen_bandit_index, result)

    # Report the updated hyperparameters and posterior mean of the chosen bandit
    a, b = alpha[chosen_bandit_index], beta[chosen_bandit_index]
    print(f'After trial {t+1}, bandit {bandit_names[chosen_bandit_index]}: '
          f'Alpha {a}, Beta {b}, Expectation {round(a / (a + b), 2)}')

# Compare the posterior means with the true success probabilities
post_expectations = alpha / (alpha + beta)
summary = ', '.join(
    f'"{name}" {true_probabilities[name]}/{round(post_expectations[i], 2)}'
    for i, name in enumerate(bandit_names)
)
print(f'True/Posterior expectations: {summary}')
print(f'Bandit "{bandit_names[np.argmax(post_expectations)]}" has the highest posterior expectation.')

After trial 1, bandit Other: Alpha 1.0, Beta 2.0, Expectation 0.33
After trial 2, bandit Blue: Alpha 2.0, Beta 1.0, Expectation 0.67
After trial 3, bandit Blue: Alpha 3.0, Beta 1.0, Expectation 0.75
After trial 4, bandit Blue: Alpha 3.0, Beta 2.0, Expectation 0.6
After trial 5, bandit Blue: Alpha 4.0, Beta 2.0, Expectation 0.67
After trial 6, bandit Blue: Alpha 5.0, Beta 2.0, Expectation 0.71
After trial 7, bandit Blue: Alpha 5.0, Beta 3.0, Expectation 0.62
After trial 8, bandit Blue: Alpha 6.0, Beta 3.0, Expectation 0.67
After trial 9, bandit Blue: Alpha 7.0, Beta 3.0, Expectation 0.7
After trial 10, bandit Blue: Alpha 8.0, Beta 3.0, Expectation 0.73
After trial 11, bandit Blue: Alpha 8.0, Beta 4.0, Expectation 0.67
After trial 12, bandit Red: Alpha 1.0, Beta 2.0, Expectation 0.33
After trial 13, bandit Other: Alpha 1.0, Beta 3.0, Expectation 0.25
After trial 14, bandit Red: Alpha 1.0, Beta 3.0, Expectation 0.25
After trial 15, bandit Blue: Alpha 9.0, Beta 4.0, Expectation 0.69
After 

The algorithm for the multi-armed Bayesian bandit problem begins by assigning each bandit a Beta prior with both alpha and beta set to 1. Beta(1, 1) is the uniform distribution on [0, 1], so before any data is seen every value of the success probability is considered equally plausible for every bandit. In each trial, one bandit is chosen and its outcome, success or failure, is observed. The outcome is a Bernoulli draw whose success probability is fixed for each bandit (0.25, 0.5 and 0.75 here) and is not known to the algorithm.
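As a quick check of the prior described above, the following sketch evaluates the Beta(1, 1) density at a few interior points and computes its mean. The names `prior`, `grid`, `prior_density` and `prior_mean` are illustrative and do not appear in the cell above.

```python
import numpy as np
from scipy import stats

# Beta(1, 1) is the uniform distribution on [0, 1]
prior = stats.beta(1, 1)

# Evaluate the density at a few interior points; a flat prior gives 1 everywhere
grid = np.linspace(0.1, 0.9, 5)
prior_density = prior.pdf(grid)

# The prior mean is alpha / (alpha + beta) = 1 / 2
prior_mean = prior.mean()

print(prior_density)
print(prior_mean)
```

A flat density means the prior does not favour any bandit, so the early choices are driven almost entirely by sampling noise.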

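After each trial, the algorithm updates its belief about the chosen bandit's probability of success. Because the Beta distribution is conjugate to the Bernoulli likelihood, the posterior is again a Beta distribution and the update reduces to adjusting its hyperparameters: alpha is increased by 1 for a success, and beta is increased by 1 for a failure. The bandit for each trial is selected by drawing one sample from the current posterior of every bandit and choosing the bandit with the highest sampled value, a strategy known as Thompson sampling. It balances exploration and exploitation: a bandit with a high posterior mean usually produces the largest sample, but a bandit that has been tried only a few times still has a wide posterior and will occasionally produce the largest sample and be tried again.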
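The hyperparameter update can be derived in one line. Writing $\theta$ for a bandit's success probability and $x \in \{0, 1\}$ for the observed outcome (this notation is introduced here for illustration), the posterior after one trial is:

$$
\begin{aligned}
p(\theta \mid x) &\propto \theta^{x}(1-\theta)^{1-x} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1} \\
&= \theta^{(\alpha+x)-1}(1-\theta)^{(\beta+1-x)-1},
\end{aligned}
\qquad\text{so}\qquad
\theta \mid x \sim \mathrm{Beta}(\alpha + x,\; \beta + 1 - x).
$$

The posterior mean is $(\alpha + x)/(\alpha + \beta + 1)$. For example, in trial 2 of the output above, Blue starts at Beta(1, 1) and records a success, giving Beta(2, 1) with mean $2/3 \approx 0.67$.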

Over successive trials, the algorithm refines its estimate of each bandit's success probability. The posterior of the most frequently chosen bandit concentrates around its true success probability, while bandits that are rarely chosen keep wide, imprecise posteriors. Given enough trials, the algorithm generally comes to prefer the bandit with the highest true probability of success, provided the true probabilities are not very close.
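To illustrate this long-run behaviour, here is a minimal, self-contained sketch that repeats the same procedure for more trials and counts how often each bandit is chosen. It is not the cell above: the function `run_thompson`, the seed and the choice of 2000 trials are assumptions made for illustration, and it uses NumPy's `Generator` in place of `scipy.stats` so that the run is reproducible.

```python
import numpy as np

def run_thompson(true_probs, n_trials, seed=0):
    """Thompson sampling with Beta(1, 1) priors (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    true_probs = np.asarray(true_probs, dtype=float)
    alpha = np.ones(len(true_probs))   # 1 + number of successes per bandit
    beta = np.ones(len(true_probs))    # 1 + number of failures per bandit
    pulls = np.zeros(len(true_probs), dtype=int)

    for _ in range(n_trials):
        # Draw one sample per bandit from its posterior and pick the largest
        arm = int(np.argmax(rng.beta(alpha, beta)))
        # Simulate a Bernoulli outcome with the bandit's true probability
        success = rng.random() < true_probs[arm]
        if success:
            alpha[arm] += 1
        else:
            beta[arm] += 1
        pulls[arm] += 1

    # Return the pull counts and the posterior means
    return pulls, alpha / (alpha + beta)

# Same true probabilities as Red, Blue and Other in the cell above
pulls, post_means = run_thompson([0.25, 0.5, 0.75], n_trials=2000)
print('Pulls per bandit:', pulls)
print('Posterior means:', post_means.round(2))
```

With probabilities this far apart, most pulls should go to the third bandit, and its posterior mean should be the most precise of the three, since it is based on the most data.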