# Epsilon-Greedy: a simple implementation

The code below initializes the variables the algorithm needs, loops over a fixed number of plays, and selects an arm on each play using the Epsilon-Greedy rule.

It then simulates a reward for the chosen arm and updates that arm's Q-value (its estimated reward) and N-value (its pull count). After all plays are complete, it prints the final Q-values.

    1. Initialize a list of N arms, each with an unknown reward probability.

    2. Set a value for epsilon, which represents the probability of exploration. For example, if epsilon = 0.1, then 10% of the time we will explore a random arm, and 90% of the time we will choose the arm with the highest estimated reward.

    3. For each round or iteration:
        a. With probability epsilon, choose a random arm to explore.
        b. Otherwise, choose the arm with the highest estimated reward.
        c. Observe the reward from the chosen arm.
        d. Update the estimated reward for the chosen arm based on the observed reward.

    4. Repeat step 3 for a fixed number of rounds or until convergence.
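
The selection rule in steps a and b of each round can be sketched as a small function. This is only an illustrative sketch, not the notebook's own code: the name `select_arm`, the `rng` parameter, and the example estimates are assumptions introduced here.

```python
import random

def select_arm(q_values, epsilon, rng=random):
    """Return an arm index chosen with the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: pick the arm with the highest estimate (first one on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])

estimates = [0.2, 0.8, 0.5]  # made-up estimates for three arms
print(select_arm(estimates, epsilon=0.0))  # epsilon = 0 never explores, so this prints 1
```

With `epsilon=1.0` the same function would always explore, which makes the two extremes easy to check by hand.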

Here are the variables you will need (the code below uses slightly different names, given in parentheses):

    N: the number of arms (num_arms)
    epsilon: the probability of exploration (epsilon)
    Q: a list of length N to store the estimated reward for each arm (q_values)
    N_pulls: a list of length N to store the number of times each arm has been pulled (n_values)
    a rule for calculating the average reward of each arm (the incremental update described next)

To implement step 3d, you will need to update the estimated reward for the chosen arm using the following formula:
Q[a] = Q[a] + (r - Q[a]) / N_pulls[a]

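where a is the index of the chosen arm, r is the observed reward, and N_pulls[a] is the number of times the arm has been pulled, including the current pull. In other words, increment N_pulls[a] first and then apply the update; otherwise the very first update divides by zero. With this convention, Q[a] always equals the average of all rewards observed from arm a so far.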
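
To see why this update is just a running average, here is a small worked example. The reward values are made up for illustration, and the variable names `q`, `n_pulls`, and `batch_mean` are placeholders rather than names from the notebook.

```python
rewards = [1.0, 0.0, 1.0, 1.0]  # made-up rewards observed from a single arm

q = 0.0       # running estimate for the arm
n_pulls = 0   # number of pulls so far

for r in rewards:
    n_pulls += 1              # count the current pull first
    q += (r - q) / n_pulls    # incremental mean update

batch_mean = sum(rewards) / len(rewards)
print(q, batch_mean)  # the two values agree up to floating-point rounding
```

The incremental form needs only the previous estimate and the pull count, so the full reward history never has to be stored.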

In [None]:
import random

# Initialize variables
num_arms = 10
epsilon = 0.1
q_values = [0] * num_arms
n_values = [0] * num_arms

# Loop for each play
for play in range(1000):

    # Epsilon-Greedy action selection
    if random.random() < epsilon:
        # Choose a random arm
        arm = random.randrange(num_arms)
    else:
        # Choose the arm with the highest Q-value
        arm = q_values.index(max(q_values))

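    # Simulate reward for chosen arm. Every arm draws from the same N(0, 1)
    # distribution here, so no arm is truly better than the others.
    reward = random.gauss(0, 1)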

    # Update Q-value and N-value for chosen arm
    n_values[arm] += 1
    q_values[arm] += (1/n_values[arm]) * (reward - q_values[arm])

# Print final Q-values
print(q_values)
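
Because the simulation above draws every reward from the same distribution, there is no best arm for the algorithm to find. A variant in which each arm has its own mean reward might look like the sketch below. The function name `run_bandit`, the `true_means` values, and the fixed seed are all assumptions made for illustration; they are not part of the original code.

```python
import random

def run_bandit(true_means, epsilon=0.1, num_plays=1000, seed=0):
    """Run epsilon-greedy on Gaussian arms with the given (hypothetical) means."""
    rng = random.Random(seed)        # private generator so runs are repeatable
    num_arms = len(true_means)
    q_values = [0.0] * num_arms      # estimated reward per arm
    n_values = [0] * num_arms        # pull count per arm

    for _ in range(num_plays):
        if rng.random() < epsilon:
            arm = rng.randrange(num_arms)            # explore
        else:
            arm = q_values.index(max(q_values))      # exploit

        # Reward is noisy around the chosen arm's own mean
        reward = rng.gauss(true_means[arm], 1.0)

        n_values[arm] += 1
        q_values[arm] += (reward - q_values[arm]) / n_values[arm]

    return q_values, n_values

true_means = [0.0, 0.5, 2.0]  # made-up means; the last arm is the best one
q_values, n_values = run_bandit(true_means)
print(q_values)
print(n_values)
```

In a run like this one would typically expect the arm with the highest mean to collect most of the pulls and its Q-value to sit close to its true mean, while rarely pulled arms keep noisier estimates.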
