# Gaussian Bandit (Slot Machine) Simulator

In this notebook, we first create a Python class called `GaussianBandit` to simulate bandit machines (also known as slot machines) whose rewards follow a Gaussian (normal) distribution. The class lets us model the uncertainty and randomness in each machine's rewards. Each instance of `GaussianBandit` represents one machine, so creating several instances with different means and standard deviations gives a set of machines, each with its own reward distribution.

## Multi-Armed Bandit Problem
This code models the environment of the "Multi-Armed Bandit" problem. In this classic problem, an agent (player) faces multiple bandit machines and must decide which machine's lever to pull in each round to maximize the total reward over time.

The challenge arises from the uncertainty in each machine's reward distribution. As rewards are generated stochastically, the agent cannot know the true mean reward of each machine initially. The agent's objective is to learn and adapt its strategy over time to make better decisions, balancing exploration (trying different machines) and exploitation (choosing machines that appear to have high rewards based on current information).
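
One common way to write this down is shown below. The notation is not part of the notebook's code, so treat the symbols as assumptions: $a_t$ is the machine chosen in round $t$, $r_t$ is the reward received, and $N_a(t)$ is the number of times machine $a$ has been pulled up to round $t$. The agent tries to maximize the expected total reward over $T$ rounds, and a natural estimate of each machine's unknown mean (defined once $N_a(t) > 0$) is the sample average of the rewards that machine has produced so far:

$$
\max \; \mathbb{E}\left[\sum_{t=1}^{T} r_t\right],
\qquad
\hat{\mu}_a(t) = \frac{1}{N_a(t)} \sum_{s=1}^{t} r_s \, \mathbf{1}[a_s = a]
$$

In these terms, exploitation means choosing the machine with the largest $\hat{\mu}_a(t)$, and exploration means pulling other machines so that their estimates become more reliable.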

### Code Description

#### Constructor:
- The `GaussianBandit` class has a constructor `__init__(self, mean=0, stdev=1)`, which takes two optional parameters:
  - `mean`: The mean (average) of the Gaussian distribution representing the machine's reward. Default value is 0.
  - `stdev`: The standard deviation of the Gaussian distribution, which controls the spread or variability of the rewards. Default value is 1.

#### Method: `pull_lever()`
- The `pull_lever()` method simulates pulling the lever of the bandit machine and returns a reward.
- It generates a random reward from the Gaussian distribution with the specified mean and standard deviation using NumPy's `np.random.normal()` function.
- The reward is then rounded to one decimal place to provide a realistic output, as slot machines often have discrete reward values.

In [1]:
import numpy as np

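class GaussianBandit:
    """A slot machine whose rewards follow a Gaussian (normal) distribution."""

    def __init__(self, mean=0, stdev=1):
        self.mean = mean    # mean of the reward distribution
        self.stdev = stdev  # standard deviation (spread) of the rewards

    def pull_lever(self):
        # Draw one reward and round it to one decimal place.
        reward = np.random.normal(self.mean, self.stdev)
        return np.round(reward, 1)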
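
As a quick sanity check on the class above, the sketch below pulls one machine many times and compares the sample mean and standard deviation of the rewards with the parameters the machine was created with. The class is repeated in compact form so the snippet runs on its own. The seed, the choice of `mean=5, stdev=3`, and the number of pulls are arbitrary assumptions for illustration.

```python
import numpy as np

class GaussianBandit:
    # Compact copy of the class defined above, so this snippet is self-contained.
    def __init__(self, mean=0, stdev=1):
        self.mean = mean
        self.stdev = stdev

    def pull_lever(self):
        return np.round(np.random.normal(self.mean, self.stdev), 1)

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

bandit = GaussianBandit(mean=5, stdev=3)
rewards = np.array([bandit.pull_lever() for _ in range(10_000)])

sample_mean = rewards.mean()
sample_stdev = rewards.std()
print(f"sample mean:  {sample_mean:.2f} (true mean 5)")
print(f"sample stdev: {sample_stdev:.2f} (true stdev 3)")
```

After only a handful of pulls the sample mean can be far from 5. After thousands it settles close to it, which is why a player needs repeated pulls to judge a machine.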

# Slot Machine Interaction

We'll expand upon the previously defined `GaussianBandit` class and introduce a new class called `GaussianBanditGame`. This new class simulates a game environment in which the player interacts with a set of bandit machines, making choices to pull levers and observe rewards.

### Code Description

#### Constructor:
- The `GaussianBanditGame` class has a constructor `__init__(self, bandits)`, which takes a list of `GaussianBandit` instances as input. These instances represent the different bandit machines available in the game.
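- The constructor shuffles the list of bandit machines in place (the list passed in by the caller is reordered too), so the player cannot tell which machine is which from the order in which they were created.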

#### Method: `play(choice)`
- The `play(choice)` method allows the player to pull the lever of a chosen bandit machine and obtain a reward.
- The `choice` parameter indicates the index of the selected bandit machine (1-based index).
- The method returns the reward obtained from the selected machine.

#### Method: `user_play()`
- The `user_play()` method initiates the game and allows the user to interact with the bandit machines.
- It starts the game loop where the user can make choices to pull levers and observe rewards.
- After each round, the method displays the obtained reward and the player's average reward so far.
- The game continues until the user enters 0 or any other number outside the valid range.
- Upon ending the game, it displays the total reward earned and the average reward per round.

#### Method: `reset_game()`
- The `reset_game()` method resets the game's internal state, clearing previous rewards and statistics, and prepares the game for a new session.

In [14]:
class GaussianBanditGame:
    """A game in which a player pulls levers on a set of bandit machines."""

    def __init__(self, bandits):
        self.bandits = bandits
        # Shuffle in place so the player cannot rely on the creation order.
        np.random.shuffle(self.bandits)
        self.reset_game()

    def play(self, choice):
        # `choice` is 1-based, so subtract 1 to index the list.
        reward = self.bandits[choice - 1].pull_lever()
        self.rewards.append(reward)
        self.total_reward += reward
        self.n_played += 1
        return reward

    def user_play(self):
        self.reset_game()
        print("Game started. Enter 0 as input to end the game")
        while True:
            print(f"\n -- Round {self.n_played} -- ")
            try:
                choice = int(input(f"Choose a machine from 1 to {len(self.bandits)}: "))
            except ValueError:
                break  # non-numeric input also ends the game
            if choice in range(1, len(self.bandits) + 1):
                reward = self.play(choice)
                print(f"Machine {choice} gave a reward of {reward}")
                avg_rew = self.total_reward / self.n_played
                print(f"Your average reward so far is {avg_rew}")
            else:
                break  # 0 or any out-of-range number ends the game
        print("Game has ended.")
        if self.n_played > 0:
            print(f"Total reward is {self.total_reward} after {self.n_played} round(s).")
            avg_rew = self.total_reward / self.n_played
            print(f"Average reward is {avg_rew}.")

    def reset_game(self):
        # Clear all statistics from any previous session.
        self.rewards = []
        self.total_reward = 0
        self.n_played = 0

In [17]:
slotA = GaussianBandit(5, 3)
slotB = GaussianBandit(6, 2)
slotC = GaussianBandit(1, 5)
game = GaussianBanditGame([slotA, slotB, slotC])
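
`play()` can also be called directly, without the interactive prompt. The sketch below is one hypothetical way to do that: it pulls each machine the same number of times (pure exploration) and reports the average reward per machine. Compact copies of the two classes are included so the snippet runs on its own. The helper name `explore_evenly`, the seed, and `pulls_per_machine` are assumptions, not part of the notebook.

```python
import numpy as np

class GaussianBandit:
    # Compact copy of the class defined earlier.
    def __init__(self, mean=0, stdev=1):
        self.mean, self.stdev = mean, stdev

    def pull_lever(self):
        return np.round(np.random.normal(self.mean, self.stdev), 1)

class GaussianBanditGame:
    # Compact copy with only the parts needed for non-interactive play.
    def __init__(self, bandits):
        self.bandits = bandits
        np.random.shuffle(self.bandits)
        self.rewards, self.total_reward, self.n_played = [], 0, 0

    def play(self, choice):
        reward = self.bandits[choice - 1].pull_lever()
        self.rewards.append(reward)
        self.total_reward += reward
        self.n_played += 1
        return reward

def explore_evenly(game, pulls_per_machine):
    """Hypothetical helper: pull every machine equally often and
    return the average reward observed for each (1-based) choice."""
    averages = {}
    for choice in range(1, len(game.bandits) + 1):
        rewards = [game.play(choice) for _ in range(pulls_per_machine)]
        averages[choice] = float(np.mean(rewards))
    return averages

np.random.seed(1)  # arbitrary seed for repeatability
game = GaussianBanditGame([GaussianBandit(5, 3), GaussianBandit(6, 2), GaussianBandit(1, 5)])
averages = explore_evenly(game, pulls_per_machine=200)
for choice, avg in averages.items():
    print(f"Machine {choice}: average reward {avg:.2f} over 200 pulls")
```

Because the constructor shuffles the machines, the choice numbers do not reveal which machine was created with which parameters. Only the observed averages do.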

In [18]:
game.user_play()

Game started. Enter 0 as input to end the game

 -- Round 0 -- 
Machine 2 gave a reward of 5.7
Your average reward so far is 5.7

 -- Round 1 -- 
Machine 2 gave a reward of 7.4
Your average reward so far is 6.550000000000001

 -- Round 2 -- 
Machine 2 gave a reward of 2.7
Your average reward so far is 5.266666666666667

 -- Round 3 -- 
Machine 2 gave a reward of 1.8
Your average reward so far is 4.4

 -- Round 4 -- 
Machine 2 gave a reward of 2.7
Your average reward so far is 4.0600000000000005

 -- Round 5 -- 
Game has ended.
Total reward is 20.3 after 5 round(s).
Average reward is 4.0600000000000005.


The Multi-Armed Bandit problem is the challenge of choosing among multiple slot machines (bandit machines) that provide random rewards, without knowing their true reward distributions. The player's objective is to maximize the total reward over multiple rounds of play by balancing exploration (trying different machines to learn about their rewards) and exploitation (choosing the machine that seems best based on current information).

In the session above, the chosen machine paid 5.7 and 7.4 on its first two pulls (an average of 6.55), but the next three pulls (2.7, 1.8, and 2.7) brought the average down to 4.06. Early rewards therefore may not represent a machine's true average reward, and continued exploration is needed to make informed decisions. The game highlights the importance of data-driven strategies and continuous learning when the machines' rewards are uncertain and stochastic.
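
The running averages printed in the session above can be reproduced directly from the five recorded rewards. This is a small sketch using NumPy's `cumsum`; the variable names are illustrative.

```python
import numpy as np

# Rewards recorded in the session above (all from the machine labeled 2).
rewards = np.array([5.7, 7.4, 2.7, 1.8, 2.7])

# Running average after each pull: cumulative sum divided by pulls so far.
pulls = np.arange(1, len(rewards) + 1)
running_avg = np.cumsum(rewards) / pulls

for n, avg in zip(pulls, running_avg):
    print(f"after {n} pull(s): average = {avg:.2f}")
```

The average after two pulls (6.55) is well above the average after five (4.06), so a player who judged the machine on its first two rewards alone would have overestimated it.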

Ultimately, the Multi-Armed Bandit problem serves as a fundamental challenge in decision-making under uncertainty, with applications in various real-world scenarios, including clinical trials, recommendation systems, online advertising, and resource allocation in machine learning. Efficiently addressing this problem requires finding the right balance between exploration and exploitation to maximize rewards and resources over time.

**Key Points:**

- **Unknown Reward Distributions:** The player does not have prior knowledge of the true reward distributions associated with each machine. As a result, the player needs to explore the machines to learn their reward characteristics.
- **Stochastic Rewards:** When the player pulls the lever of a machine, they receive a reward sampled from the machine's reward distribution. These rewards are stochastic, meaning they are subject to randomness and can vary from round to round.
- **Exploration vs. Exploitation:** The player faces a trade-off between exploration and exploitation. Exploration involves trying out different machines to gather information about their rewards. Exploitation involves choosing the machine that appears to have the highest reward based on the current knowledge.
- **Balancing Strategies:** Striking the right balance between exploration and exploitation is crucial for optimizing the total reward over time. Early in the game, the player may explore more to gather data on the machines' rewards, and as more information is acquired, the player may shift toward exploiting the best machine more frequently.
- **Continuous Learning:** The player needs to continuously adapt their strategy as they gain more data and update their knowledge about the machines' reward distributions.
- **Convergence to Best Machine:** Over time, with enough exploration and exploitation, the player aims to converge on the machine that provides the highest average reward, maximizing their overall earnings.
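
One simple strategy that mixes the two behaviors is epsilon-greedy. The notebook itself does not implement it, and it is sketched here only to illustrate the trade-off. With probability `epsilon` the agent picks a random machine (exploration); otherwise it picks the machine with the highest estimated mean so far (exploitation). The function name `epsilon_greedy`, the value `epsilon=0.1`, the seed, and the number of rounds are all assumptions. The means and standard deviations are those of `slotA`, `slotB`, and `slotC`.

```python
import numpy as np

def epsilon_greedy(means, stdevs, n_rounds=2000, epsilon=0.1, seed=0):
    """Play n_rounds on Gaussian machines with an epsilon-greedy rule.

    Returns the per-machine pull counts, the estimated means,
    and the total reward collected.
    """
    rng = np.random.default_rng(seed)
    n_machines = len(means)
    counts = np.zeros(n_machines, dtype=int)  # pulls per machine
    estimates = np.zeros(n_machines)          # running sample means
    total_reward = 0.0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_machines))  # explore: random machine
        else:
            arm = int(np.argmax(estimates))      # exploit: best estimate so far
        # Unlike pull_lever(), the reward is not rounded here.
        reward = rng.normal(means[arm], stdevs[arm])
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen machine.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward

counts, estimates, total_reward = epsilon_greedy(means=[5, 6, 1], stdevs=[3, 2, 5])
print("pulls per machine:", counts)
print("estimated means:  ", np.round(estimates, 2))
print("average reward:   ", round(total_reward / counts.sum(), 2))
```

A larger `epsilon` gathers information faster but spends more pulls on weaker machines. A smaller one exploits more but can stay with a mistaken early favorite for longer.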

Efficiently addressing this problem matters wherever choosing well among uncertain options can lead to significant gains in rewards, resources, or outcomes.