# Implementing The Multi-Armed Bandit Algorithm on EV3 Mindstorm

## Introduction

In this notebook, we'll explore how to implement a Multi-Armed Bandit algorithm for an EV3 Mindstorm robot. The Multi-Armed Bandit problem is a classic problem in probability theory and statistics, often used as a simplified model of resource allocation under uncertainty. EV3 Mindstorm (LEGO Mindstorms EV3) is a programmable robot kit that provides a hands-on introduction to robotics and programming.

## Table of Contents

1. [What is a Multi-Armed Bandit?](#What-is-a-Multi-Armed-Bandit?)
2. [Importance of Multi-Armed Bandit](#Importance-of-Multi-Armed-Bandit)
3. [Drawbacks and Limitations](#Drawbacks-and-Limitations)
4. [Real-world Applications](#Real-world-Applications)
5. [Implementing on EV3 Mindstorm](#Implementing-on-EV3-Mindstorm)
6. [Exercises](#Exercises)
7. [Conclusion](#Conclusion)

## What is a Multi-Armed Bandit?

Imagine you're in a casino, faced with a row of slot machines, also known as 'one-armed bandits.' Each machine pays out according to its own reward distribution, which you do not know in advance. Your goal is to maximize your total reward over a series of rounds. This is the essence of the Multi-Armed Bandit problem. More formally, you have multiple options (the arms), and each pull of an arm returns a random reward whose average is unknown. Your strategy must balance exploration (trying arms to learn their payoffs) against exploitation (pulling the arm that currently looks best).
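One common way to make this goal precise is through regret. The notation below is an assumption added for illustration and is not part of the original notebook: there are \(K\) arms, arm \(a\) has unknown mean reward \(\mu_a\), and \(A_t\) is the arm pulled in round \(t\).

\[
\mu^* = \max_{1 \le a \le K} \mu_a, \qquad
\mathrm{Regret}(T) = T\,\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{A_t}\right]
\]

Maximizing the expected total reward over \(T\) rounds is equivalent to minimizing this regret, which measures how much was lost by not pulling the best arm every time.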

In the context of EV3 Mindstorm, think of each 'arm' as a different path or action the robot could take. Each path has an associated reward, like reaching the destination faster or avoiding obstacles more efficiently. The Multi-Armed Bandit algorithm can help the robot learn the best path to take over time.

![Multi-Armed Bandit](https://miro.medium.com/max/1400/1*2r3xYjppoUO1jqHXHcE2_Q.jpeg)

Image Source: [Medium Article by Fedor Parfenov](https://medium.com/expedia-group-tech/how-we-optimized-hero-images-on-hotels-com-using-multi-armed-bandit-algorithms-4503c2c32eae)

## Importance of Multi-Armed Bandit

Multi-Armed Bandit algorithms are not just a theoretical construct; they have practical uses in various fields. For instance, in online advertising, they can be used to decide which ad to display to maximize click-through rates. In healthcare, they can guide decisions under limited resources, such as which treatment to give to a patient to maximize the likelihood of recovery.

For our EV3 Mindstorm robot, the importance lies in efficient learning and decision-making. Traditional methods might require the robot to try all paths multiple times before deciding the best one, which is time-consuming and may wear out the hardware. The Multi-Armed Bandit algorithm allows the robot to learn more efficiently, saving both time and resources.
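To make the efficiency claim concrete, here is a minimal simulated comparison between cycling through every path in turn and an Epsilon-Greedy bandit. It is only a sketch: the mean path times mirror the simulated values used later in this notebook, and `total_time`, `round_robin`, and `epsilon_greedy` are hypothetical helper names introduced for this example.

```python
import numpy as np

def total_time(choose_path, n_rounds=200, seed=0):
    """Run a simulated experiment and return the summed travel time."""
    rng = np.random.default_rng(seed)
    true_times = np.array([10.0, 8.0, 12.0, 7.0])  # assumed mean seconds per path
    estimates = np.zeros(4)
    counts = np.zeros(4, dtype=int)
    total = 0.0
    for t in range(n_rounds):
        path = choose_path(t, estimates, rng)
        elapsed = rng.normal(true_times[path], 1.0)
        counts[path] += 1
        # Running average of the observed times for this path
        estimates[path] += (elapsed - estimates[path]) / counts[path]
        total += elapsed
    return total

def round_robin(t, estimates, rng):
    # Baseline: cycle through every path regardless of what was observed
    return t % 4

def epsilon_greedy(t, estimates, rng, epsilon=0.1):
    # Explore a random path with probability epsilon, otherwise take the
    # path with the lowest estimated time
    if rng.random() < epsilon:
        return int(rng.integers(0, 4))
    return int(np.argmin(estimates))

baseline_total = total_time(round_robin)
bandit_total = total_time(epsilon_greedy)
print(f"round-robin: {baseline_total:.0f}s, epsilon-greedy: {bandit_total:.0f}s")
```

Under these assumptions the bandit spends most rounds on the faster paths, so its total travel time comes out lower than the round-robin baseline.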

## Drawbacks and Limitations

While the Multi-Armed Bandit approach is powerful, it's not without its drawbacks. One of the main limitations is that basic strategies, such as the sample-average Epsilon-Greedy method used in this notebook, assume the reward distributions are stationary, meaning they do not change over time. In many real-world scenarios, this is not the case. For example, the best path for the EV3 Mindstorm robot may change due to external factors like new obstacles.
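One widely used way to soften the stationarity assumption is to replace the plain running average with a constant step size, so that recent observations weigh more than old ones. The sketch below compares the two updates on made-up data; the function names, the step size `alpha=0.2`, and the observation sequence are all assumptions chosen for illustration.

```python
def update_sample_average(estimate, count, observation):
    """Equal weight for every observation; suited to stationary rewards."""
    count += 1
    estimate += (observation - estimate) / count
    return estimate, count

def update_constant_step(estimate, observation, alpha=0.2):
    """Exponential recency weighting: recent observations count more."""
    return estimate + alpha * (observation - estimate)

# A path that takes 7 s for 30 runs, then 12 s once an obstacle appears
observations = [7.0] * 30 + [12.0] * 10

avg_estimate, count = 0.0, 0
step_estimate = observations[0]
for obs in observations:
    avg_estimate, count = update_sample_average(avg_estimate, count, obs)
    step_estimate = update_constant_step(step_estimate, obs)

print(f"sample average: {avg_estimate:.2f}s, constant step: {step_estimate:.2f}s")
```

After the change, the sample average is still at 8.25 s because it is held back by the 30 old runs, while the constant-step estimate has already moved above 11 s.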

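Another drawback is the cost of exploration as the number of arms increases. The bookkeeping itself is light: the running-average update used in this notebook takes constant time per round, and picking the best arm is linear in the number of arms. However, every arm has to be tried enough times to estimate it well, and on a physical robot like the EV3 Mindstorm each trial is a full run through a path, which costs time and battery.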

## Real-world Applications

The Multi-Armed Bandit algorithm has found applications in various domains:

- **Online Advertising**: To optimize the selection of ads to display.
- **Healthcare**: For personalized medicine and treatment recommendations.
- **Finance**: In trading algorithms to maximize returns.
- **Robotics**: For path optimization, as we will see with the EV3 Mindstorm.

In the context of our EV3 Mindstorm robot, the algorithm could be used in a maze-solving scenario where the robot has to find the quickest path to the exit. It could also be used in a search-and-rescue operation where the robot has to find the most efficient path to reach people in need.

## Implementing on EV3 Mindstorm

Now that we have a good understanding of the Multi-Armed Bandit algorithm and its applications, let's dive into its implementation on an EV3 Mindstorm robot.

### Problem Statement

Let's assume our EV3 Mindstorm robot is placed in a maze with multiple paths. Each path leads to the exit but with varying levels of difficulty and time required. The robot has to find the most efficient path to the exit.

### Algorithm Steps

1. Initialize estimated rewards for each path to zero.
2. For each round:
    1. Select a path based on estimated rewards (using Epsilon-Greedy or another strategy).
    2. Navigate the robot through the selected path.
    3. Observe the actual reward (time taken, obstacles encountered, etc.).
    4. Update the estimated reward for the selected path.

Let's proceed to code this algorithm.

```python
import numpy as np
import time

# Simulated rewards for each path (in seconds; lower is better)
true_rewards = [10, 8, 12, 7]

# Initialize estimated rewards
estimated_rewards = [0, 0, 0, 0]
n_trials = [0, 0, 0, 0]

# Epsilon-Greedy strategy
epsilon = 0.1

# Number of rounds
n_rounds = 20

# Simulate the rounds
for i in range(n_rounds):
    if np.random.rand() < epsilon:
        # Explore: choose a random path
        path = np.random.randint(0, 4)
    else:
        # Exploit: choose the path with the lowest estimated time
        path = np.argmin(estimated_rewards)

    # Simulate navigating the robot through the selected path
    actual_reward = np.random.normal(true_rewards[path], 1)

    # Update estimated reward for the selected path
    n_trials[path] += 1
    estimated_rewards[path] = ((n_trials[path] - 1) * estimated_rewards[path] + actual_reward) / n_trials[path]

    print(f'Round {i+1}: Selected path {path+1}, Actual time {actual_reward:.2f}s, Estimated time {estimated_rewards[path]:.2f}s')

    # Simulate a delay (e.g., for the robot to navigate)
    time.sleep(1)
```

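## Code Explanation

The code above simulates an EV3 Mindstorm robot choosing between four paths in a maze over 20 rounds. Each path has a 'true' average travel time that is unknown to the robot. Let's break down how it works:

1. **Initialization**: We set the true rewards for each path and initialize the estimated rewards to zero. Here a "reward" is the simulated time in seconds to complete a path, so lower is better. Because the simulated times are far above zero, the zero initial estimates make untried paths look best, so the exploit step tries every path before settling on one.
2. **Epsilon-Greedy Strategy**: We use an Epsilon-Greedy strategy with \(\epsilon = 0.1\). This means that 10% of the time the robot explores a path chosen uniformly at random, and 90% of the time it exploits the path with the lowest estimated time.
3. **Simulation Loop**: We run a loop for a fixed number of rounds. In each round, the robot selects a path with the Epsilon-Greedy strategy, simulates navigating it by drawing a time from a normal distribution around that path's true mean, and updates the running average for that path.

Each round prints the path chosen, the actual time taken, and the updated estimated time for that path.

The `time.sleep(1)` at the end simulates the time it would take for the robot to navigate the path. In a real-world implementation, this would be replaced by the actual navigation code.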
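The update line in the simulation loop is a running average written in weighted form. It is algebraically the same as the incremental form \(Q_n = Q_{n-1} + (x_n - Q_{n-1})/n\), where \(Q_n\) and \(x_n\) are notation introduced here for the estimate after \(n\) runs and the \(n\)-th observed time. The short check below uses made-up travel times to show that both forms agree with the ordinary mean.

```python
import numpy as np

times = [7.4, 6.8, 7.1, 6.5]  # made-up travel times in seconds

weighted = 0.0      # form used in the simulation loop
incremental = 0.0   # algebraically equivalent form
for n, x in enumerate(times, start=1):
    weighted = ((n - 1) * weighted + x) / n
    incremental += (x - incremental) / n

print(weighted, incremental, np.mean(times))
```

Either form only needs the previous estimate and the trial count, so the robot never has to store the full history of times for a path.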

## Exercises

### Exercise 1: Different Strategies

Modify the code to implement different strategies like UCB (Upper Confidence Bound) and Thompson Sampling. Compare the performance with Epsilon-Greedy.

### Exercise 2: Non-Stationary Rewards

Modify the code to simulate non-stationary rewards (i.e., rewards that change over time). How does this affect the performance of the algorithm?

### Exercise 3: Real-world Testing

Implement the algorithm on an actual EV3 Mindstorm robot and test it in a real-world scenario. Compare the results with the simulation.

## Exercise Solutions

### Solution to Exercise 1: Different Strategies

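For UCB (Upper Confidence Bound), replace the Epsilon-Greedy selection with a confidence-bound rule. Two details matter here. First, the bounds for all paths must be recomputed every round, because the \(\log\) term grows with the round number even for paths that were not selected. Second, this notebook minimizes travel time, so the exploration bonus is subtracted and the path with the smallest lower confidence bound is chosen; taking the `argmax` of time plus bonus would steer the robot toward the slowest path. Each path is also tried once at the start so that `n_trials` is never zero:

```python
import math
import numpy as np

true_rewards = [10, 8, 12, 7]  # mean times in seconds; lower is better
estimated_rewards = [0.0] * 4
n_trials = [0] * 4
n_rounds = 20

for i in range(n_rounds):
    if i < 4:
        path = i  # try every path once so n_trials is never zero
    else:
        # Lower confidence bound per path, recomputed every round
        lcb_values = [
            estimated_rewards[p] - math.sqrt(2 * math.log(i + 1) / n_trials[p])
            for p in range(4)
        ]
        path = int(np.argmin(lcb_values))

    # After observing the reward, update the running average
    actual_reward = np.random.normal(true_rewards[path], 1)
    n_trials[path] += 1
    estimated_rewards[path] += (actual_reward - estimated_rewards[path]) / n_trials[path]
    print(f'Round {i+1}: Selected path {path+1}, Actual time {actual_reward:.2f}s')
```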

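For Thompson Sampling, you would maintain a posterior distribution over each path's expected outcome, sample once from each posterior, and select the path with the best sample. A Beta distribution fits when the outcome of a run is binary (for example, success or failure); for continuous travel times like the ones simulated here, a normal posterior over each path's mean time is a more natural fit.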
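As a rough illustration of the Beta-based variant, the sketch below turns each run into a binary outcome by calling it a success when the simulated time beats a cutoff. The cutoff `threshold = 9.0`, the uniform Beta(1, 1) prior, and the 500 rounds are assumptions made for this example, not values from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
true_times = np.array([10.0, 8.0, 12.0, 7.0])  # assumed mean seconds per path
threshold = 9.0   # hypothetical cutoff: a run is a success if faster than this
n_paths = len(true_times)

# Beta(1, 1) prior on each path's success probability
successes = np.ones(n_paths)
failures = np.ones(n_paths)
counts = np.zeros(n_paths, dtype=int)

for _ in range(500):
    # Draw one plausible success rate per path and follow the best draw
    samples = rng.beta(successes, failures)
    path = int(np.argmax(samples))

    # Simulate the run and record whether it beat the cutoff
    elapsed = rng.normal(true_times[path], 1.0)
    if elapsed < threshold:
        successes[path] += 1
    else:
        failures[path] += 1
    counts[path] += 1

print("runs per path:", counts)
```

Exploration happens through the randomness of the posterior samples, so no separate epsilon parameter is needed. The trade-off of the binary encoding is that it discards how much faster one path is than another.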

### Solution to Exercise 2: Non-Stationary Rewards

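To simulate non-stationary rewards, recompute `true_rewards` inside the simulation loop with a time-dependent factor, where `i` is the round index. For example:

```python
# Inside the loop: the first path's mean time drifts with the round index i
true_rewards = [10 + np.sin(i/10), 8, 12, 7]
```

Another option is to add a small amount of random noise to the true rewards each round, so that they drift as a random walk. Note that the sine drift above never changes which path is fastest; to see the algorithm struggle, let the drift change which path is best.

To adapt to non-stationary rewards, you can use a sliding window to update the estimated rewards, averaging only the most recent observations for each path so that old measurements stop influencing the estimate.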
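A sliding-window estimate can be sketched with a bounded `deque`. The class name `SlidingWindowEstimate` and the window size of 5 are hypothetical choices for illustration; one instance would be kept per path.

```python
from collections import deque

class SlidingWindowEstimate:
    """Average of the most recent `window` observations for one path."""

    def __init__(self, window=5):
        self.recent = deque(maxlen=window)  # old entries drop off automatically

    def update(self, observation):
        self.recent.append(observation)

    @property
    def value(self):
        # Untried paths report 0.0, matching the notebook's zero initialization
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

estimate = SlidingWindowEstimate(window=5)
# Ten runs at 7 s, then five runs at 12 s after the environment changes
for elapsed in [7.0] * 10 + [12.0] * 5:
    estimate.update(elapsed)

print(estimate.value)
```

A smaller window reacts faster to change but gives noisier estimates, so the window size is a tuning choice.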

### Solution to Exercise 3: Real-world Testing

For real-world testing, you would replace the simulated `actual_reward` with the time the EV3 Mindstorm robot actually takes to navigate the selected path, using a Python API for the EV3 to drive the robot and a timer to measure each run. The rest of the algorithm remains the same.
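The timing part can be sketched without committing to a specific robot library. In the example below, `timed_run` is a hypothetical helper and `fake_drive_path` is a stand-in that only sleeps; on a real robot it would be replaced by a function that issues the motor commands for the chosen path.

```python
import time

def timed_run(drive_path, path):
    """Run `drive_path(path)` and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    drive_path(path)
    return time.perf_counter() - start

def fake_drive_path(path):
    # Stand-in for real motor commands so the sketch runs without a robot
    time.sleep(0.01 * (path + 1))

actual_reward = timed_run(fake_drive_path, path=2)
print(f"Path 3 took {actual_reward:.3f}s")
```

The returned value can be fed into the same running-average update used in the simulation, since that update only needs the measured time.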

## Conclusion

In this notebook, we explored the Multi-Armed Bandit algorithm and simulated its use for an EV3 Mindstorm robot choosing among paths in a maze. We discussed the importance, drawbacks, and real-world applications of the algorithm. We also provided exercises and their solutions to deepen your understanding and encourage further exploration.

The Multi-Armed Bandit algorithm offers a robust and efficient way for the EV3 Mindstorm robot to learn and make decisions. While the notebook provides a simulated environment, the real test would be to implement it on an actual robot. So go ahead, take your EV3 Mindstorm robot for a spin and let it learn the ways of the Multi-Armed Bandit!