# Machine Learning SoSe21 Practice Class

Dr. Timo Baumann, Dr. Özge Alaçam, Björn Sygo <br>
Email: baumann@informatik.uni-hamburg.de, alacam@informatik.uni-hamburg.de, 6sygo@informatik.uni-hamburg.de

## Exercise 7
**Description:** Implement and simulate a multi-armed bandit  <br>
**Deadline:** Saturday, 03. July 2021, 23:59 <br>
**Working together:** You can work in pairs or triples but no larger teams are allowed. <br>
&emsp;&emsp;&emsp; &emsp; &emsp; &emsp; &emsp; Please adhere to the honor code discussed in class. <br>
&emsp;&emsp;&emsp; &emsp; &emsp; &emsp; &emsp; All members of the team must get involved in understanding and coding the solution.

## Submission: 
**Put your names here**

*Also put high-level comments that should be read before looking at your code and results.*

### Goal
In this exercise, you will implement basic RL techniques for the multi-armed bandit problem. Unlike in the previous tasks, this one does not require _data_ but a specification of a _World_, an _Agent_ and a simulation environment that supervises the performance of the agent in the world and measures its success. In other words: you first need to design your experimentation environment.

### Design the interfaces of your simulation environment
You want to separate the agent and the world and specify the interfaces of the agent and the world that it interacts with. You will then supervise the interaction in your simulation environment.

**Task 1** (15%):
To this end, define the interfaces for:
 * a k-armed bandit world (each arm's reward is normally distributed with $\sigma=1$ and each mean is chosen normally with $\mu=1$ and $\sigma=1$), 
 * a k-armed bandit agent that plays in the world and is able to be informed about the rewards for its actions.

Note that it may be easier in Python to specify classes (with minimal behaviour) rather than interfaces. For example, your minimal agent could simply pull one of the arms at random.

**Task 2** (10%): Define a simulation environment implementation that orchestrates the interplay between the agent and the world for a given number of action-reward rounds while keeping track of the relevant performance metrics that later need to be visualized (see below).

Results of individual runs of the simulation will be very noisy. Therefore, define a way to repeatedly play the simulation (say, 2000 times) and average the performance metrics across these episodes.

Perform your experiments with a $k=10$-armed bandit and episodes of $N=1000$ action-reward rounds, but make sure that your implementation also works for different $k$ and $N$.

Note: analyse the requirements of the simulation environment with respect to the tasks specified below to ensure that you do all that is necessary (and not too much else).
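
The repeated-episode averaging described above could be sketched as follows. This is a minimal illustration rather than a reference solution: `play_episode` and `average_over_episodes` are hypothetical helper names, and the agent is the trivial one that pulls arms uniformly at random.

```python
import random


def play_episode(k, n, rng):
    """Play one episode of n rounds with a uniformly random agent (illustrative).

    Returns the per-round rewards and per-round 0/1 flags for 'best arm chosen'.
    """
    # Arm means are drawn from N(1, 1), as specified in Task 1.
    means = [rng.gauss(1.0, 1.0) for _ in range(k)]
    best_arm = max(range(k), key=lambda a: means[a])
    rewards, best_flags = [], []
    for _ in range(n):
        arm = rng.randrange(k)
        rewards.append(rng.gauss(means[arm], 1.0))
        best_flags.append(1.0 if arm == best_arm else 0.0)
    return rewards, best_flags


def average_over_episodes(k, n, episodes, seed=0):
    """Average both metrics round by round across independent episodes."""
    rng = random.Random(seed)
    avg_reward = [0.0] * n
    avg_best = [0.0] * n
    for _ in range(episodes):
        rewards, best_flags = play_episode(k, n, rng)
        for t in range(n):
            avg_reward[t] += rewards[t] / episodes
            avg_best[t] += best_flags[t] / episodes
    return avg_reward, avg_best
```

Calling `average_over_episodes(k=10, n=1000, episodes=2000)` matches the setting in the task; for the random agent the averaged best-action proportion should stay close to $1/k$ in every round.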


In [0]:
import random

class BaseWorld:
    def __init__(self, k):
        self.arms = [random.gauss(mu=1, sigma=1) for _ in range(k)]
    
    def get_reward(self, arm):
        return random.gauss(mu=self.arms[arm], sigma=1)

    def nr_arms(self):
        return len(self.arms)

    def after_iteration(self):
        pass

class BaseAgent:
    def __init__(self, world):
        self.world = world
        self.possible_actions = world.nr_arms()

    def initialize(self):
        self.weights = [1 / self.possible_actions] * self.possible_actions

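    def perform_action(self):
        # random.choices (not random.choice) accepts weights; it returns a list, so take its single element.
        action = random.choices(range(self.possible_actions), weights=self.weights, k=1)[0]
        reward = self.world.get_reward(action)
        self.reward = reward
        # With uniform weights, max(self.weights) = 1/k is the probability of picking the best arm.
        return reward, max(self.weights)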

    def info(self):
        return "Random Choice Agent"

class Simulation:
    def __init__(self, agent, world, k, n):
        self.world = world(k)
        self.agent = agent(self.world)
        self.epochs = n

    def run_sim(self):
        self.rewards = []
        self.proportions = []
        self.agent.initialize()
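        for e in range(self.epochs):
            reward, best_action_proportion = self.agent.perform_action()
            self.rewards.append(reward)
            self.proportions.append(best_action_proportion)
            # Let the world update itself after each round (a no-op for BaseWorld).
            self.world.after_iteration()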
        



### Visualization
**Task 3** (10%): Plot the performance of an agent over time (i.e., the action-reward rounds) in terms of reward achieved and proportion of best action chosen. Test the visualization by running your trivial agent (which just pulls an arm at random): you should not notice any improvements over time.

Note: it will be convenient if your visualization functionality can plot results from multiple different agents/agent runs to simplify comparison.
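
One possible shape for such a comparison plot is sketched below. It assumes matplotlib is installed; `check_results` and `plot_runs` are hypothetical names, and the input format (a dict mapping a label to a pair of per-round series) is an assumption made for this illustration.

```python
def check_results(results):
    """Return the common number of rounds; raise ValueError if series lengths differ."""
    lengths = set()
    for avg_reward, best_prop in results.values():
        lengths.add(len(avg_reward))
        lengths.add(len(best_prop))
    if len(lengths) != 1:
        raise ValueError("all series must cover the same number of rounds")
    return lengths.pop()


def plot_runs(results, path=None):
    """Plot averaged metrics for several agents in two stacked panels.

    `results` maps a label to a pair (avg_reward, best_action_proportion),
    each a sequence with one value per action-reward round.
    """
    # Imported here so that check_results stays usable without matplotlib.
    import matplotlib.pyplot as plt

    n_rounds = check_results(results)
    rounds = range(1, n_rounds + 1)
    fig, (ax_reward, ax_best) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
    for label, (avg_reward, best_prop) in results.items():
        ax_reward.plot(rounds, avg_reward, label=label)
        ax_best.plot(rounds, best_prop, label=label)
    ax_reward.set_ylabel("average reward")
    ax_best.set_ylabel("proportion of best action")
    ax_best.set_xlabel("action-reward round")
    ax_best.set_ylim(0.0, 1.0)
    ax_reward.legend()
    if path is None:
        plt.show()
    else:
        fig.savefig(path)
    return fig
```

Passing all agents in a single dict keeps every curve on the same axes, which makes differences between agents visible at a glance.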



### Implement Reinforcement Learning

**Task 4** (35%): Implement k-armed bandit agents for your environment. In particular, implement an $\varepsilon$-greedy agent with fixed $\varepsilon$ (experiment with $\varepsilon \in \{0, 0.01, 0.1, 1\}$) and one with UCB action selection. Your basic agents may make use of the length of the episode $N$.
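
The two action-selection rules named above can be sketched as plain functions, following the formulation in Sutton & Barto Ch. 2. The function names are hypothetical, and the exploration constant `c=2.0` is an assumed default, not a value prescribed by this exercise.

```python
import math
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so that equal estimates do not always favour arm 0.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def ucb_action(q_values, counts, t, c=2.0):
    """UCB selection: argmax of Q(a) + c * sqrt(ln(t) / N(a)), with t >= 1 the current round."""
    # An arm that was never pulled is treated as maximally uncertain and tried first.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

With `epsilon=0` the first function is purely greedy and with `epsilon=1` it is the random agent, which covers both extremes of the requested range.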

Furthermore, build some of the variations of $\varepsilon$-greedy agents: optimistic initialization and allowing for arbitrarily long episodes (i.e., not using incremental Q computation).
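
As a hedged illustration of these variations, the sketch below keeps one estimate and one counter per arm, so memory does not grow with the episode length, and takes the starting estimate as a parameter. It uses the sample-average update $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$ from Sutton & Barto Ch. 2; the class name `ValueEstimates` and the example value `initial_q=5.0` are assumptions made here.

```python
class ValueEstimates:
    """Per-arm action-value estimates with constant memory (illustrative)."""

    def __init__(self, k, initial_q=0.0):
        # An initial_q above the typical rewards gives optimistic initialization.
        self.q = [initial_q] * k
        self.counts = [0] * k

    def update(self, arm, reward):
        """Sample-average update: Q <- Q + (R - Q) / n."""
        self.counts[arm] += 1
        self.q[arm] += (reward - self.q[arm]) / self.counts[arm]


estimates = ValueEstimates(k=10, initial_q=5.0)
estimates.update(arm=3, reward=1.2)
```

Note that with the pure sample average the first pull of an arm replaces its optimistic value entirely (the step size is $1/1$), so the optimism only forces each arm to be tried once; a constant step size would let it decay more slowly.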

In your implementation, try to build an abstraction hierarchy that avoids re-writing code shared by multiple agent types.

**Task 5** (10%): Simulate learning over 2000 episodes and visualize the results. Discuss your findings for the different settings and the various learning strategies that you have implemented.



### Reward drift
**Task 6** (20%): Implement a world in which rewards of each action change gradually. Implement an agent that is suitable for such a world and compare its behaviour against the standard $\varepsilon$-greedy agent. 
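
One way to model gradual change is a random walk on the arm means, paired with a constant step-size update that weights recent rewards more heavily. The sketch below is only an illustration under these assumptions: `DriftingWorld` and `constant_step_update` are hypothetical names, the method names mirror `get_reward` and `after_iteration` from the code cell above, and `drift_sigma=0.01` and `alpha=0.1` are assumed defaults.

```python
import random


class DriftingWorld:
    """k-armed bandit whose arm means take a small random-walk step each round."""

    def __init__(self, k, drift_sigma=0.01, rng=None):
        self.rng = rng or random.Random()
        self.drift_sigma = drift_sigma
        self.means = [self.rng.gauss(1.0, 1.0) for _ in range(k)]

    def get_reward(self, arm):
        return self.rng.gauss(self.means[arm], 1.0)

    def after_iteration(self):
        # Gradual change: add independent N(0, drift_sigma^2) noise to every mean.
        self.means = [m + self.rng.gauss(0.0, self.drift_sigma) for m in self.means]


def constant_step_update(q, reward, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    return q + alpha * (reward - q)
```

Because the step size stays fixed instead of shrinking with $1/n$, old rewards are forgotten exponentially, which lets the estimate follow a moving mean.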

### Hint

You can find more detailed information on this topic in Sutton & Barto, Ch. 2, which is uploaded on Moodle.

### Report Submission

Prepare a report of your solution as a commented Jupyter notebook (using markdown for your results and comments); include figures and results.
If you must, you can also upload a PDF document with the report annexed with your Python code.

Upload your report file to the Machine Learning Moodle Course page. Please make sure that your submission team corresponds to the team's Moodle group that you're in.