# "Curiosity" Homework



## Section A - "Choose your fighter"

In this homework, we will be exploring (pun intended) some more agents for the multi-armed bandit task. This assignment will also be a bit more open-ended. Up until now, we have always specified which agents to run, but now you will be tasked with "designing" your own agents. Not from scratch of course, but from the set of building blocks (Actors and Critics) which we established in Lab 4 and Lab 5.

So far we have seen 5 different Actors, which use the estimated values for the different arms to return a selection according to different strategies.

- DeterministicActor
- BoundedRandomActor (parameterized by bound)
- BoundedSequentialActor (parameterized by bound)
- EpsilonActor (parameterized by epsilon)
- SoftmaxActor (parameterized by beta)
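
As a reminder of how the two stochastic strategies differ, here is a minimal sketch in plain NumPy. It is not the explorationlib implementation: the function names `epsilon_greedy`, `softmax_probs`, and `softmax_choice` are hypothetical, and the sketch assumes that `beta` acts as an inverse temperature (larger `beta` means greedier choices).

```python
import numpy as np


def epsilon_greedy(values, epsilon, rng):
    """With probability epsilon pick a uniformly random arm, else the best arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))
    return int(np.argmax(values))


def softmax_probs(values, beta):
    """Turn arm values into choice probabilities (beta assumed to be an inverse temperature)."""
    values = np.asarray(values, dtype=float)
    logits = beta * (values - values.max())  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def softmax_choice(values, beta, rng):
    """Sample an arm in proportion to its softmax probability."""
    probs = softmax_probs(values, beta)
    return int(rng.choice(len(probs), p=probs))


rng = np.random.default_rng(0)
values = [0.1, 0.9, 0.3]  # hypothetical value estimates for three arms
print(epsilon_greedy(values, epsilon=0.1, rng=rng))
print(softmax_probs(values, beta=5.0))
```

With `epsilon=0` the first function is purely greedy, and with `beta=0` the second assigns equal probability to every arm, which is a useful sanity check when picking a tuning range.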

We have also seen 3 different Critics, which are responsible for returning the "value" of each arm by combining extrinsic and (possibly) intrinsic values.

- Critic
- CriticUCB
- CriticNovelty
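
The idea of combining extrinsic and intrinsic value can be sketched as follows. This is an illustration only: the helper names are hypothetical, the UCB term uses the textbook UCB1 form, and the novelty term is a simple first-visit bonus. The exact formulas and defaults inside `CriticUCB` and `CriticNovelty` may differ.

```python
import math


def ucb_bonus(total_pulls, arm_pulls, weight=1.0):
    """Textbook UCB1-style bonus (assumed form): large for rarely pulled arms."""
    if arm_pulls == 0:
        return float("inf")  # an untried arm is maximally uncertain
    return weight * math.sqrt(2.0 * math.log(total_pulls) / arm_pulls)


def novelty_bonus(arm_pulls, bonus=1.0):
    """Fixed bonus the first time an arm is considered, zero afterwards (assumed form)."""
    return bonus if arm_pulls == 0 else 0.0


def combined_value(extrinsic, intrinsic):
    """Total value handed to the Actor: reward estimate plus exploration bonus."""
    return extrinsic + intrinsic


# An arm pulled 2 times out of 100 gets a much larger bonus than one pulled 50 times.
print(combined_value(0.2, ucb_bonus(100, 2)))
print(combined_value(0.3, ucb_bonus(100, 50)))
```

Because the bonus is added before the Actor sees the values, the same Actor can behave quite differently depending on which Critic it is paired with.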

If you do the math, there are 15 possible combinations. Between Labs 4 and 5, we have used 7:

- DeterministicActor / Critic
- BoundedRandomActor / Critic
- BoundedSequentialActor / Critic
- EpsilonActor / Critic
- SoftmaxActor / Critic
- DeterministicActor / CriticUCB
- DeterministicActor / CriticNovelty
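
If it helps to see which pairings are still available, the short sketch below enumerates them. It only manipulates the class names as strings, so it does not need explorationlib to run.

```python
from itertools import product

actors = [
    "DeterministicActor",
    "BoundedRandomActor",
    "BoundedSequentialActor",
    "EpsilonActor",
    "SoftmaxActor",
]
critics = ["Critic", "CriticUCB", "CriticNovelty"]

# Every Actor/Critic pairing: 5 actors x 3 critics.
all_pairs = set(product(actors, critics))

# The pairings already used in Labs 4 and 5.
used_pairs = {
    ("DeterministicActor", "Critic"),
    ("BoundedRandomActor", "Critic"),
    ("BoundedSequentialActor", "Critic"),
    ("EpsilonActor", "Critic"),
    ("SoftmaxActor", "Critic"),
    ("DeterministicActor", "CriticUCB"),
    ("DeterministicActor", "CriticNovelty"),
}

# Whatever is left is a candidate "new" agent.
new_pairs = sorted(all_pairs - used_pairs)
for actor, critic in new_pairs:
    print(f"{actor} / {critic}")
```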

Our main question: which combinations of Actor and Critic work best?

To help answer this question, we will ask you to select three new combinations to test out. You will be asked to tune your selected agents and run them in different environments.

### Question 1 [5 pts]

Create a list of three *new* Actor/Critic pairings, which will serve as your agents for the remainder of this assignment.

There's a second restriction we are adding: you can't use the same critic for all three agents.

Write your answer here.

## Section B - Notebook setup [5 pts]

This lab uses the `DeceptiveBanditOneHigh10` environment (along with its `DeceptiveBanditEnv` parent class), both of which have been newly ported to explorationlib. You will therefore have to update your personal copy of `local_gym.py` to include these two classes (which can be found in the clappm/explorationlib repo).

Install explorationlib, import the agents/critics/environments, and configure the notebook.

In [2]:
# your code cells here

## Section C - Four-arm bandits [40 pts total]



We will first consider the 4-arm environment we've used several times before.

### Creating the training environment

It's always good practice to test your agents on a different environment than the one on which they were trained.  For now we will create a training environment.

In [None]:
# don't touch
# Shared env params
seed = 412
num_steps = 400

# Create env
env = BanditUniform4(p_min=0.1, p_max=0.3, p_best=0.35)
env.seed(seed)

# Plot env
plot_bandit(env, alpha=0.6)

### Question 2 [10 pts]

How do you expect your agents to perform relative to each other, in a 4-arm environment after they have been tuned? Which one will do best and which one will do worst?  Please explain your ranking, *considering the functionality and contributions of both the Actor and Critic components of each agent*. To help justify your hypothesis, it will be useful to briefly reference previous simulations and results (Labs 4/5 and HW 4).

Write your answer here.

### Tuning Agent 1 for 4-armed bandit [5 pts]

Each of your agents should have 1 tunable parameter. The name and functionality of this parameter depend on which Actors you selected. For this homework, we are not going to be tuning the Critic in any way; we will simply be using its default parameters. The examples from Lab 5 should show you how to create an agent from a combination of Actor and Critic.

First tune the parameter of Agent 1, whatever it may be, using the training environment we've established. Show your different simulation batches and plots in different cells so that we may see your work.  The exact process by which you tune your agents is up to you.

We understand that tuning can be tedious... we are not asking for perfection. We don't have an answer key for parameter values. The goal is just for you to find parameters that are *good enough* so that the comparison between agents can be considered fair.
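
One possible shape for a tuning loop is a simple sweep: run a batch of episodes for each candidate value and compare the average total reward. The sketch below uses a stand-in Bernoulli bandit written in plain NumPy rather than explorationlib, with an epsilon-greedy agent and a running-mean value estimate. The arm probabilities, candidate values, and function names are all hypothetical, so treat it as a pattern and not as the required procedure.

```python
import numpy as np


def run_epsilon_greedy(probs, epsilon, num_steps, rng):
    """Play one episode on a Bernoulli bandit and return the total reward."""
    num_arms = len(probs)
    values = np.zeros(num_arms)  # running mean reward per arm
    counts = np.zeros(num_arms)  # number of pulls per arm
    total = 0.0
    for _ in range(num_steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(num_arms))  # explore
        else:
            arm = int(np.argmax(values))  # exploit
        reward = float(rng.random() < probs[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total


def sweep(probs, candidates, num_steps=400, num_runs=50, seed=0):
    """Average total reward over num_runs episodes for each candidate value."""
    rng = np.random.default_rng(seed)
    return {
        eps: float(
            np.mean(
                [run_epsilon_greedy(probs, eps, num_steps, rng) for _ in range(num_runs)]
            )
        )
        for eps in candidates
    }


probs = [0.1, 0.3, 0.35, 0.2]  # hypothetical payout probabilities, not the real env
candidates = [0.01, 0.05, 0.1, 0.2, 0.4]
results = sweep(probs, candidates)
best = max(results, key=results.get)
for eps, score in results.items():
    print(f"epsilon={eps}: mean total reward {score:.1f}")
print("best candidate:", best)
```

A coarse sweep like this, followed by a finer one around the best candidate, is usually enough to reach a *good enough* value.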

In [3]:
# your code cells here

What parameter value did you settle on for Agent 1?

Write your answer here.

### Tuning Agent 2 for 4-armed bandit [5 pts]

In [4]:
# your code cells here

What parameter value did you settle on for Agent 2?

Write your answer here.

### Tuning Agent 3 for 4-armed bandit [5 pts]

In [None]:
# your code cells here

What parameter value did you settle on for Agent 3?

Write your answer here.

### Creating a testing environment

In [None]:
# don't touch
# Shared env params
seed = 15213
num_steps = 400

# Create env
env = BanditUniform4(p_min=0.1, p_max=0.3, p_best=0.35)
env.seed(seed)

# Plot env
plot_bandit(env, alpha=0.6)

### Run 400 experiments and plot the average rewards for the 3 agents [10 pts]

In [None]:
# your code cells here

### Question 3 [5 pts]

Did your results match what you predicted in Question 2? If not, do you have any ideas as to why?

Write your answer here.

## Section D - Deceptive bandits [50 pts total]

We will now consider the same types of agents placed into a different type of environment: the deceptive bandit.

### Creating the training environment

We are going to retune our agents, but we're not going to tune them against the deceptive bandit (or else it wouldn't be very deceptive, would it?).  Instead we are going to tune them against the non-deceptive 10-arm bandit from lab, which is identical to the deceptive bandit in terms of arm values but without the deception.

In [None]:
# don't touch
# Shared env params
seed = 503
num_steps = 400

# Create env
env = BanditUniform10(p_min=0.2, p_max=0.2, p_best=0.8)
env.seed(seed)

# Plot env
plot_bandit(env, alpha=0.6)

### Question 4 [10 pts]

How well do you expect your agents to perform in the deceptive 10-arm environment, after being tuned in the non-deceptive environment?  Which one will score the highest and which one will score the lowest?

Similar to when you made a hypothesis in Section C, explain your answer fully.  Base it on your understanding of the properties of the actors and critics, as well as the results in lab. Consider the possible weaknesses of the agents, such as how certain parameter values might allow the agents to perform better in training but also cause the agents to be more easily deceived.

Write your answer here.

### Tuning Agent 1 for 10-armed bandit [5 pts]

The process for tuning agents here should be roughly the same as in Section C.

In [None]:
# your code cells here

What parameter value did you settle on for Agent 1?

Write your answer here.

### Tuning Agent 2 for 10-armed bandit [5 pts]

In [None]:
# your code cells here

What parameter value did you settle on for Agent 2?

Write your answer here.

### Tuning Agent 3 for 10-armed bandit [5 pts]

In [None]:
# your code cells here

What parameter value did you settle on for Agent 3?

Write your answer here.

### Creating a testing (deceptive) environment

In [None]:
# don't touch
# Shared env params
seed = 15213
num_steps = 400

# Create env
env = DeceptiveBanditOneHigh10()
env.seed(seed)

# Plot env
plot_bandit(env, alpha=0.6)

### Run 400 experiments and plot the average rewards for the 3 agents [10 pts]

In [None]:
# your code cells here

### Question 5 [5 pts]

Did your results match what you predicted in Question 4? If not, do you have any ideas as to why?

Write your answer here.

### Question 6 [10 pts]

Time for some conclusions. Was there a clear winner among your three selected agents?  Was there one which performed the best against both non-deceptive and deceptive bandits?  Or did different agents perform better or worse in different scenarios?  Which agent would you pick as your favorite?

## Submission

**DUE:** 5pm EST, Nov 30, 2021. Send the link to the completed notebook in your GitHub repository to the TA and me via Canvas.

**IMPORTANT** Did you collaborate with anyone on this assignment? If so, list their names here. 
> *Someone's Name*