# **Homework 4: Foraging and reinforcement learning**

## Getting started

This homework will involve concepts from the labs we've gone over in class. Feel free to reference them as you complete the assignment.

This homework contains 2 sections:
1. Investigate how random environment initialization affects foraging agents with different strategies in patchy environments.
1. Investigate various actor-critic agents in a new type of dynamic bandit task, one where one arm becomes *more* rewarding partway through each experiment.

Fill out the code cells below and answer the questions to complete the assignment. Most of the programming is quite straightforward, as it is all based on code from the labs, which you can use/modify in this notebook.

---
## Section 1 - Foraging [57 pt]

In Lab 7, you investigated how random search, chemotaxis, and infotaxis agents behaved in a "patchy" foraging environment. We did not get to test the effects of random initializations in class. In this section of the homework you will carry out that analysis.

Following the metaphor of environment patches as bushes, different random seeds determine where the bushes grow.
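To make the role of the seed concrete, here is a minimal sketch that uses only Python's standard library. The function `place_patch_centers` and its parameters are hypothetical and are not part of explorationlib; the sketch only illustrates that a fixed seed reproduces a layout exactly, while a different seed generally "grows the bushes" somewhere else.

```python
import random


def place_patch_centers(seed, n_patches=4, boundary=10):
    """Draw hypothetical patch centers on a square grid from a seeded RNG."""
    rng = random.Random(seed)  # private generator, so the seed fully sets the layout
    return [
        (rng.randint(-boundary, boundary), rng.randint(-boundary, boundary))
        for _ in range(n_patches)
    ]


layout_a = place_patch_centers(seed=1)
layout_a_again = place_patch_centers(seed=1)  # same seed: identical layout
layout_b = place_patch_centers(seed=2)        # new seed: a new draw of centers

print("seed 1:", layout_a)
print("seed 1 again:", layout_a_again)
print("seed 2:", layout_b)
```

An agent that happens to start near a patch under one seed may start far from every patch under another, which is why a single layout is a weak basis for comparing strategies.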

### Question 1.1 [6 pt]
Why is it important to check multiple random seeds when comparing foraging strategies in patchy environments?

In [None]:
# Write your answer here, as a Python comment. Explain yourself.

### In the code cells below, run and fill in code as needed according to the text instructions before each one. Feel free to refer to lab 7 for help.

Change to the directory where we will clone the specific branch of the explorationlib code library.

In [None]:
cd /content

Clone the `target-patch-dev` explorationlib branch (the branch that has our new patchy environment functions).

In [None]:
!git clone -b target-patch-dev https://github.com/coaxlab/explorationlib

Change into the cloned `explorationlib` directory.

In [None]:
cd /content/explorationlib

Install some other supporting code libraries, like gym-maze, which some explorationlib simulated environment code relies on.

In [None]:
!pip install --upgrade git+https://github.com/MattChanTK/gym-maze.git
!pip install celluloid # for the gifs

Import specific modules from the libraries we loaded. We'll use these modules to create and plot environments, run experiments with different exploration agents in these environments, visualize their behaviors, and evaluate their performance according to various metrics.

In [None]:
# Import misc
import shutil
import glob
import os
import copy
import sys

# Vis - 1
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Exp
from explorationlib.run import experiment
from explorationlib.util import select_exp
from explorationlib.util import load
from explorationlib.util import save

# Agents
from explorationlib.agent import DiffusionGrid
from explorationlib.agent import DiffusionDiscrete
from explorationlib.agent import GradientDiffusionGrid
from explorationlib.agent import GradientDiffusionDiscrete
from explorationlib.agent import AccumulatorGradientGrid
from explorationlib.agent import AccumulatorInfoGrid
from explorationlib.agent import TruncatedLevyDiscrete

# Env
from explorationlib.local_gym import ScentGrid
from explorationlib.local_gym import create_grid_scent
from explorationlib.local_gym import create_grid_scent_patches
from explorationlib.local_gym import uniform_targets
from explorationlib.local_gym import uniform_patch_targets
from explorationlib.local_gym import constant_values

# Vis - 2
from explorationlib.plot import plot_position2d
from explorationlib.plot import plot_length_hist
from explorationlib.plot import plot_length
from explorationlib.plot import plot_targets2d
from explorationlib.plot import plot_scent_grid

# Score
from explorationlib.score import total_reward
from explorationlib.score import num_death
from explorationlib.score import on_off_patch_time

### Create a new patchy environment [5 pt]

In the code block below, set up a new patchy environment like the one in our foraging lab, with the following settings:
- Use 4 patches of 15 targets each.
- Give each patch a radius of 3.
- Set the random seed to 1257.

In [None]:
# Your code here

### Visualize the patchy environment [3 pt]
In the code cell below, make a plot of the patchy environment you just made.

In [None]:
# Your code here

### Create the agents [3 pt]
In the code cell below, create a random search, chemotaxis, and infotaxis agent like we did in Lab 7.

In [None]:
# Your code here

### Run the experiments [5 pt]
In the code cell below, run 50 experiments of 400 steps each for each of the agents. Note - you may have set the number of experiments and steps earlier during your environment setup code.

In [None]:
# Your code here

### Visualize proportion of time spent on patches [4 pt]
In the code cell below:
- Plot bar plots with error bars for the proportion of time spent on patches for each agent.
- Plot a histogram for the proportion of time spent on patches for each agent.
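As a reminder of what the error bars summarize, the sketch below computes a mean and a standard error of the mean (SEM) per agent using only the standard library. The agent names and score values are made-up placeholders, not results from any experiment, and `mean_and_sem` is a hypothetical helper rather than an explorationlib function. SEM is only one possible choice of error bar; a standard deviation or confidence interval would also be reasonable.

```python
import math
import statistics


def mean_and_sem(scores):
    """Return the mean and the standard error of the mean for a list of scores."""
    mean = statistics.mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Placeholder per-experiment scores (illustrative only, NOT real results).
placeholder_scores = {
    "agent_a": [0.10, 0.20, 0.15, 0.05],
    "agent_b": [0.40, 0.35, 0.50, 0.45],
}

summary = {name: mean_and_sem(vals) for name, vals in placeholder_scores.items()}
for name, (mean, sem) in summary.items():
    print(f"{name}: mean={mean:.3f}, sem={sem:.3f}")
```

The resulting means and SEMs could then be passed to a bar plot, for example as the heights and `yerr` values of `plt.bar`.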

In [None]:
# Your code here

### Visualize total reward [4 pt]
In the code cell below:
- Plot bar plots with error bars for the total reward for each agent.
- Plot a histogram of total reward for each agent.

In [None]:
# Your code here

### Visualize agent deaths [4 pt]
In the code cell below, plot a bar plot of the number of deaths for each agent type.

In [None]:
# Your code here

### Question 1.2 [7 pt]
Describe the performance of each agent type according to each of the metrics (on-patch proportion, total reward, deaths). Why do you think this pattern of performance occurred?

In [None]:
# Write your answer here, as a Python comment. Explain yourself.

### Question 1.3.1 [8 pt]
Re-run your simulations above, but change the seed value for the random number generator. Do this four different times, once each with the following values: 2257, 3257, 4257, 5257. 

What do you see in each performance metric of the agents with each new seed value (which specifies different unique environments)?

In [None]:
# Write your answers here, as Python comments.

# --For seed 2257:--


# --For seed 3257:--


# --For seed 4257:--


# --For seed 5257:--


### Question 1.3.2 [8 pt]
What do your results from Question 1.3.1 tell you about the difference between the Info and Chemo agents in particular, in environments of this type?

In [None]:
# Write your answer here, as a Python comment. Explain yourself.

---
## Section 2 - Reinforcement learning [43 pt]

In the last part of Lab 9, you investigated the performance of different reinforcement learning agents in a changing bandit task, where an arm that used to give the most reward suddenly dropped in reward probability.

In this section of the homework, you will build and test reinforcement learning agents in a different changing bandit task - one where an arm that gave zero reward for most of the experiment changes to being rewarding at a very high probability near the end of each experiment.

Import necessary modules

In [None]:
import shutil
import glob
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import explorationlib

from explorationlib.local_gym import BanditUniform4
from explorationlib.local_gym import BanditChange4
from explorationlib.agent import BanditActorCritic
from explorationlib.agent import Critic
from explorationlib.agent import CriticUCB
from explorationlib.agent import CriticNovelty
from explorationlib.agent import EpsilonActor
from explorationlib.agent import RandomActor
from explorationlib.agent import SequentialActor
from explorationlib.agent import SoftmaxActor
from explorationlib.agent import BoundedRandomActor
from explorationlib.agent import BoundedSequentialActor
from explorationlib.agent import DeterministicActor

from explorationlib.run import experiment
from explorationlib.score import total_reward
from explorationlib.score import action_entropy
from explorationlib.util import select_exp
from explorationlib.util import load
from explorationlib.util import save

from explorationlib.plot import plot_bandit
from explorationlib.plot import plot_bandit_actions
from explorationlib.plot import plot_bandit_critic
from explorationlib.plot import plot_bandit_hist

Set up for pretty plots

In [None]:
# Pretty plots
%matplotlib inline
%config InlineBackend.figure_format='retina'
%config IPCompleter.greedy=True
plt.rcParams["axes.facecolor"] = "white"
plt.rcParams["figure.facecolor"] = "white"
plt.rcParams["font.size"] = "16"

# Dev
%load_ext autoreload
%autoreload 2

Plotting the structure of the new bandit task before and after the change

In [None]:
# Shared env params
seed = 5030

# plot env before
env1 = BanditUniform4(p_min=0.1, p_max=0.3, p_best=0.0)
env1.seed(seed)
plot_bandit(env1, alpha=0.6)

# plot env after
env2 = BanditUniform4(p_min=0.1, p_max=0.3, p_best=0.9)
env2.seed(seed)
plot_bandit(env2, alpha=0.6)

### Create this new changing bandit environment [6 pt]
To make the environment described above, set up a BanditChange4 environment with the following parameters:
- Have the number of trials before the change be 150.
- Have the minimum and maximum probability of reward set to 0.1 and 0.3, respectively.
- Have the probability of reward for the "best" arm actually set to 0.0.
- Have the probability of reward for that arm after the change set to 0.9.
- Set the environment's seed to 5030.
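To build intuition for the task structure in the list above, here is a standalone sketch of a bandit whose designated arm switches reward probability after a fixed number of pulls. The class `ChangingBandit`, its argument names, and the fixed probabilities for the other arms are all assumptions made for illustration; this is not the `BanditChange4` implementation and should not be used in place of it.

```python
import random


class ChangingBandit:
    """Hypothetical bandit: one arm's reward probability changes after `num_change` pulls."""

    def __init__(self, probs, change_arm, p_change, num_change, seed=None):
        self.probs = list(probs)      # reward probability of each arm
        self.change_arm = change_arm  # index of the arm that changes
        self.p_change = p_change      # that arm's probability after the change
        self.num_change = num_change  # number of pulls before the change
        self.rng = random.Random(seed)
        self.t = 0                    # pulls taken so far

    def step(self, arm):
        """Pull `arm` and return a reward of 1 or 0."""
        if self.t == self.num_change:
            self.probs[self.change_arm] = self.p_change
        self.t += 1
        return 1 if self.rng.random() < self.probs[arm] else 0


# Arm 0 pays nothing at first, then pays with probability 0.9 after 150 pulls.
env = ChangingBandit(
    probs=[0.0, 0.1, 0.2, 0.3], change_arm=0, p_change=0.9, num_change=150, seed=5030
)
rewards = [env.step(0) for _ in range(175)]
print("reward before change:", sum(rewards[:150]))
print("reward after change:", sum(rewards[150:]))
```

Here a single arm is pulled on every step only to expose the switch; a learning agent would not know in advance which arm changes, or when.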

In [None]:
# Your code here

### Question 2.1 [7 pt]
When testing later on, we will have each experiment last for 175 steps. What makes this a tricky problem? What would an agent have to do to succeed in this task?

In [None]:
# Write your answer here, as a Python comment. Explain yourself.

### Creating the reinforcement learning agents [4 pt]

In the code cell below, fill in the code for creating each agent. Use the settings from the lab (repeated here for ease):
- Random agent: no settings needed
- Epsilon-greedy agent: use epsilon value of 0.1
- Upper confidence bound agent: use bonus weight of 0.5
- Softmax actor critic: use beta value of 7
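For reference, the three non-random choice rules named in the list above can be sketched in plain Python as follows. These are textbook-style forms written for illustration; the function names are hypothetical, and the exact formulas inside explorationlib's `EpsilonActor`, `SoftmaxActor`, and `CriticUCB` may differ, in particular the form of the UCB bonus.

```python
import math
import random


def epsilon_greedy(values, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the highest-valued arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])


def softmax_probs(values, beta):
    """Turn values into choice probabilities; larger beta means greedier choices."""
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp(beta * (v - top)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def ucb_scores(values, counts, bonus_weight):
    """Add a count-based bonus so rarely tried arms look more attractive (assumed form)."""
    total = sum(counts)
    return [
        v + bonus_weight * math.sqrt(math.log(total + 1) / (n + 1))
        for v, n in zip(values, counts)
    ]


values = [0.1, 0.5, 0.2, 0.3]  # illustrative value estimates, one per arm
rng = random.Random(0)
print(epsilon_greedy(values, epsilon=0.1, rng=rng))
print(softmax_probs(values, beta=7))
print(ucb_scores(values, counts=[5, 40, 5, 5], bonus_weight=0.5))
```

Note how each rule handles an arm whose estimated value is low: epsilon-greedy revisits it only by chance, softmax revisits it in proportion to its relative value, and the bonus-based rule revisits it when it has been tried rarely.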

In [None]:
ran = BanditActorCritic(
    # Fill in random agent code here
    
)

epy = BanditActorCritic(
    # Fill in epsilon greedy agent code here
    
)

ucb = BanditActorCritic(
    # Fill in upper confidence bound agent code here
    
)

sft = BanditActorCritic(
    # Fill in softmax agent code here
    
)


agents = [ran, epy, ucb, sft]
names = ["random", "ep-greedy", "upper conf. bound", "softmax"]
colors = ["blue", "purple", "orange", "red"]

### Run the experiments [6 pt]

Fill in the code cell below to run 500 experiments for each agent, each with 175 steps. Set the seed to 5030 (add a `seed=5030,` line after the line that sets the number of experiments).

In [None]:
# Your code here

### Visualize total rewards [4 pt]

In the code cell below, add code to plot the total reward for each agent type in the experiments.

In [None]:
# Your code here

### Question 2.2 [8 pt]
How did each of the agents do, compared to one another? Why do you think this is the case?

In [None]:
# Write your answer here, as a Python comment. Explain yourself.

### Question 2.3 [8 pt]

Re-run just the experiments and reward plotting with the following random seeds: 6030, 7030, 8030, and 9030. Make sure you are just changing the seed for the experiments, not for the bandit task itself.

How consistent are the results you see? What does this tell you about the stability of the patterns you described in Question 2.2? Why do you think this is the case?
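One simple way to judge consistency is to check whether the ordering of agents by mean total reward is the same for every seed. The sketch below shows that bookkeeping with hypothetical helper names and made-up placeholder numbers (these are not real results); you would substitute the mean total rewards you actually obtain for each seed.

```python
def rank_agents(mean_rewards):
    """Return agent names sorted from highest to lowest mean reward."""
    return sorted(mean_rewards, key=mean_rewards.get, reverse=True)


def rankings_agree(results_by_seed):
    """True if every seed produces the same ordering of agents."""
    rankings = {tuple(rank_agents(r)) for r in results_by_seed.values()}
    return len(rankings) == 1


# Placeholder values for illustration only (NOT real results).
placeholder_results = {
    6030: {"agent_a": 10.0, "agent_b": 20.0, "agent_c": 15.0},
    7030: {"agent_a": 12.0, "agent_b": 19.0, "agent_c": 21.0},
}

for seed, means in placeholder_results.items():
    print(seed, rank_agents(means))
print("same ranking for all seeds:", rankings_agree(placeholder_results))
```

A matching ranking is a coarse check: it ignores how large the differences are relative to the error bars, so it complements the plots rather than replacing them.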

In [None]:
# Write your answer here, as a Python comment. Explain yourself.

**IMPORTANT** Did you collaborate with anyone on this assignment? If so, list their names here. 
> *Write Name(s) here*

**DUE:** 5pm ET, Dec. 9, 2022. Email the link to the completed notebook on your Github repository to the TA and me via Canvas.