# Description

Humans train dogs by rewarding specific actions, so that the dog associates actions and situations with positive values (to be repeated) or negative values (to be avoided). Similarly, artificial agents are trained by reinforcement signals, typically delivered after the agent has achieved a goal. Unfortunately, this process is slow and requires many samples. The aim of this project is to test whether a video-game character can be trained by delivering the reward (a mouse click) at any event of the teacher's choice, not only when a goal is reached. What would be the best strategy for delivering a limited number of rewards (reward shaping)? How does it compare to state-of-the-art reinforcement learning algorithms?


# Methods
We are using:
* DQN algorithm
* gym environment https://github.com/openai/gym
* stable baselines https://github.com/hill-a/stable-baselines
* with pretrained models from zoo https://github.com/araffin/rl-baselines-zoo
* atari game Breakout

To conduct the experiment we will:
* adopt a state-of-the-art model from the zoo
* freeze all convolutional layers (or freeze all layers except the last $n$)
* add noise to the remaining layers
* develop a customized reward mechanism based on a human reaction
  + run and render the environment
  + receive a reward from the user's clicks
  + update the unfrozen layers according to the reward
* run experiments with a human teacher
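
The freeze-and-perturb steps above can be sketched on a plain dictionary of parameters. This is a minimal illustration, not the project's implementation: the parameter names (`conv1/w`, `fc_out/w`), the toy values, and the helper `perturb_unfrozen` are all hypothetical, and the noise is assumed to be zero-mean Gaussian with a hand-picked `sigma`.

```python
import random


def perturb_unfrozen(params, trainable_prefixes, sigma=0.01, seed=0):
    """Return a copy of `params` with Gaussian noise added to trainable entries only.

    params             -- dict mapping a parameter name to a flat list of floats
    trainable_prefixes -- names starting with one of these are treated as unfrozen
    sigma              -- standard deviation of the noise (an assumed value)
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    prefixes = tuple(trainable_prefixes)
    noisy = {}
    for name, values in params.items():
        if name.startswith(prefixes):
            # Unfrozen layer: add zero-mean Gaussian noise to every weight.
            noisy[name] = [v + rng.gauss(0.0, sigma) for v in values]
        else:
            # Frozen layer: copy the weights unchanged.
            noisy[name] = list(values)
    return noisy


# Hypothetical parameter names and toy values, not taken from the zoo model.
params = {
    "conv1/w": [0.5, -0.2, 0.1],
    "conv2/w": [0.3, 0.3],
    "fc_out/w": [1.0, -1.0, 0.0, 0.25],
}
noisy = perturb_unfrozen(params, trainable_prefixes=["fc_out"], sigma=0.05)
```

With stable-baselines, the dictionary would presumably come from `model.get_parameters()` and be written back with `model.load_parameters()`, with the real arrays flattened or handled with NumPy instead of Python lists.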




# Requirements & Installation
We suggest using conda  
```conda create -n clickerlearning```  
```conda activate clickerlearning```  

```conda install python==3.7 --yes && conda install -c conda-forge tensorflow --yes && conda install opencv --yes && pip install gym==0.11.0 stable-baselines```  
stable-baselines has not kept up with the latest changes in gym, so we pin an older version of gym.

# Model adoption

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from stable_baselines.common.cmd_util import make_atari_env # rl-zoo model is custom in contrast to gym defaults

from stable_baselines.deepq.policies import MlpPolicy, CnnPolicy
from stable_baselines import DQN

from stable_baselines.common.vec_env import VecFrameStack

import pickle

In [2]:
env_id = 'BreakoutNoFrameskip-v4'
env = make_atari_env(env_id, num_env=1, seed=0)
# Frame-stacking with 4 frames to fit the pretrained zoo configuration
env = VecFrameStack(env, n_stack=4)

model = DQN(CnnPolicy, env, verbose=1)

# The zoo checkpoint is a pickled (model_dict, model_weights) pair
with open('BreakoutNoFrameskip-v4.pkl', 'rb') as file:
    model_dict, model_weights = pickle.load(file)

model.load_parameters(model_weights)



In [None]:
# observe our agent in action

import time


n_frames = 1000
obs = env.reset()

for _ in range(n_frames):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    print("Action:", action, "Reward:", rewards, "Done:", dones)
    time.sleep(0.01)
env.close()
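
To turn the loop above into a clicker-training loop, the `rewards` returned by `env.step` would be replaced by a signal coming from the teacher. The sketch below shows one possible way to do this with a limited reward budget. The class `ClickRewardSource`, its methods, and the simulated click times are hypothetical; the actual mouse handling depends on the window used for rendering and is not shown.

```python
import queue


class ClickRewardSource:
    """Collects teacher clicks and hands them out as rewards, up to a fixed budget."""

    def __init__(self, budget):
        self.budget = budget          # maximum number of rewards the teacher may give
        self._clicks = queue.Queue()  # thread-safe, so a GUI callback may push from another thread

    def on_click(self):
        """Call this from the mouse-click handler of the rendering window."""
        self._clicks.put(1.0)

    def pop_reward(self):
        """Return 1.0 if a click is pending and budget remains, otherwise 0.0."""
        try:
            self._clicks.get_nowait()
        except queue.Empty:
            return 0.0
        if self.budget <= 0:
            return 0.0  # click ignored: the reward budget is exhausted
        self.budget -= 1
        return 1.0


# Simulated session: the teacher clicks at steps 3, 4 and 7 with a budget of 2.
teacher = ClickRewardSource(budget=2)
simulated_clicks_at = {3, 4, 7}
log = []
for step in range(10):
    if step in simulated_clicks_at:
        teacher.on_click()
    # In the real loop this value would stand in for the environment reward.
    human_reward = teacher.pop_reward()
    log.append(human_reward)
```

In this run the clicks at steps 3 and 4 are rewarded and the click at step 7 is dropped, which makes the budget constraint explicit when comparing reward-delivery strategies.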


