# Description

Humans train dogs by rewarding specific actions, so that the dog learns to associate certain actions and situations with either positive values (to be repeated) or negative values (to be avoided). Similarly, artificial agents are now trained automatically by reinforcement signals delivered after the agent has achieved a goal. Unfortunately, this process is very slow and requires many samples. The aim of this project is to test whether a video-game character can be trained by delivering the reward (a mouse click) at any event of the teacher's choice, not only when a goal is reached. What would be the best strategy for delivering a limited number of rewards (reward shaping)? How does it compare to state-of-the-art reinforcement learning algorithms?


# Methods
We are using:
* DQN algorithm
* gym environment https://github.com/openai/gym
* stable baselines https://github.com/hill-a/stable-baselines
* with pretrained models from zoo https://github.com/araffin/rl-baselines-zoo
* Atari game Breakout

To conduct the experiment we will:
* adopt a state-of-the-art model from the zoo
* freeze all convolutional layers (or freeze all layers except the last $n$)
* add noise to the remaining layers
* develop a customized reward mechanism based on a human reaction
  + run and render the environment
  + receive a reward from the user's clicks
  + update the unfrozen layers according to the reward
* run experiments with a human teacher
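
The "freeze, then add noise" steps above can be sketched without any deep learning library. This is a minimal illustration, not the project's implementation: `add_noise`, `FROZEN_PREFIXES`, the noise scale `sigma=0.01` and the flat lists standing in for weight arrays are all assumptions. The variable names are taken from the layer listing in the "Explore model layers" section.

```python
import random

# Hypothetical helper, not part of stable-baselines.
# Prefixes of the convolutional layers that stay frozen (names from the layer listing).
FROZEN_PREFIXES = (
    "deepq/model/action_value/c1",
    "deepq/model/action_value/c2",
    "deepq/model/action_value/c3",
)

def add_noise(params, frozen_prefixes=FROZEN_PREFIXES, sigma=0.01, seed=0):
    """Return a copy of params with Gaussian noise added to the unfrozen entries.

    params maps a variable name to a flat list of floats
    (a stand-in for the real weight arrays).
    """
    rng = random.Random(seed)
    noisy = {}
    for name, values in params.items():
        if name.startswith(frozen_prefixes):
            noisy[name] = list(values)  # frozen layer: copied unchanged
        else:
            noisy[name] = [v + rng.gauss(0.0, sigma) for v in values]
    return noisy

params = {
    "deepq/model/action_value/c1/w:0": [0.5, -0.2],
    "deepq/model/action_value/fc1/w:0": [0.1, 0.3],
}
noisy = add_noise(params)
```

Only the `fc1` entry is perturbed here; the `c1` entry is returned as an unchanged copy.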




# Requirements & Installation
We suggest using conda  
```conda create -n clickerlearning```  
```conda activate clickerlearning```  

```conda install python==3.7 --yes && conda install -c conda-forge tensorflow --yes && conda install opencv --yes && conda install jupyter --yes && pip install gym==0.11.0 gym[atari] stable-baselines keyboard```  
stable-baselines has not caught up with the latest changes in gym, so we pin an older version of gym.  
The ```keyboard``` package needs root privileges, so Jupyter has to be started as root:  
```sudo [path to your required environment]/bin/jupyter notebook --allow-root```

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import tensorflow as tf
import gym

from stable_baselines.common.cmd_util import make_atari_env # the rl-zoo model expects the env built by make_atari_env, not a default gym env

from stable_baselines.deepq.policies import MlpPolicy, CnnPolicy
from stable_baselines import DQN

from stable_baselines.common.vec_env import VecFrameStack

import pickle

## Custom Wrapper

In [None]:
from stable_baselines.common.vec_env.base_vec_env import VecEnvWrapper
from stable_baselines.common.vec_env import VecFrameStack

import threading, time
import keyboard

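# pending reward from the human teacher, shared with the listener thread
global_reward = 0.0

def reward_checker():
    """Background listener: set the pending reward each time space is pressed."""
    global global_reward
    while True:
        keyboard.wait('space')
        global_reward = 1.0
        time.sleep(0.05)

# daemon=True so the listener thread does not keep the process alive on exit
threading.Thread(target=reward_checker, daemon=True).start()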

        
class VecRewardWrapper(VecEnvWrapper):
    """Replace the game reward with the reward given by the human teacher."""

    def reset(self):
        """
        Reset all environments
        """
        # frame stacking is done by the wrapped VecFrameStack, so just delegate
        return self.venv.reset()

    def step_wait(self):
        global global_reward
        observations, rewards, dones, infos = self.venv.step_wait()
        # original game reward, printed for reference before it is overwritten
        print(rewards)
        rewards[0] = global_reward
        # reward can be modified here
        global_reward = 0.0
        return observations, rewards, dones, infos


env_id = 'BreakoutNoFrameskip-v4'
env = make_atari_env(env_id, num_env=1, seed=0)
env = VecFrameStack(env, n_stack=4)
env = VecRewardWrapper(env)


model = DQN(CnnPolicy, env, verbose=2)

# load the pretrained zoo weights; the with-block closes the file afterwards
with open('BreakoutNoFrameskip-v4.pkl', 'rb') as file:
    model_dict, model_weights = pickle.load(file)
model.load_parameters(model_weights)
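
The wrapper above passes the click from the listener thread to `step_wait` through a plain global variable. As a possible alternative, the sketch below keeps the pending reward behind a lock and also enforces a limited number of rewards, as asked in the Description. `RewardLatch` and its `budget` parameter are hypothetical names introduced for illustration; they are not part of stable-baselines or `keyboard`.

```python
import threading

class RewardLatch:
    """Hypothetical thread-safe holder for a pending click reward."""

    def __init__(self, budget=None):
        self._lock = threading.Lock()
        self._pending = 0.0
        self._budget = budget  # maximum number of rewards; None means unlimited

    def click(self, value=1.0):
        """Called from the listener thread when the teacher presses the key."""
        with self._lock:
            if self._pending:
                return True  # a reward is already waiting; do not spend budget twice
            if self._budget is not None:
                if self._budget <= 0:
                    return False  # budget exhausted, click ignored
                self._budget -= 1
            self._pending = value
            return True

    def consume(self):
        """Called once per environment step: return the pending reward and clear it."""
        with self._lock:
            reward, self._pending = self._pending, 0.0
            return reward

latch = RewardLatch(budget=2)
latch.click()
first = latch.consume()   # reward from the click
second = latch.consume()  # nothing pending any more
```

With such a latch, `reward_checker` would call `latch.click()` and `step_wait` would write `latch.consume()` into `rewards[0]`.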

# Demo

In [None]:
# observe our agent in action
# a notebook is probably not the best environment for this
# the reward is collected from the global_reward listener, which waits for space to be pressed

n_frames = 1000
obs = env.reset()

for _ in range(n_frames):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    print("Action:", action, "Reward:", rewards, "Done:", dones)
    time.sleep(0.1)
env.close()
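
The demo loop only collects the human rewards; it does not train yet. For the "update the unfrozen layers according to the reward" step, one-step Q-learning (the basis of DQN) regresses the predicted action value toward the target $r + \gamma \max_{a'} Q(s', a')$, or just $r$ on a terminal step. The sketch below computes these targets in plain Python. `td_targets`, the discount `gamma=0.99` and the example Q-values are assumptions for illustration, and a real DQN would typically take `next_q_values` from a separate target network.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step Q-learning targets: r + gamma * max_a Q(s', a), or r at episode end."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets

# Two transitions with four actions each (Breakout has four actions):
# a click (reward 1.0) mid-episode, and no click on a terminal step.
targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.2, 0.5, 0.1, 0.0], [0.3, 0.4, 0.0, 0.1]],
    dones=[False, True],
)
```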




# Explore model layers

In [42]:
model.sess.graph.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='deepq')

[<tf.Variable 'deepq/eps:0' shape=() dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/c1/w:0' shape=(8, 8, 4, 32) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/c1/b:0' shape=(1, 32, 1, 1) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/c2/w:0' shape=(4, 4, 32, 64) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/c2/b:0' shape=(1, 64, 1, 1) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/c3/w:0' shape=(3, 3, 64, 64) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/c3/b:0' shape=(1, 64, 1, 1) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/fc1/w:0' shape=(3136, 512) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/fc1/b:0' shape=(512,) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/fully_connected/weights:0' shape=(512, 4) dtype=float32_ref>,
 <tf.Variable 'deepq/model/action_value/fully_connected/biases:0' shape=(4,) dtype=float32_ref>,
 <tf.Variable 'deepq/model/state_value/fully_
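
Given a listing like the one above, the "freeze all layers except the last $n$" option from the Methods section can be sketched as a split over variable names. `split_last_n_layers` is a hypothetical helper. The list below contains only the `action_value` variables that are fully visible in the listing; `deepq/eps:0` is left out because it is not a layer weight. How the trainable names are then passed to the optimizer is not shown.

```python
def split_last_n_layers(var_names, n):
    """Split variable names into (frozen, trainable), keeping the last n layers trainable.

    A layer is everything before the final path component, so the weights
    and biases of one layer always end up on the same side.
    """
    layers = []  # layer prefixes in first-seen order
    for name in var_names:
        layer = name.rsplit("/", 1)[0]
        if layer not in layers:
            layers.append(layer)
    trainable_layers = set(layers[-n:]) if n > 0 else set()
    frozen = [v for v in var_names if v.rsplit("/", 1)[0] not in trainable_layers]
    trainable = [v for v in var_names if v.rsplit("/", 1)[0] in trainable_layers]
    return frozen, trainable

names = [
    "deepq/model/action_value/c1/w:0",
    "deepq/model/action_value/c1/b:0",
    "deepq/model/action_value/c2/w:0",
    "deepq/model/action_value/c2/b:0",
    "deepq/model/action_value/c3/w:0",
    "deepq/model/action_value/c3/b:0",
    "deepq/model/action_value/fc1/w:0",
    "deepq/model/action_value/fc1/b:0",
    "deepq/model/action_value/fully_connected/weights:0",
    "deepq/model/action_value/fully_connected/biases:0",
]
# Keep the two fully connected layers trainable, freeze the three conv layers.
frozen, trainable = split_last_n_layers(names, n=2)
```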

## Generating TensorBoard files

In [None]:
model = DQN(CnnPolicy, env, verbose=2)

# load the pretrained zoo weights; the with-block closes the file afterwards
with open('BreakoutNoFrameskip-v4.pkl', 'rb') as file:
    model_dict, model_weights = pickle.load(file)
model.load_parameters(model_weights)

with model.sess as sess:
    writer = tf.summary.FileWriter("tensor_files", sess.graph)
    sess.run(model.sess.graph.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='deepq')[-1])
    writer.close()

After generating the files, your project should contain a directory called ```tensor_files```. To view its contents in TensorBoard, run ```tensorboard --logdir=tensor_files``` from the project directory.