## Table of Contents
1. [Creating Agents](#ca)
2. [Training and Evaluation](#trn)
3. [Results](#res)
4. [Next Steps](#ns)
5. [References](#ref)

Load the relevant libraries.

In [1]:
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()  # the agents below take a TF1-style session (`sess`)

from recsim.environments import interest_evolution
from recsim.agents import full_slate_q_agent, slate_decomp_q_agent
from recsim.simulator import runner_lib

Instructions for updating:
non-resource variables are not supported in the long term


<a id='ca'></a>

## Creating Agents

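This notebook compares two Q-learning agents from RecSim. `full_slate_q_agent` is a standard, non-decomposed Q-learning agent that treats each entire slate as a single action, while `slate_decomp_q_agent` implements SlateQ, which decomposes the value of a slate into item-level Q-values.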
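To make the decomposition idea concrete, here is a minimal sketch of the SlateQ value decomposition, Q(s, A) = Σ_{i∈A} P(i | s, A) · Q(s, i): the value of a slate is the choice-probability-weighted sum of item-level Q-values. The function name, the simple score-proportional choice model with a no-click option, and all numbers are assumptions made for illustration; this is not RecSim's implementation.

```python
def decomposed_slate_q(item_q, item_scores, slate, no_click_score=1.0):
    """Slate value as the choice-weighted sum of item-level Q-values.

    item_q:       hypothetical item-level Q-values Q(s, i), one per candidate
    item_scores:  hypothetical unnormalized user-choice scores per candidate
    slate:        indices of the candidates shown to the user
    """
    scores = [item_scores[i] for i in slate]
    # Assumed choice model: P(i | s, A) is proportional to the item's score,
    # with extra mass reserved for the user clicking nothing.
    normalizer = sum(scores) + no_click_score
    return sum((s / normalizer) * item_q[i] for s, i in zip(scores, slate))


# Illustrative values only.
item_q = [1.0, 2.0, 4.0]
item_scores = [1.0, 1.0, 2.0]
q = decomposed_slate_q(item_q, item_scores, slate=[1, 2])
print(q)  # 0.25 * 2.0 + 0.5 * 4.0 = 2.5
```

Because only item-level values are learned, the number of quantities to estimate grows with the number of candidates rather than with the number of possible slates.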

In [3]:
# Factory functions for the standard full-slate Q-agent and the SlateQ agent

def create_q_agent(sess, environment, eval_mode, summary_writer=None):
    """
    Standard, non-decomposed Q-learning agent.
    """
    kwargs = {
      'observation_space': environment.observation_space,
      'action_space': environment.action_space,
      'summary_writer': summary_writer,
      'eval_mode': eval_mode,
    }
    return full_slate_q_agent.FullSlateQAgent(sess, **kwargs)



def create_decomp_q_agent(sess, environment, eval_mode, summary_writer=None):
    """
    One variant of the decomposed agent featured in the SlateQ paper.
    """
    kwargs = {
      'observation_space': environment.observation_space,
      'action_space': environment.action_space,
      'summary_writer': summary_writer,
      'eval_mode': eval_mode,
    }
    return slate_decomp_q_agent.create_agent(
        agent_name='slate_optimal_optimal_q', sess=sess, **kwargs)
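
Both factory functions build the same keyword arguments, so one possible refactoring is to share that step. The sketch below is only a suggestion: `build_agent_kwargs` is a hypothetical helper name, and the `SimpleNamespace` object is a stand-in for a real RecSim environment so that the snippet runs on its own.

```python
from types import SimpleNamespace


def build_agent_kwargs(environment, eval_mode, summary_writer=None):
    """Collect the arguments that both agent factories pass through."""
    return {
        'observation_space': environment.observation_space,
        'action_space': environment.action_space,
        'summary_writer': summary_writer,
        'eval_mode': eval_mode,
    }


# Stand-in environment (assumption): real code would pass the RecSim environment.
fake_env = SimpleNamespace(observation_space='obs-space', action_space='act-space')
kwargs = build_agent_kwargs(fake_env, eval_mode=False)
print(sorted(kwargs))
```

Each factory could then call its agent constructor with `**build_agent_kwargs(environment, eval_mode, summary_writer)`.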

<a id='trn'></a>

## Training the Agents

### Configuring the Environment

The agent will select a slate of 2 documents as recommendations for the user from a corpus of 10 candidate documents.
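
For a sense of scale, the sketch below counts how many distinct slates exist for a slate of 2 drawn from 10 candidates, and how quickly that number grows for larger slates. Whether a particular agent enumerates ordered or unordered slates is an assumption here, so both counts are shown; `count_slates` is a hypothetical helper.

```python
import math


def count_slates(num_candidates, slate_size, ordered=True):
    """Number of distinct slates without repeated documents."""
    if ordered:
        return math.perm(num_candidates, slate_size)
    return math.comb(num_candidates, slate_size)


ordered = count_slates(10, 2)                   # slate positions matter
unordered = count_slates(10, 2, ordered=False)  # positions ignored
larger = count_slates(10, 5)                    # same corpus, slate of 5
print(ordered, unordered, larger)  # 90 45 30240
```

An agent that treats every slate as a separate action must rank all of these, which is why slate size has such a strong effect on the cost of non-decomposed Q-learning.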

In [4]:
# environment config

seed = 0
np.random.seed(seed)
env_config = {
  'num_candidates': 10,
  'slate_size': 2,
  'resample_documents': True,
  'seed': seed,
  }

### Training the Standard Full Slate Q-Agent

We train the agent for 10 iterations (`num_iterations=10`), with `max_training_steps=50` per iteration.
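
The sketch below puts this training budget in perspective. It multiplies the two runner settings to get a nominal step count and compares it with the `min_replay_history: 20000` value that the training log for this run reports. How the agent uses `min_replay_history` internally (for example, whether gradient updates wait until that many transitions are stored) is an assumption not verified in this notebook, and the runner may take a few more steps than the nominal budget if it finishes episodes in progress.

```python
num_iterations = 10
max_training_steps = 50
min_replay_history = 20000  # value printed in the training log

# Nominal number of environment steps across the whole experiment.
nominal_steps = num_iterations * max_training_steps

# Fraction of the replay warm-up threshold covered by this run.
warmup_fraction = nominal_steps / min_replay_history
print(nominal_steps, warmup_fraction)  # 500 0.025
```

If the assumption holds, a run this short is best read as a smoke test of the pipeline rather than as a converged comparison.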

In [5]:
# training

tmp_q_dir = './results/fullslate_q/'
runner = runner_lib.TrainRunner(
    base_dir=tmp_q_dir,
    create_agent_fn=create_q_agent,
    env=interest_evolution.create_environment(env_config),
    episode_log_file="",
    max_training_steps=50,
    num_iterations=10)
runner.run_experiment()

INFO:tensorflow:max_training_steps = 50, number_iterations = 10,checkpoint frequency = 1 iterations.
INFO:tensorflow:max_steps_per_episode = 27000
INFO:tensorflow:Creating FullSlateQAgent agent with the following parameters:
INFO:tensorflow:	 gamma: 0.990000
INFO:tensorflow:	 update_horizon: 1.000000
INFO:tensorflow:	 min_replay_history: 20000
INFO:tensorflow:	 update_period: 4
INFO:tensorflow:	 target_update_period: 8000
INFO:tensorflow:	 epsilon_train: 0.010000
INFO:tensorflow:	 epsilon_eval: 0.001000
INFO:tensorflow:	 epsilon_decay_period: 250000
INFO:tensorflow:	 tf_device: /cpu:*
INFO:tensorflow:	 use_staging: True
INFO:tensorflow:	 optimizer: <tensorflow.python.training.rmsprop.RMSPropOptimizer object at 0x13752b6d8>
INFO:tensorflow:	 max_tf_checkpoints_to_keep: 4
INFO:tensorflow:Creating a OutOfGraphReplayBuffer replay memory with the following parameters:
INFO:tensorflow:	 observation_shape: (11, 20)
INFO:tensorflow:	 observation_dtype: float32
INFO:tensorflow:	 terminal_dtype: <class 'numpy.uint8'>
INFO:tensorflow:	 stack_size: 1
INFO:tensorflow:	 replay_capacity: 1000000
INFO:tensorflow:	 batch_size: 32
INFO:tensorflow:	 update_horizon: 1
INFO:tensorflow:	 gamma: 0.990000

Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.

Instructions for updating:
If using Keras pass *_constraint arguments to layers.

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor

Instructions for updating:
Please use tf.global_variables instead.

INFO:tensorflow:legacy_checkpoint_load: False
INFO:tensorflow:Beginning training...
INFO:tensorflow:Starting iteration 0
INFO:tensorflow:Starting iteration 1
INFO:tensorflow:Starting iteration 2
INFO:tensorflow:Starting iteration 3
INFO:tensorflow:Starting iteration 4

Instructions for updating:
Use standard file APIs to delete files with this prefix.

INFO:tensorflow:Starting iteration 5
INFO:tensorflow:Starting iteration 6
INFO:tensorflow:Starting iteration 7
INFO:tensorflow:Starting iteration 8
INFO:tensorflow:Starting iteration 9


### Evaluating the Standard Full Slate Q-Agent

In [6]:
# evaluating

runner = runner_lib.EvalRunner(
    base_dir=tmp_q_dir,
    create_agent_fn=create_q_agent,
    env=interest_evolution.create_environment(env_config),
    max_eval_episodes=5,
    test_mode=True)
runner.run_experiment()

INFO:tensorflow:max_eval_episodes = 5
INFO:tensorflow:max_steps_per_episode = 27000
INFO:tensorflow:Creating FullSlateQAgent agent with the following parameters:
INFO:tensorflow:	 gamma: 0.990000
INFO:tensorflow:	 update_horizon: 1.000000
INFO:tensorflow:	 min_replay_history: 20000
INFO:tensorflow:	 update_period: 4
INFO:tensorflow:	 target_update_period: 8000
INFO:tensorflow:	 epsilon_train: 0.010000
INFO:tensorflow:	 epsilon_eval: 0.001000
INFO:tensorflow:	 epsilon_decay_period: 250000
INFO:tensorflow:	 tf_device: /cpu:*
INFO:tensorflow:	 use_staging: True
INFO:tensorflow:	 optimizer: <tensorflow.python.training.rmsprop.RMSPropOptimizer object at 0x174647198>
INFO:tensorflow:	 max_tf_checkpoints_to_keep: 4
INFO:tensorflow:Creating a OutOfGraphReplayBuffer replay memory with the following parameters:
INFO:tensorflow:	 observation_shape: (11, 20)
INFO:tensorflow:	 observation_dtype: float32
INFO:tensorflow:	 terminal_dtype: <class 'numpy.uint8'>
INFO:tensorflow:	 stack_size: 1
INFO:tensorflow:	 replay_capacity: 1000000
INFO:tensorflow:	 batch_size: 32
INFO:tensorflow:	 update_horizon: 1
INFO:tensorflow:	 gamma: 0.990000
INFO:tensorflow:legacy_checkpoint_load: False
INFO:tensorflow:Beginning evaluation...
INFO:tensorflow:Restoring parameters from ./results/fullslate_q/train/checkpoints/tf_ckpt-9
INFO:tensorflow:eval_file: ./results/fullslate_q/eval_5/returns_818


### Training the Slate-Decomposition Q-Agent

In [7]:
# training

tmp_decomp_q_dir = './results/decomp_q/'
runner = runner_lib.TrainRunner(
    base_dir=tmp_decomp_q_dir,
    create_agent_fn=create_decomp_q_agent,
    env=interest_evolution.create_environment(env_config),
    episode_log_file="",
    max_training_steps=50,
    num_iterations=10)
runner.run_experiment()

INFO:tensorflow:max_training_steps = 50, number_iterations = 10,checkpoint frequency = 1 iterations.
INFO:tensorflow:max_steps_per_episode = 27000
INFO:tensorflow:Creating SlateDecompQAgent agent with the following parameters:
INFO:tensorflow:	 gamma: 0.990000
INFO:tensorflow:	 update_horizon: 1.000000
INFO:tensorflow:	 min_replay_history: 20000
INFO:tensorflow:	 update_period: 4
INFO:tensorflow:	 target_update_period: 8000
INFO:tensorflow:	 epsilon_train: 0.010000
INFO:tensorflow:	 epsilon_eval: 0.001000
INFO:tensorflow:	 epsilon_decay_period: 250000
INFO:tensorflow:	 tf_device: /cpu:*
INFO:tensorflow:	 use_staging: True
INFO:tensorflow:	 optimizer: <tensorflow.python.training.rmsprop.RMSPropOptimizer object at 0x1b4ee5320>
INFO:tensorflow:	 max_tf_checkpoints_to_keep: 4
INFO:tensorflow:Creating a OutOfGraphReplayBuffer replay memory with the following parameters:
INFO:tensorflow:	 observation_shape: (11, 20)
INFO:tensorflow:	 observation_dtype: float32
INFO:tensorflow:	 terminal_dtype: <class 'numpy.uint8'>
INFO:tensorflow:	 stack_size: 1
INFO:tensorflow:	 replay_capacity: 1000000
INFO:tensorflow:	 batch_size: 32
INFO:tensorflow:	 update_horizon: 1
INFO:tensorflow:	 gamma: 0.990000
Tensor("doc_affinity_scores_ph:0", shape=(10,), dtype=float32) Tensor("networks/strided_slice:0", shape=(10,), dtype=float32, device=/device:CPU:*)

Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:

Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.

INFO:tensorflow:legacy_checkpoint_load: False
INFO:tensorflow:Beginning training...
INFO:tensorflow:Starting iteration 0
INFO:tensorflow:Starting iteration 1
INFO:tensorflow:Starting iteration 2
INFO:tensorflow:Starting iteration 3
INFO:tensorflow:Starting iteration 4
INFO:tensorflow:Starting iteration 5
INFO:tensorflow:Starting iteration 6
INFO:tensorflow:Starting iteration 7
INFO:tensorflow:Starting iteration 8
INFO:tensorflow:Starting iteration 9


### Evaluating the Slate-Decomposition Q-Agent

In [8]:
# evaluating

runner = runner_lib.EvalRunner(
    base_dir=tmp_decomp_q_dir,
    create_agent_fn=create_decomp_q_agent,
    env=interest_evolution.create_environment(env_config),
    max_eval_episodes=5,
    test_mode=True)
runner.run_experiment()

INFO:tensorflow:max_eval_episodes = 5
INFO:tensorflow:max_steps_per_episode = 27000
INFO:tensorflow:Creating SlateDecompQAgent agent with the following parameters:
INFO:tensorflow:	 gamma: 0.990000
INFO:tensorflow:	 update_horizon: 1.000000
INFO:tensorflow:	 min_replay_history: 20000
INFO:tensorflow:	 update_period: 4
INFO:tensorflow:	 target_update_period: 8000
INFO:tensorflow:	 epsilon_train: 0.010000
INFO:tensorflow:	 epsilon_eval: 0.001000
INFO:tensorflow:	 epsilon_decay_period: 250000
INFO:tensorflow:	 tf_device: /cpu:*
INFO:tensorflow:	 use_staging: True
INFO:tensorflow:	 optimizer: <tensorflow.python.training.rmsprop.RMSPropOptimizer object at 0x170e2cac8>
INFO:tensorflow:	 max_tf_checkpoints_to_keep: 4
INFO:tensorflow:Creating a OutOfGraphReplayBuffer replay memory with the following parameters:
INFO:tensorflow:	 observation_shape: (11, 20)
INFO:tensorflow:	 observation_dtype: float32
INFO:tensorflow:	 terminal_dtype: <class 'numpy.uint8'>
INFO:tensorflow:	 stack_size: 1
INFO:tensorflow:	 replay_capacity: 1000000
INFO:tensorflow:	 batch_size: 32
INFO:tensorflow:	 update_horizon: 1
INFO:tensorflow:	 gamma: 0.990000
Tensor("doc_affinity_scores_ph:0", shape=(10,), dtype=float32) Tensor("networks/strided_slice:0", shape=(10,), dtype=float32, device=/device:CPU:*)
INFO:tensorflow:legacy_checkpoint_load: False
INFO:tensorflow:Beginning evaluation...
INFO:tensorflow:Restoring parameters from ./results/decomp_q/train/checkpoints/tf_ckpt-9
INFO:tensorflow:eval_file: ./results/decomp_q/eval_5/returns_815


<a id='res'></a>

## Results

### Standard Full Slate Q-Agent Results

In [9]:
%load_ext tensorboard

In [11]:
%tensorboard --logdir=./results/fullslate_q/ --port=8001

### Google's SlateQ-Agent Results

In [13]:
%tensorboard --logdir=./results/decomp_q/ --port=8002

In this short run, the TensorBoard dashboards show the SlateQ agent outperforming the standard Q-learning agent: it achieves a higher click-through rate (CTR) and a higher average reward per episode.

<a id='ns'></a>

## Next Steps

1. Vary the discount factor `gamma` to see how the agents' performance changes.

2. Train on GPUs so that training an agent that selects slates of more than 2 items becomes practical.
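
As a starting point for the first item, the sketch below shows how the discount factor shapes what the agent optimizes. Both runs above used `gamma: 0.990000` according to the logs. The helper names and the reward sequence are hypothetical, and `1 / (1 - gamma)` is used only as the usual rule-of-thumb "effective horizon", not as a RecSim quantity.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that meaningfully affect the return."""
    return 1.0 / (1.0 - gamma)


rewards = [1.0, 1.0, 1.0]  # illustrative per-step rewards
print(discounted_return(rewards, 0.5))   # 1 + 0.5 + 0.25 = 1.75
print(round(effective_horizon(0.99)))    # 100
print(round(effective_horizon(0.9)))     # 10
```

Lower values of `gamma` make the agent favor immediate engagement, while values close to 1 weight long-term user behavior more heavily.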

<a id='ref'></a>

## References

[RecSim Colab notebooks](https://github.com/google-research/recsim/tree/master/recsim/colab)