#### TODO
* video
* put in all the data that is needed, except the large ones!

#### DONE
* logit vs log(softmax(logit))
* Hard sampling for agent training.



In [None]:
%load_ext autoreload
%autoreload 2

import numpy as np
from IPython import display
import matplotlib.pyplot as plt

# Introduction
**Deep learning approaches** (sub-symbolic) are capable of processing high-dimensional data like images much better than any previous technique.
They do however have several drawbacks, a major one being that the models are very hard to analyze, making it difficult to know how the system will behave on a particular input, to explain why a particular output was produced, or to debug any issues that might arise.
Traditional **logic-based approaches** to artificial intelligence (symbolic) are bad at dealing with natural data like images, but their internal mechanisms are much easier to understand.
Logic-based approaches also have advantages in some domains, for instance when long-term planning is involved.

Symbolic and sub-symbolic approaches are largely incompatible due to the lack of a coherent way to interface between them [4].
Since the strengths and weaknesses of these fields complement each other so well, such an interface is of great practical and scientific importance.
Asai and Fukunaga [4] showed that it is possible to combine a neural network model and a classical planning algorithm via the use of discrete representations.
Using their approach they were able to solve some simple puzzle-like challenges (8-puzzle, Tower of Hanoi) that had been modified such that they required a computer vision system in order to understand the state of the environment.



Ha and Schmidhuber [3] proposed a simple yet powerful framework, 'World Models', that separates modelling the environment from selecting which action to take.
This framework is very similar to the one proposed by Asai and Fukunaga [4],
with the key differences being that everything is continuous and that a linear agent is used instead of a planner.
Ha and Schmidhuber [3] were able to show that the World Models framework is capable of training large networks and of solving complicated environments (OpenAI CarRacing, ViZDoom).
The agent used is however a simple linear agent, and it is therefore severely limited.


The contribution of Asai and Fukunaga [4] is an important first step towards combining symbolic and sub-symbolic approaches.
The problems tackled are, however, very simple, and how well the proposed methods scale to more difficult problems remains to be seen.
In this notebook we take a small step towards extending the work of Asai and Fukunaga [4] to more complicated problems by combining it with ideas from Ha and Schmidhuber [3].
Specifically, we train a network to learn a discrete representation of the [OpenAI CarRacing](https://gym.openai.com/envs/CarRacing-v0/) environment, and train a simple linear agent to control the car using reinforcement learning.

> ![](CarRacing.png)
> In CarRacing the objective is to control (accelerate, turn, brake) the red car such that it drives along as much of the track as possible in a given amount of time.

## Discrete Representations
A recent yet popular method of creating discrete representations with neural networks is the *Gumbel-softmax reparametrization trick* [5, 6], which is what we will use in this notebook.
The Gumbel-softmax reparametrization trick was already covered in the previous notebook and will not be rehashed in detail here.
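
For quick reference only, the sampling step can be sketched in a few lines of NumPy. This is a minimal illustration and not the code used in `train_Gumbel_VAE`; the function name `gumbel_softmax_sample` and its parameters are assumptions made for this example.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=1.0, hard=False, rng=None):
    """Draw a relaxed one-hot sample from categorical logits (last axis = classes).

    Hypothetical helper for illustration; not part of this project's modules.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(u))
    # Temperature-scaled softmax over the perturbed logits
    y = (logits + gumbel_noise) / temperature
    y = y - y.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    if not hard:
        return soft
    # Hard sampling: replace the relaxed sample by the one-hot vector of its argmax
    return np.eye(logits.shape[-1])[soft.argmax(axis=-1)]

# Example: 5 categorical variables with 4 classes each
logits = np.zeros((5, 4))
soft_sample = gumbel_softmax_sample(logits, temperature=0.5)
hard_sample = gumbel_softmax_sample(logits, temperature=0.5, hard=True)
```

Lower temperatures push the relaxed sample closer to a one-hot vector. In a deep learning framework the hard variant is usually paired with a straight-through gradient estimator, which plain NumPy cannot express.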


## World Models
The World Models framework proposed by Ha and Schmidhuber [3] is an example of *model-based reinforcement learning*, i.e. a model of the environment is learnt, which is then used to train the agent (in part or fully).
This approach has many advantages [1, 2], for instance:
 * **Prior knowledge** of the environment can be encoded more easily than in model-free methods.
 * **Transfer learning** between tasks for the same environment is easily facilitated through models.
 * **High-dimensional observations** introduce several complexities (curse of dimensionality), making it harder to solve a task. Models can typically provide low-dimensional representations that simplify the problem greatly.
 * **Data-efficient learning:** Model-based approaches are generally much more data efficient than model-free ones. This makes a big difference when data is expensive to obtain, as it limits the interactions with the true environment, potentially making training safer and faster when the environment is physical (e.g. robots) or requires complex computations.
 * **Internal simulator** makes it possible to roll out several scenarios before taking an action, enabling the use of methods like Monte Carlo tree search. This also helps with planning by making long-term predictions through rollouts.


The World Models framework consists of three components that work closely together: 
* **Vision (V):** A convolutional variational auto-encoder with a Gaussian latent space ($z$).
* **Memory (M):** An auto-regressive model, implemented as an LSTM.
* **Controller (C):** A simple linear controller.

> ![Worldmodel diagram](images/worldmodel.png)
> Figure 4 from [3]

This setup with distinct modules enables V and M to be trained separately in a fully unsupervised way from data collected from a random policy.
This training setup provides rich training signals, enabling the use of large networks for RL, something that typically isn't possible when using only the reward as it is a relatively weak training signal.
Since C is just a linear model, we can see that most of the environment's complexity must reside in V and M.
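
To make concrete how small such a controller is, here is a minimal sketch of a linear controller that maps the concatenated latent vector and memory state to an action. The class name `LinearController` is hypothetical, and the sizes are assumptions: a 32-dimensional $z$, a 256-dimensional LSTM hidden state $h$ as in the CarRacing setup of [3], and 3 action dimensions. The agent trained later in this notebook may use different inputs.

```python
import numpy as np

class LinearController:
    """Illustrative linear controller: action = W @ [z, h] + b (sizes are assumptions)."""

    def __init__(self, z_dim=32, h_dim=256, action_dim=3, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.W = rng.normal(scale=0.1, size=(action_dim, z_dim + h_dim))
        self.b = np.zeros(action_dim)

    def act(self, z, h):
        # Concatenate the V output (z) and the M state (h), then apply one affine map
        features = np.concatenate([z, h])
        return self.W @ features + self.b

    @property
    def num_params(self):
        return self.W.size + self.b.size

controller = LinearController()
action = controller.act(np.zeros(32), np.zeros(256))
print(controller.num_params)  # 3 * (32 + 256) + 3 = 867
```

With so few parameters, the controller can be optimized with methods that do not scale to large networks, which is one reason for keeping the complexity in V and M.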

Ha and Schmidhuber [3] also show that it is possible to train the agent entirely within hallucinated environments (V and M, rather than the actual environment) and transfer the learnt policy directly to the real environment.

The 'World Models' framework that Ha and Schmidhuber [3] propose combines several deep learning and reinforcement learning techniques into a simple architecture that achieves very good results.
Furthermore, it is possible to train the V and M models in less than an hour on a normal GPU, speeding up development and total training time significantly.

# This notebook

The remainder of this notebook will cover 

For the purposes of this project we will 

We will now demonstrate the code that
1. implements both a continuous and a discrete variational auto-encoder, in order to compare our implementation with the continuous standard.
1. 



Compare continuous VAE and discrete VAE

## Training the auto-encoders
In the code snippets below we will 
1. generate training data for the VAEs

In [None]:
# Generating training data for the VAE

generate_vae_training_data = True
generate_vae_training_data = False

if generate_vae_training_data:
    import generate_VAE_data
    generate_VAE_data.main()
else:
    print('Pass')

In [None]:
# Training / loading the VAEs

train_from_scratch = True
train_from_scratch = False

if train_from_scratch:
    train_from_scratch = False # These commands cannot be run more than once per instance.
    
#     # Train continuous version
#     print('Training continuous VAE')
#     import train_VAE
#     train_VAE.train_vae()
    
    print('Training discrete VAE')
    import train_Gumbel_VAE
    train_Gumbel_VAE.train_vae()
else:  # Load existing weights
    print('Pass')


To compare the two models, we generate a random sequence and visualize the latent space of each.

In [None]:
# Latent space visualization
import train_Gumbel_VAE
import train_VAE 

import visualize_VAE

print('Generating data')
data96, data64 = visualize_VAE.generate_data()
max_len = data64.shape[0]

print("Generating Discrete predictions")
sess_G, model_G = train_Gumbel_VAE.load_vae()
pred_G = model_G.model.predict(data64, verbose=0)
logits, pre_gumbel_softmax, gumbel, hard_sample = model_G.encoder([data64])

print("Generating Continuous predictions")
sess_C, model_C = train_VAE.load_vae()
pred_C, z = model_C.predict(sess_C, data96)

print("DONE")

In [None]:
# Plot Continuous
for n in range(max_len):
    plt.subplot(131)
    plt.imshow(data96[n])  # input frame given to the continuous VAE
    plt.title('data '+str(n))

    plt.subplot(132)
    plt.imshow(np.reshape(z[n], [4,8]), vmin=-5, vmax=5)  # 32-dim latent shown as a 4x8 grid
    plt.title('z')

    plt.subplot(133)
    plt.imshow(pred_C[n])  # reconstruction from the continuous VAE
    plt.title('pred - Continuous')
    
    plt.tight_layout()
    display.clear_output(wait=True)
    plt.show()

sess_C.close()

In [None]:
# Plot Discrete
base_title = 'Gumbel'

for n in range(max_len):
    title = base_title+'-'+str(n) + ' - {}/{}'.format(n, max_len)

    # Assumption: make_image draws onto the matplotlib figure passed as its first argument
    fig = plt.figure()
    model_G.make_image(fig, title + ' i' + str(n), data64[n], logits[n],
                    pre_gumbel_softmax[n],
                    gumbel[n], hard_sample[n], pred_G[n])
    display.clear_output(wait=True)
    plt.show()

sess_G.close()

## Training the agent
> * number of parameters
> * Simplifications / cheats vs Worldmodels paper
> * Loading vs training


In [None]:
# Training / loading the Agent

train_agent_from_scratch = True
train_agent_from_scratch = False

if train_agent_from_scratch:
    import train_agents
    train_agents.main()


> * Video CAE vs DAE
> * Agent learning curve comparison


# Discussion

During training, the agent that used the discrete representation frequently came to a stop, and because of the low stochasticity in the system it would then stay there for the remainder of the episode.
This makes sense, as the discrete latent space is much smaller (in terms of bits) than the continuous one.
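
The size difference can be illustrated with a rough capacity count. The sketch below is only an upper bound on storage and not a measurement of how much information either model actually encodes. The 32-value continuous latent matches the 4x8 reshape used in the plots above, while the discrete layout (32 binary variables) and both function names are hypothetical and chosen purely for illustration.

```python
import math

def continuous_bits(num_floats, bits_per_float=32):
    """Storage upper bound for a continuous latent stored as floating point values."""
    return num_floats * bits_per_float

def discrete_bits(num_variables, num_classes):
    """Storage upper bound for a latent of independent categorical variables."""
    return num_variables * math.log2(num_classes)

print(continuous_bits(32))   # 32 float32 values -> 1024 bits
print(discrete_bits(32, 2))  # hypothetical 32 binary variables -> 32.0 bits
```

A real float32 latent carries far fewer useful bits than this bound, so the gap in practice is smaller than the raw numbers suggest.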





# Acknowledgements
https://medium.com/applied-data-science/how-to-build-your-own-world-model-using-python-and-keras-64fb388ba459

https://github.com/dariocazzani/World-Models-TensorFlow

https://github.com/AppliedDataSciencePartners/WorldModels

https://github.com/hardmaru/WorldModelsExperiments


# References
[1] Shakir Mohamed (2018) A Case Against Generative Models in RL?. Lecture at DALI 2018 Workshop on Generative Models in Reinforcement Learning. https://www.youtube.com/watch?v=EA2RtXsLSWU

[2] Chelsea Finn (2017) Deep RL Bootcamp Lecture 9 Model-based Reinforcement Learning. Lecture at Deep RL Bootcamp. https://www.youtube.com/watch?v=iC2a7M9voYU&feature=youtu.be

[3] Ha and Schmidhuber, "World Models", 2018. https://doi.org/10.5281/zenodo.1207631 https://worldmodels.github.io

[4] Asai, M., & Fukunaga, A. (2017). Classical Planning in Deep Latent Space: Bridging the Subsymbolic-Symbolic Boundary. Retrieved from http://arxiv.org/abs/1705.00154

[5] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. “The Concrete Distribution: a Continuous Relaxation of Discrete Random Variables.” ICLR Submission, 2017.

[6] Eric Jang, Shixiang Gu and Ben Poole. “Categorical Reparameterization by Gumbel-Softmax.” ICLR Submission, 2017.