# Deep Crossentropy method

In this notebook, we will implement the deep cross-entropy method to solve [CartPole-v0 in OpenAI Gym](https://gym.openai.com/envs/CartPole-v0).

First, we have to make sure we are connected to the right **Python 3 runtime and using the GPU** (click the 'Runtime' tab and choose 'Change runtime type'), then import the required packages (all are already installed in Google Colab).

Then run the following 2 cells to install the required libraries (this may take a while) and import them to set up our environment. Ignore the warnings (it's tricky to display OpenAI Gym videos in a Colab notebook).

In [0]:
#remove " > /dev/null 2>&1" to see what is going on under the hood
!pip install pyvirtualdisplay > /dev/null 2>&1
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!apt-get install -y cmake > /dev/null 2>&1
!pip install --upgrade setuptools > /dev/null 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install -U pyglet==1.3.2 > /dev/null 2>&1

In [0]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import os
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

## Let's have a look at the environment

In [0]:
env = gym.make("CartPole-v0").env
env.reset()
n_actions = env.action_space.n

plt.imshow(env.render("rgb_array"))


We can see it's a cart and a pole (as the name suggests). It seems a trivial task, but we will use it to demonstrate how a neural network can act as the agent's policy in the deep cross-entropy method.

## Task 1: Play the game using MLP

Instead of using an array to hold our policy and the probability of each action, we will use a simple MLP neural network to output the probabilities. The scikit-learn MLPClassifier will do the job.

In [0]:
#create agent
from sklearn.neural_network import MLPClassifier
agent = MLPClassifier(hidden_layer_sizes=(50,50),
                      activation='tanh',
                      warm_start=True, #keep progress between .fit(...) calls
                      max_iter=1 #make only 1 iteration on each .fit(...)
                     )
#initialize agent to the dimension of state and number of actions
agent.fit([env.reset()]*n_actions, list(range(n_actions)));


In [0]:
def generate_session(t_max=15000):
    
    states,actions = [],[]
    total_reward = 0
    
    s = env.reset()
    
    for t in range(t_max):
        
        # a vector of action probabilities in current state
        probs = <your_code> 
        
        a = <your_code>
        
        new_s, r, done, info = env.step(a)
        
        #record sessions like you did before
        
        <your_code>
        
        s = new_s
        if done: break
    return states, actions, total_reward
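
One possible way to fill in the blanks above is sketched below. It is not the reference solution. It assumes the agent exposes `predict_proba` (as scikit-learn's `MLPClassifier` does) and that the environment follows the older Gym API used in this notebook, where `reset()` returns a state and `step(a)` returns a 4-tuple. To keep the sketch runnable on its own, `agent`, `env` and `n_actions` are passed as arguments instead of being read from globals, and `UniformAgent` and `CountdownEnv` are hypothetical stand-ins, not part of Gym or scikit-learn.

```python
import numpy as np

def generate_session_sketch(agent, env, n_actions, t_max=15000):
    """Play one episode, sampling actions from the agent's predicted probabilities."""
    states, actions = [], []
    total_reward = 0.0
    s = env.reset()
    for t in range(t_max):
        # predict_proba expects a 2D batch, so wrap the single state in a list
        probs = agent.predict_proba([s])[0]
        # sample an action according to the predicted distribution
        a = np.random.choice(n_actions, p=probs)
        new_s, r, done, info = env.step(a)
        # record the state we acted in, the action taken, and the reward
        states.append(s)
        actions.append(a)
        total_reward += r
        s = new_s
        if done:
            break
    return states, actions, total_reward


# Hypothetical stand-ins so the sketch runs without Gym or scikit-learn
class UniformAgent:
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def predict_proba(self, states):
        return np.full((len(states), self.n_actions), 1.0 / self.n_actions)


class CountdownEnv:
    """Toy environment: reward 1 per step, episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return np.zeros(4, dtype=np.float32)

    def step(self, action):
        self.t += 1
        return np.full(4, self.t, dtype=np.float32), 1.0, self.t >= 5, {}


states, actions, total_reward = generate_session_sketch(
    UniformAgent(2), CountdownEnv(), n_actions=2)
print(len(states), total_reward)
```

With the real agent, sampling from `probs` (rather than always taking the most likely action) is what keeps the sessions varied enough for elite selection to have something to choose from.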
        

### Deep Crossentropy method steps
For the elite selection part, Deep CEM uses exactly the same strategy as the regular CEM.

The only difference is that now each observation is not a number but a float32 vector.

In [0]:
def select_elites(states_batch,actions_batch,rewards_batch,percentile=50):
    """
    Select states and actions from games that have rewards >= percentile
    :param states_batch: list of lists of states, states_batch[session_i][t]
    :param actions_batch: list of lists of actions, actions_batch[session_i][t]
    :param rewards_batch: list of total rewards, one per session, rewards_batch[session_i]
    
    :returns: elite_states,elite_actions, both 1D lists of states and respective actions from elite sessions
    
    Please return elite states and actions in their original order 
    [i.e. sorted by session number and timestep within session]
    
    If you're confused, see examples below. Please don't assume that states are integers (in this notebook they are float vectors).
    """
    
    <your_code>
    
    return elite_states, elite_actions
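
As an illustration of the selection rule described in the docstring, here is a minimal sketch on a toy batch. It is one possible implementation under the stated contract (keep sessions with reward >= the percentile threshold, preserve session and timestep order), not necessarily the intended solution. The name `select_elites_sketch` and the toy data are made up for this example.

```python
import numpy as np

def select_elites_sketch(states_batch, actions_batch, rewards_batch, percentile=50):
    """Keep (state, action) pairs from sessions whose total reward >= the threshold."""
    reward_threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    # iterate in session order so elites stay sorted by session, then timestep
    for session_states, session_actions, session_reward in zip(
            states_batch, actions_batch, rewards_batch):
        if session_reward >= reward_threshold:
            elite_states.extend(session_states)
            elite_actions.extend(session_actions)
    return elite_states, elite_actions


# Toy batch: three sessions of different lengths with 2-dimensional float states
states_batch = [
    [np.array([0.0, 0.1]), np.array([0.2, 0.3])],                         # session 0
    [np.array([1.0, 1.1])],                                               # session 1
    [np.array([2.0, 2.1]), np.array([2.2, 2.3]), np.array([2.4, 2.5])],   # session 2
]
actions_batch = [[0, 1], [1], [0, 0, 1]]
rewards_batch = [2, 1, 3]

elite_states, elite_actions = select_elites_sketch(
    states_batch, actions_batch, rewards_batch, percentile=50)
print(elite_actions)
```

Here the 50th percentile of the rewards is 2, so sessions 0 and 2 are kept and session 1 is dropped. Plain Python lists are used for the batches because sessions usually differ in length.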
    

## Here's the training

It's the same: Generate sessions, select N best and fit to those.

In [0]:
from IPython.display import clear_output

def show_progress(batch_rewards, log, percentile, reward_range=[-990,+10]):
    """
    A convenience function that displays training progress. 
    No cool math here, just charts.
    """
    
    mean_reward, threshold = np.mean(batch_rewards), np.percentile(batch_rewards, percentile)
    log.append([mean_reward, threshold])

    clear_output(True)
    print("mean reward = %.3f, threshold=%.3f"%(mean_reward, threshold))
    plt.figure(figsize=[8,4])
    plt.subplot(1,2,1)
    plt.plot(list(zip(*log))[0], label='Mean rewards')
    plt.plot(list(zip(*log))[1], label='Reward thresholds')
    plt.legend()
    plt.grid()
    
    plt.subplot(1,2,2)
    plt.hist(batch_rewards, range=reward_range);
    plt.vlines([np.percentile(batch_rewards, percentile)], [0], [100], label="percentile", color='red')
    plt.legend()
    plt.grid()

    plt.show()


## Task 2: No manual policy update?

The only difference between DCEM and CEM is that instead of manually updating the policy probabilities, we just train the policy agent on the elites.

In [0]:
n_sessions = 100
percentile = 80
log = []

for i in range(100):
    #generate new sessions
    sessions = [<your_code>]

    batch_states,batch_actions,batch_rewards = map(np.array, zip(*sessions))

    elite_states, elite_actions = select_elites(batch_states,batch_actions,batch_rewards,percentile)
    
    <your_code>

    show_progress(batch_rewards, log, percentile, reward_range=[0, np.max(batch_rewards)])
    
    if np.mean(batch_rewards) > 190:
        break
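
The body of one loop iteration can be sketched as follows: sample sessions, pick the elites, then fit the agent on the elite (state, action) pairs as if they were a supervised dataset. This is a rough sketch, not the reference solution. `train_iteration_sketch`, `RecordingAgent` and `make_fake_session_generator` are hypothetical names introduced here so the example runs without Gym or scikit-learn. In the notebook the agent would be the `MLPClassifier` and the session generator would be `generate_session`.

```python
import numpy as np

def train_iteration_sketch(agent, generate_session, n_sessions=100, percentile=80):
    """One Deep CEM iteration: sample sessions, keep the elites, fit the agent on them."""
    sessions = [generate_session() for _ in range(n_sessions)]
    batch_states, batch_actions, batch_rewards = zip(*sessions)

    threshold = np.percentile(batch_rewards, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(batch_states, batch_actions, batch_rewards):
        if reward >= threshold:
            elite_states.extend(states)
            elite_actions.extend(actions)

    # the "policy update": one supervised fit on the elite (state, action) pairs
    agent.fit(elite_states, elite_actions)
    return float(np.mean(batch_rewards)), float(threshold)


# Hypothetical stand-ins so the sketch runs on its own
class RecordingAgent:
    """Remembers what it was fitted on instead of training a network."""
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self


def make_fake_session_generator():
    rewards = iter([1.0, 2.0, 3.0, 4.0, 5.0])

    def fake_session():
        r = next(rewards)
        n_steps = int(r)
        # (states, actions, total_reward), same layout as generate_session()
        return [np.zeros(4)] * n_steps, [0] * n_steps, r

    return fake_session


agent = RecordingAgent()
mean_reward, threshold = train_iteration_sketch(
    agent, make_fake_session_generator(), n_sessions=5, percentile=80)
print(mean_reward, threshold, len(agent.y))
```

One thing to watch for with the real agent: with `warm_start=True`, scikit-learn's `MLPClassifier` may raise an error if the elite actions do not contain every action seen in the first `.fit(...)` call, so it can be worth checking `set(elite_actions)` when training fails unexpectedly.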

## Let's look at results

With this environment, we can save the sessions we generate as mp4 videos. Below we show the last one (which is the one achieving the goal, aka 'winning'). You can find all videos by clicking the arrow on the left and choosing the `files` tab, inside the `videos` folder.

In [0]:
#record sessions
import gym.wrappers
env = gym.wrappers.Monitor(gym.make("CartPole-v0"), directory="videos", force=True)
sessions = [generate_session() for _ in range(100)]
env.close()

In [0]:
#play video
video_names = sorted(filter(lambda s: s.endswith(".mp4"), os.listdir("./videos/")))
mp4 = "./videos/" + video_names[-1]  # last recorded episode (os.listdir order is arbitrary, so sort first)
video = io.open(mp4, 'rb').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls style="height: 400px;">
            <source src="data:video/mp4;base64,{0}" type="video/mp4" />
         </video>'''.format(encoded.decode('ascii')))

## Wanting more? Now what?

By now you should have a high enough score on [CartPole-v0](https://gym.openai.com/envs/CartPole-v0) to consider it solved. It's time to try something harder.

* Pick one of the environments: [MountainCar-v0](https://gym.openai.com/envs/MountainCar-v0) or [LunarLander-v2](https://gym.openai.com/envs/LunarLander-v2).
  * For MountainCar, get an average reward of __at least -150__
  * For LunarLander, get an average reward of __at least +50__

See the tips section below, it's kinda important.
  
  
* Bonus quest: Devise a way to speed up training at least 2x against the default version
  * Obvious improvement: use [joblib](https://www.google.com/search?client=ubuntu&channel=fs&q=joblib&ie=utf-8&oe=utf-8)
  * Try re-using samples from the last 3-5 iterations when computing the threshold and training
  * Experiment with the number of training iterations and the learning rate of the neural network (see params)
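
The sample re-use idea from the list above can be sketched with a fixed-length buffer of recent batches. This is only an illustration under assumptions: the placeholder sessions stand in for real `generate_session()` output, and `pooled_sessions` is a hypothetical helper name.

```python
from collections import deque
import numpy as np

def pooled_sessions(history):
    """Flatten a history of per-iteration session lists into one list."""
    return [session for batch in history for session in batch]


# keep only the 3 most recent iterations (the tip suggests 3-5);
# older batches are dropped automatically once the deque is full
history = deque(maxlen=3)

for iteration in range(5):
    # placeholder sessions in the (states, actions, total_reward) layout;
    # real code would call generate_session() here
    new_sessions = [([np.zeros(4)], [0], float(iteration * 10 + k)) for k in range(4)]
    history.append(new_sessions)

    # threshold (and later the fit) is computed over the pooled recent sessions
    sessions = pooled_sessions(history)
    rewards = [reward for _, _, reward in sessions]
    threshold = np.percentile(rewards, 80)

print(len(sessions), threshold)
```

Pooling lets each iteration generate fewer new sessions for the same amount of training data, at the cost of fitting on sessions produced by a slightly older policy.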
  
  
### Tips & tricks

* Sessions for MountainCar may last for 10k+ ticks. Make sure ```t_max``` param is at least 10k.
 * Also it may be a good idea to cut rewards via ">" and not ">=". If 90% of your sessions get a reward of -10k and 10% are better, then if you use the 20th percentile as the threshold, R >= threshold __fails to cut off bad sessions__ while R > threshold works alright.
* _issue with gym_: Some versions of gym limit game time to 200 ticks. This will prevent CEM training in most cases. Make sure your agent is able to play for the specified __t_max__, and if it isn't, try `env = gym.make("MountainCar-v0").env` or otherwise get rid of the TimeLimit wrapper.
* If you use an old _swig_ lib for LunarLander-v2, you may get an error. See this [issue](https://github.com/openai/gym/issues/100) for a solution.
* If it won't train, it's a good idea to plot the reward distribution and record sessions: they may give you some clue.
* A 20-neuron network is probably not enough, feel free to experiment.
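
To see why the strict comparison matters, here is a small worked example with made-up numbers: a batch of 100 sessions where 90 are stuck at -10000 and 10 did better. The reward values are illustrative assumptions, not measurements.

```python
import numpy as np

# 90 sessions stuck at the time limit, 10 sessions that reached the goal earlier
rewards = np.array([-10000.0] * 90 + [-200.0] * 10)

# the 20th percentile falls inside the block of identical bad rewards
threshold = np.percentile(rewards, 20)

kept_ge = int(np.sum(rewards >= threshold))  # non-strict cut keeps every session
kept_gt = int(np.sum(rewards > threshold))   # strict cut keeps only the better ones

print(threshold, kept_ge, kept_gt)
```

Because the threshold equals the reward of the bad sessions, `>=` keeps all 100 sessions while `>` keeps only the 10 better ones.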