# Homework part II

### Deep crossentropy method

By now you should have scored enough on [CartPole-v0](https://gym.openai.com/envs/CartPole-v0) to consider it solved (see the link). It's time to upload the result and move on to something harder.

* If you have any trouble with CartPole-v0 and feel stuck, feel free to ask us or your peers for help.

### Tasks

* __2.1__ __(5 pts)__ Pick one of the environments: MountainCar-v0 or LunarLander-v2.
  * For MountainCar, get an average reward of __at least -150__
  * For LunarLander, get an average reward of __at least +50__

See the tips section below; it's kinda important.

__Note:__ If your agent is below the target score, you'll still get most of the points depending on the result, so don't be afraid to submit it.
  
  
* __2.2__ __(bonus: 5 pts each)__ Devise a way to speed up training by at least 2x compared to the default version
  * Obvious improvement: use [joblib](https://www.google.com/search?client=ubuntu&channel=fs&q=joblib&ie=utf-8&oe=utf-8)
  * Try re-using samples from the last 3-5 iterations when computing the threshold and training
  * Experiment with the number of training iterations and the learning rate of the neural network (see params), and show graphs for different params
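
A minimal sketch of the sample re-use idea from the list above, assuming sessions are the `(states, actions, total_reward)` tuples used in this notebook. `SessionBuffer` and its methods are hypothetical names made up for illustration; they are not part of the assignment code, and this is only one possible way to keep a sliding window of sessions.

```python
from collections import deque


class SessionBuffer:
    """Hypothetical helper: keeps the sessions of the last few iterations."""

    def __init__(self, n_iterations=3):
        # A deque with maxlen drops the oldest batch once it is full
        self.batches = deque(maxlen=n_iterations)

    def add(self, sessions):
        """Store the sessions generated in one training iteration."""
        self.batches.append(list(sessions))

    def pooled(self):
        """Return (states_batch, actions_batch, rewards_batch) over all stored iterations."""
        all_sessions = [s for batch in self.batches for s in batch]
        return tuple(zip(*all_sessions))


buffer = SessionBuffer(n_iterations=3)
for iteration in range(5):
    # Fake one-step sessions: (states, actions, total_reward)
    buffer.add([([iteration], [0], -100.0 - iteration)])

states_batch, actions_batch, rewards_batch = buffer.pooled()
print(rewards_batch)  # (-102.0, -103.0, -104.0): only the last 3 iterations remain
```

The pooled batches could then go into the threshold computation and into `agent.fit` in place of the current iteration's batches. Older sessions come from an older policy, so the window should probably stay short.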
  
  
### Tips
* Gym page: [mountaincar](https://gym.openai.com/envs/MountainCar-v0), [lunarlander](https://gym.openai.com/envs/LunarLander-v2)
* Sessions for MountainCar may last for 10k+ ticks. Make sure ```t_max``` param is at least 10k.
  * It may also be a good idea to cut rewards via ">" and not ">=". If 80% of your sessions get a reward of -10k and 20% are better, then with the 20th percentile as the threshold, R >= threshold __fails to cut off bad sessions__ while R > threshold works alright.
* _Issue with gym_: Some versions of gym limit game time to 200 ticks. This will prevent CEM training in most cases. Make sure your agent is able to play for the specified ```t_max```, and if it isn't, try `env = gym.make("MountainCar-v0").env` or otherwise get rid of the TimeLimit wrapper.
* If you use an old _swig_ lib for LunarLander-v2, you may get an error. See this [issue](https://github.com/openai/gym/issues/100) for a solution.
* If it won't train, it's a good idea to plot the reward distribution and record sessions: they may give you some clue.
* A 20-neuron network is probably not enough, feel free to experiment.
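
To see why the strict comparison matters, here is a small sketch on a made-up reward batch (the numbers are assumptions for illustration, not measured results): 80 sessions hit the -10k floor and 20 do better, and the threshold is the 20th percentile.

```python
import numpy as np

# Hypothetical batch: 80 sessions ran out of time, 20 reached the goal earlier
rewards = np.array([-10000.0] * 80 + [-9000.0] * 20)

threshold = np.percentile(rewards, 20)
kept_ge = int((rewards >= threshold).sum())  # sessions kept by R >= threshold
kept_gt = int((rewards > threshold).sum())   # sessions kept by R > threshold

print(threshold)  # -10000.0
print(kept_ge)    # 100: nothing is filtered out
print(kept_gt)    # 20: only the better sessions remain
```

The threshold lands exactly on the floor value, so `>=` keeps every session and the agent would be trained on the bad ones as well.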

In [None]:
import gym
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make("MountainCar-v0").env #choose 1 env - for example: gym.make("MountainCar-v0").env
env.reset()
n_actions = env.action_space.n

In [None]:
def generate_session(t_max=10000):
    states,actions = [],[]
    total_reward = 0
    s = env.reset()
    for t in range(t_max):
        try:
            probs = agent.predict_proba([s.tolist()])[0]
        except Exception:
            # Most likely the agent has not been fit yet: fall back to uniform random actions
            probs = [1. / n_actions for i in range(n_actions)]
        a = np.random.choice(np.arange(n_actions), p=probs)
        new_s,r,done,info = env.step(a)
        states.append(s)
        actions.append(a)
        total_reward += r
        s = new_s
        if done: break
    return states,actions,total_reward

In [None]:
from IPython.display import clear_output
def show_progress(rewards_batch,log,percentile=40, reward_range=[-5000,+5000]):
    """
    A convenience function that displays training progress. 
    No cool math here, just charts.
    """
    
    mean_reward = np.mean(rewards_batch)
    threshold = np.percentile(rewards_batch,percentile)
    log.append([mean_reward,threshold])

    clear_output(True)
    print("mean reward = %.3f, threshold=%.3f"%(mean_reward,threshold))
    plt.figure(figsize=[8,4])
    plt.subplot(1,2,1)
    plt.plot(list(zip(*log))[0],label='Mean rewards')
    plt.plot(list(zip(*log))[1],label='Reward thresholds')
    plt.legend()
    plt.grid()
    plt.subplot(1,2,2)
    # NOTE: this overrides the reward_range argument with a hard-coded range
    reward_range=[-1010,+10]
    plt.hist(rewards_batch,range=reward_range);
    plt.vlines([threshold],[0],[100],label="percentile",color='red')
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
def select_elites(states_batch,actions_batch,rewards_batch,percentile=50):
    """
    Select states and actions from games whose reward is strictly above the given percentile (R > threshold)
    :param states_batch: list of lists of states, states_batch[session_i][t]
    :param actions_batch: list of lists of actions, actions_batch[session_i][t]
    :param rewards_batch: list of rewards, rewards_batch[session_i]
    
    :returns: elite_states,elite_actions, both 1D lists of states and respective actions from elite sessions
    
    Please return elite states and actions in their original order 
    [i.e. sorted by session number and timestep within session]
    
    If you're confused, see examples below. Please don't assume that states are integers (they will be different later).
    """ 
    reward_threshold = np.percentile(rewards_batch, percentile) # sessions strictly above this reward are elite
    elite_states = []
    elite_actions = []
    for i in range(len(rewards_batch)):
        if rewards_batch[i] > reward_threshold:
            elite_states.append(states_batch[i])
            elite_actions.append(actions_batch[i])
    if(len(np.array(elite_states[0]).shape) == 1):
        elite_states = np.hstack(elite_states)
    else:
        elite_states = np.vstack(elite_states)
    elite_actions = np.hstack(elite_actions)
    return elite_states,elite_actions
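
A toy example of what the selection is expected to return, with integer states so the result is easy to check by hand. `select_elites_sketch` is a hypothetical, simplified stand-in that returns plain lists; it is not the function from the cell above, only a sketch of the same strict-threshold rule.

```python
import numpy as np


def select_elites_sketch(states_batch, actions_batch, rewards_batch, percentile=50):
    """Simplified stand-in: keep sessions whose reward is strictly above the percentile."""
    threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(states_batch, actions_batch, rewards_batch):
        if reward > threshold:  # strict comparison, as in the cell above
            elite_states.extend(states)
            elite_actions.extend(actions)
    return elite_states, elite_actions


states_batch = [[1, 2, 3], [4, 2, 0, 2], [3, 1]]
actions_batch = [[0, 2, 4], [3, 2, 0, 1], [3, 3]]
rewards_batch = [3, 4, 5]

print(select_elites_sketch(states_batch, actions_batch, rewards_batch, percentile=50))
# ([3, 1], [3, 3])
print(select_elites_sketch(states_batch, actions_batch, rewards_batch, percentile=0))
# ([4, 2, 0, 2, 3, 1], [3, 2, 0, 1, 3, 3])
print(select_elites_sketch(states_batch, actions_batch, [5, 5, 5], percentile=50))
# ([], [])
```

The last call shows the price of the strict comparison: when every session has the same reward, no session is elite, so the caller has to be ready for an empty result.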

In [None]:
import time
from sklearn.neural_network import MLPClassifier
agent = MLPClassifier(hidden_layer_sizes=(20,20),#(experiment with layers size),
                      activation='tanh',
                      warm_start=True, #keep progress between .fit(...) calls
                      max_iter=1 #make only 1 iteration on each .fit(...)
                     )
n_sessions = 100
percentile = 40
log = []
startTime = time.time()
for i in range(10000):
    sessions = [generate_session() for _ in range(n_sessions)]
    states_batch,actions_batch,rewards_batch = zip(*sessions)
    elite_states, elite_actions = select_elites(states_batch,actions_batch,rewards_batch,percentile=percentile)
    agent.fit(elite_states, elite_actions)
    show_progress(rewards_batch,log,percentile)
    if np.array(rewards_batch).mean() >= -150:
        print("Win!")
        break
print("Time:", time.time() - startTime)

In [None]:
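import time
from multiprocessing import Pool
from sklearn.neural_network import MLPClassifier
agent = MLPClassifier(hidden_layer_sizes=(20, 20),
                      activation='tanh',
                      warm_start=True,
                      max_iter=1
                     )
n_sessions = 100
percentile = 40
log = []
startTime = time.time()
def generate_session_parallel(seed):
    # Wrapper with its own name: Pool.map passes one argument per task, and
    # reusing the name generate_session would make the wrapper call itself.
    # Reseed so that forked workers do not replay identical random sessions
    # (env.seed is the old gym API that the rest of this notebook relies on).
    np.random.seed(seed)
    env.seed(int(seed))
    return generate_session()
for i in range(10000):
    seeds = np.random.randint(0, 2**31 - 1, size=n_sessions)
    # The pool is re-created every iteration so that the workers pick up the
    # freshly trained agent (this assumes the "fork" start method, the Linux default).
    with Pool() as pool:
        sessions = pool.map(generate_session_parallel, seeds)
    states_batch, actions_batch, rewards_batch = zip(*sessions)
    elite_states, elite_actions = select_elites(states_batch, actions_batch, rewards_batch, percentile=percentile)
    agent.fit(elite_states, elite_actions)
    show_progress(rewards_batch, log, percentile)
    if np.array(rewards_batch).mean() >= -150:
        print("Win!")
        break
print("Time:", time.time() - startTime)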