# Part III (4 points +)

The frozenlake problem above is just too simple: you can beat it even with a random policy search. Go solve something more complicated.

Pick __one of the two tasks__:

* __FrozenLake8x8-v0__ - frozenlake big brother. Achieve score >0.7
* __Taxi-v1__ - essentially a maze where you get score for moving passengers to their destinations. Achieve score >-100)

Your homework assignment is beating that score (see tips below).


### Some tips:
* When solving those envs, please make sure your t_max is large enough to finish game with suboptimal policy. For example, __Taxi-v0 only trains if you let it play for 10k+ ticks/session__. For frozenlake8x8 it's less dire.
* Random policy search is worth trying as a sanity check, but in general you should expect the genetic algorithm (or anything you devised in it's place) to fare much better that random.
* While _it's okay to adapt the tabs above to your chosen env_, make sure you didn't hard-code any constants there (e.g. 16 states or 4 actions).
* `print_policy` function was built for the frozenlake-v0 env so it will break on any other env. You could simply ignore it or rewrite it for your env.
* in function `sample_reward`, __make sure t_max steps is enough to solve the environment__ even if agent is sometimes acting suboptimally. To estimate that, run several sessions without time limit and measure their length.

# Preparations
**Similar code as before:**

In [103]:
import gym
from gym import wrappers

#create a single game instance
env = gym.make("FrozenLake8x8-v0")
# wrap the code to upload it later
# more at https://gym.openai.com/docs "Recording and uploading results"
env = wrappers.Monitor(env, '/tmp/frozenlake8x8-v0')

#start new game
env.reset();

[2017-03-13 00:50:52,996] Making new env: FrozenLake8x8-v0
[2017-03-13 00:50:53,009] Clearing 1418 monitor files from previous run (because force=True was provided)
[2017-03-13 00:50:53,080] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000000.json


In [104]:
import numpy as np
n_states = env.observation_space.n
n_actions = env.action_space.n
def get_random_policy(pool=[]):
    """
    Build a numpy array representing agent policy.
    This array must have one element per each of 16 environment states.
    Element must be an integer from 0 to 3, representing action
    to take from that state.
    """
    # randint(0, 4, 16) returns an array of integers between 0 and 3 inclusive
    # (or, in other words, starting from 0 to below 4)
    # and the third therm (16), will be the array size
    rand_pol = np.random.randint(0, n_actions, n_states)
    while rand_pol in pool:
        #it will loop until a NEW random policy appear
        print("rand_pol:", rand_pol)
        print("pool: ", pool)
        rand_pol = np.random.randint(0, n_actions, n_states)
        
    return rand_pol

In [105]:
np.random.seed(1234)
policies = [get_random_policy() for i in range(10**4)]
assert all([len(p) == n_states for p in policies]), 'policy length should always be 16'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == n_actions-1, 'maximal action id should match n_actions-1'
action_probas = np.unique(policies, return_counts=True)[-1] /10**4. /n_states
print("Action frequencies over 10^4 samples:",action_probas)
assert np.allclose(action_probas, [1. / n_actions] * n_actions, atol=0.05), "The policies aren't uniformly random (maybe it's just an extremely bad luck)"
print("Seems fine!")

Action frequencies over 10^4 samples: [ 0.24981406  0.25094844  0.24953906  0.24969844]
Seems fine!


### Let's evaluate!
* Implement a simple function that runs one game and returns the total reward

In [106]:
def sample_reward(env, policy, t_max=300):
    """
    Interact with an environment, return sum of all rewards.
    If game doesn't end on t_max (e.g. agent walks into a wall), 
    force end the game and return whatever reward you got so far.
    Tip: see signature of env.step(...) method above.
    """
    #s: state; where our actor is
    s = env.reset()
    total_reward = 0
    t = 0
    is_done = False

    while t < t_max and not is_done:
        # p = policy: with probabilities equal to the ones returned by get_random_policy()
        s, reward, is_done, _ = env.step(policy[s])
        # accumulating rewards
        total_reward += reward
        t+=1
    #s = env.reset()
    return total_reward

In [107]:
print("generating 10^3 sessions...")
rewards = [sample_reward(env,get_random_policy()) for _ in range(10**3)]
assert all([type(r) in (int, float) for r in rewards]), 'sample_reward must return a single number'
assert all([0 <= r <= 1 for r in rewards]), 'total rewards should be between 0 and 1 for frozenlake (if solving taxi, delete this line)'
print("Looks good!")

[2017-03-13 00:50:58,948] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000001.json
[2017-03-13 00:50:58,965] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000008.json
[2017-03-13 00:50:58,998] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000027.json
[2017-03-13 00:50:59,054] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000064.json
[2017-03-13 00:50:59,125] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000125.json
[2017-03-13 00:50:59,237] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000216.json


generating 10^3 sessions...


[2017-03-13 00:50:59,407] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000343.json
[2017-03-13 00:50:59,588] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000512.json
[2017-03-13 00:50:59,821] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video000729.json
[2017-03-13 00:51:00,143] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video001000.json


Looks good!


In [108]:
def evaluate(policy, best_score=0, n_times=20):
    """Run several evaluations and average the score the policy gets."""
    # rewards: array with n_times (100) elements consisting of the total_rewards returned by sample_reward()
    if best_score > 0.99:
        rewards = [sample_reward(env, policy) for n in range(n_times*10)]
    else:
        rewards = [sample_reward(env, policy) for n in range(n_times)]
    return float(np.mean(rewards))

#### Ignoring the random search, jumping right into
# Part II - Genetic Algorithms

In [109]:
def print_policy(policy):
    """a function that displays a policy in a human-readable way."""
    """
    lake = "SFFFFFFF
            FFFFFFFF
            FFFHFFFF
            FFFFFHFF
            FFFHFFFF
            FHHFFFHF
            FHFFHFHF
            FFFHFFFG"
    """
    lake = "SFFFFFFFFFFFFFFFFFFHFFFFFFFFFHFFFFFHFFFFFHHFFFHFFHFFHFHFFFFHFFFG"
    assert env.spec.id == "FrozenLake8x8-v0", "this function only works with frozenlake 8x8"

    
    # where to move from each tile (we're a bit unsure if this is accurate)
    arrows = ['>^v<'[a] for a in policy]
    
    #draw arrows above S and F only
    signs = [arrow if tile in "SF" else tile for arrow, tile in zip(arrows, lake)]
    
    for i in range(0, 64, 8):
        print('   '.join(signs[i:i+8]))

print("random policy:")
print_policy(get_random_policy())

random policy:
v   <   <   >   v   v   ^   ^
^   v   <   <   v   ^   >   v
>   ^   v   H   >   ^   ^   ^
<   v   >   >   <   H   <   <
v   ^   >   H   ^   ^   ^   <
v   H   H   v   v   ^   H   <
<   H   ^   <   H   <   H   ^
<   ^   v   H   ^   v   v   G


In [110]:
def crossover(policy1, policy2, p=0.5):
    """
    for each state, with probability p take action from policy1, else policy2
    """
    # policyx: [0,1,3,2,1,0,3,2,1,0,3,2,0,2,2,1]
    new_pol = []
    for i in range(len(policy1)):
        #choosing between the ith element between pol1 and pol2 with probability p
        new_pol.append(np.random.choice((policy1[i], policy2[i]), p=[p, 1-p]))
        
    return new_pol

In [111]:
def mutation(policy, p=0.1):
    """
    for each state, with probability p replace action with random action
    Tip: mutation can be written as crossover with random policy
    """
    #n_actions = env.action_space.n
    for a in policy:
        # with 10% probability, we mutate element a from policy
        if np.random.choice((0,1), p=[1-p, p]):
            policy[a] = np.random.randint(0, n_actions)
    return policy

In [112]:
np.random.seed(1234)
policies = [crossover(get_random_policy(), get_random_policy()) 
            for i in range(10**4)]

assert all([len(p) == n_states for p in policies]), 'policy length should always be 16'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == n_actions-1, 'maximal action id should be n_actions-1'

assert any([np.mean(crossover(np.zeros(n_states), np.ones(n_states))) not in (0, 1)
               for _ in range(100)]), "Make sure your crossover changes each action independently"
print("Seems fine!")

Seems fine!


In [113]:
n_epochs = 1000 #default: 100 - how many cycles to make
pool_size = 100 #how many policies to maintain
n_crossovers = 50 #how many crossovers to make on each step
n_mutations = 50 #how many mutations to make on each tick
n_low_scorers = 0 #how many random low-scorer policies, not into the best scorers, can pass to the next pool

In [114]:
print("initializing...")
#pool = <spawn a list of pool_size random policies>
pool = [get_random_policy() for i in range(pool_size)]
#pool_scores = <evaluate every policy in the pool, return list of scores>
pool_scores = [evaluate(p) for p in pool]

initializing...


[2017-03-13 00:51:26,873] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video002000.json
[2017-03-13 00:51:28,096] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video003000.json


In [None]:
assert type(pool) == type(pool_scores) == list
assert len(pool) == len(pool_scores) == pool_size
assert all([type(score) in (float, int) for score in pool_scores])

# Diverse Pool

### As with 4.-moar

In [None]:
#main loop

for epoch in range(n_epochs):
    print("Epoch %s:"%epoch)
    
    # 1. Removing duplicates from pool
    # converting list of np arrays to list of tuples because I couldn't find a better way to remove duplicates
    uniques_pool = [tuple(policy) for policy in pool]
    # set() returns only unique elements. We need to convert them to np arrays again
    uniques_pool = [np.asarray(policy) for policy in set(uniques_pool)]
    if len(uniques_pool) != len(pool):
        # We found some duplicates
        print("We found", len(pool) - len(uniques_pool), "duplicated policies at this epoch!")
        pool = uniques_pool
        # we should fill pool with new random policies to keep it size, but with the code below
        # it will be maintain it's size.
    # We could check for duplicates on crossovered or mutated,
    #but it's more similar code and I want to finish this one
    
    # evaluation policies before crossovering them:
    pool_scores = [evaluate(p) for p in pool]
    # we'll select the best n_crossovers (50) policies to mix between them
    # we could use another number of best policies instead of n_crossovers (50),
    # but it's late and I don't know what I'm doing
    selected_indices = np.argsort(pool_scores)[-n_crossovers:]
    
    #crossovered = <crossover random guys from pool, n_crossovers total>
    # using selected_indices as a fraction of pool with 50 best scores
    crossovered = [crossover(pool[np.random.choice(selected_indices)], 
                             pool[np.random.choice(len(pool))]) 
                   for c in range(n_crossovers)]
    # from now on it's all the same: mutations, adding all to a pool, evaluating (again) and selecting
    # best scores. Repeat for n_epochs.
    #mutated = <add several new policies at random, n_mutations total>
    mutated = [mutation(pool[np.random.choice(len(pool))]) 
               for m in range(n_mutations)]
    
    assert type(crossovered) == type(mutated) == list
    
    #add new policies to the pool
    #pool = <add up old population with crossovers/mutations>
    #plus sing (+) concatenates lists in python
    pool = pool + crossovered + mutated
    # evaluate all policies again. sending pool_scores to evaluate more deeply with high scores
    pool_scores = [evaluate(p, pool_scores[-1]) for p in pool]
    
    # 2. Adding a couple of random low scorers to the final pool
    # select pool_size-n_low_scorers best policies. we'll add n_low_scorers later 
    selected_indices = np.argsort(pool_scores)[-pool_size+n_low_scorers:]
    # Now we need to add n_low_scorers to our indices
    # np.argsort(pool_scores)[:-pool_size+n_low_scorers] will contain all indices NOT used abobe
    # so we need to choose n_low_scorers random indices from there
    low_scorers_indices = np.random.choice(np.argsort(pool_scores)[:-pool_size+n_low_scorers], n_low_scorers)
    # now we need to concatenate all indices into one numpy array
    selected_indices = np.concatenate((selected_indices, low_scorers_indices))
    #filling pool only with best values
    pool = [pool[i] for i in selected_indices]
    pool_scores = [pool_scores[i] for i in selected_indices]

    #print the best policy so far (last in ascending score order)
    print("Best score:", pool_scores[-1])
    print_policy(pool[-1])
    print("")

Epoch 0:


[2017-03-13 00:51:29,333] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video004000.json
[2017-03-13 00:51:30,556] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video005000.json
[2017-03-13 00:51:31,868] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video006000.json
[2017-03-13 00:51:33,226] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video007000.json
[2017-03-13 00:51:34,388] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video008000.json
[2017-03-13 00:51:35,487] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video009000.json


Best score: 0.1
^   <   v   <   ^   v   v   ^
v   ^   <   <   <   v   <   ^
>   ^   >   H   v   <   <   >
v   <   v   >   <   H   ^   v
>   ^   >   H   ^   <   <   >
>   H   H   ^   <   ^   H   ^
v   H   >   >   H   <   H   v
^   >   ^   H   >   <   ^   G

Epoch 1:
We found 3 duplicated policies at this epoch!


[2017-03-13 00:51:36,656] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video010000.json
[2017-03-13 00:51:38,026] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video011000.json
[2017-03-13 00:51:39,119] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video012000.json
[2017-03-13 00:51:40,217] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video013000.json
[2017-03-13 00:51:41,502] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video014000.json
[2017-03-13 00:51:42,728] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video015000.json


Best score: 0.15
^   <   v   <   ^   v   v   ^
v   ^   <   <   <   v   <   ^
>   ^   >   H   v   <   <   >
v   <   v   >   <   H   ^   v
>   ^   >   H   ^   <   <   >
>   H   H   ^   <   ^   H   ^
v   H   >   >   H   <   H   v
^   >   ^   H   >   <   ^   G

Epoch 2:
We found 2 duplicated policies at this epoch!


[2017-03-13 00:51:44,061] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video016000.json
[2017-03-13 00:51:45,393] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video017000.json
[2017-03-13 00:51:46,599] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video018000.json
[2017-03-13 00:51:47,769] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video019000.json
[2017-03-13 00:51:48,932] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video020000.json
[2017-03-13 00:51:50,231] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video021000.json


Best score: 0.1
^   <   v   <   ^   v   v   ^
v   ^   <   <   <   v   <   ^
>   ^   >   H   v   <   <   >
v   <   v   >   <   H   ^   v
>   ^   >   H   ^   <   <   >
>   H   H   ^   <   ^   H   ^
v   H   >   >   H   <   H   v
^   >   ^   H   >   <   ^   G

Epoch 3:
We found 2 duplicated policies at this epoch!


[2017-03-13 00:51:51,415] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video022000.json
[2017-03-13 00:51:52,811] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video023000.json
[2017-03-13 00:51:53,968] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video024000.json
[2017-03-13 00:51:55,151] Starting new video recorder writing to /tmp/frozenlake8x8-v0/openaigym.video.3.25318.video025000.json


In [None]:
#from gym import wrappers
#env = gym.make('CartPole-v0')
#env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1')
#for i_episode in range(20):
#    observation = env.reset()
#    for t in range(100):
#        env.render()
#        print(observation)
#        action = env.action_space.sample()
#        observation, reward, done, info = env.step(action)
#        if done:
#            print("Episode finished after {} timesteps".format(t+1))
#            break

In [117]:
env.close()

[2017-03-13 09:53:55,998] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/frozenlake8x8-v0')


In [118]:
gym.upload('/tmp/frozenlake8x8-v0/', api_key='sk_YMIby5ovTVuzK2OejwXRTg')


[2017-03-13 10:03:23,171] [FrozenLake8x8-v0] Uploading 7242109 episodes of training data


APIConnectionError: Unexpected error communicating with OpenAI Gym (while calling post https://s3-us-west-2.amazonaws.com/openai-kubernetes-prod-scoreboard). If
this problem persists, let us know at gym@openai.com.

(Network error: ConnectionError: ('Connection aborted.', timeout('The write operation timed out',)))