### Bonus II (4 points):
* Some environments have continuous state spaces; in fact, most real-world environments do. We will cover methods designed for them later, but you can already solve them by discretizing (binning) the state.
 * Gym has a basic continuous-state environment called [CartPole](https://gym.openai.com/envs/CartPole-v0). Please start with this one. Solving something more challenging is great, but make sure your algorithm beats CartPole first. Also, kudos for submitting.
 * Main idea: if you have something continuous and you want something discrete, split it into bins, the way a histogram does.
 * A good choice of bins is critical!
 * If the dimensionality is too high, you can try to reduce it (PCA/autoencoders).
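
To make the binning idea concrete, here is a minimal sketch for a single continuous value. The interval `[-1.0, 1.0]`, the bin count, and the helper name `to_bin` are assumptions chosen for illustration; they do not come from the environment.

```python
import numpy as np

# Hypothetical example: split the interval [-1.0, 1.0] into 4 equal-width bins.
n_bins = 4
edges = np.linspace(-1.0, 1.0, n_bins + 1)   # 5 edges: -1, -0.5, 0, 0.5, 1
inner_edges = edges[1:-1]                     # the 3 edges that separate the bins

def to_bin(x):
    """Return the bin index (0 .. n_bins-1) of x; out-of-range values land in the edge bins."""
    return int(np.digitize(x, inner_edges))

samples = [-0.9, -0.2, 0.3, 0.99, 5.0]
bin_ids = [to_bin(x) for x in samples]
print(bin_ids)  # [0, 1, 2, 3, 3]
```

Using only the inner edges means values outside the range are absorbed by the first or last bin instead of producing an out-of-range index.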

# Preparations
**Similar code to before:**

In [5]:
import gym
#from gym import wrappers

#create a single game instance
env = gym.make("CartPole-v0")
#env = wrappers.Monitor(env, '/tmp/frozenlake8x8-v0')

#start new game
env.reset();

[2017-03-13 00:42:20,808] Making new env: CartPole-v0


In [6]:
import numpy as np
n_states = 30
n_space_high = env.observation_space.high
n_space_low = env.observation_space.low
# we should discretize this space
# NOTE: this gives n_states evenly spaced points between the bounds of each dimension;
# it is not yet used to map an observation to a state index
discrete_states = np.linspace(n_space_low, n_space_high, n_states)
n_actions = env.action_space.n
def get_random_policy(pool=None):
    """
    Build a numpy array representing agent policy.
    This array must have one element per each of the n_states discretized states.
    Element must be an integer from 0 to n_actions-1, representing action
    to take from that state.
    If pool is given, resample until the policy is not already in pool.
    """
    # avoid a mutable default argument
    if pool is None:
        pool = []
    # randint(0, n_actions, n_states) returns an array of n_states integers
    # from 0 to n_actions-1 inclusive
    rand_pol = np.random.randint(0, n_actions, n_states)
    # compare arrays explicitly: `rand_pol in pool` is ambiguous for numpy arrays
    while any(np.array_equal(rand_pol, p) for p in pool):
        # loop until a NEW random policy appears
        rand_pol = np.random.randint(0, n_actions, n_states)
        
    return rand_pol

In [7]:
np.random.seed(1234)
policies = [get_random_policy() for i in range(10**4)]
assert all([len(p) == n_states for p in policies]), 'policy length should always be n_states'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == n_actions-1, 'maximal action id should match n_actions-1'
action_probas = np.unique(policies, return_counts=True)[-1] /10**4. /n_states
print("Action frequencies over 10^4 samples:",action_probas)
assert np.allclose(action_probas, [1. / n_actions] * n_actions, atol=0.05), "The policies aren't uniformly random (maybe it's just an extremely bad luck)"
print("Seems fine!")

Action frequencies over 10^4 samples: [ 0.49989  0.50011]
Seems fine!


### Let's evaluate!
* Implement a simple function that runs one game and returns the total reward

In [10]:
def sample_reward(env, policy, t_max=20000):
    """
    Interact with an environment, return sum of all rewards.
    If game doesn't end on t_max (e.g. agent walks into a wall), 
    force end the game and return whatever reward you got so far.
    Tip: see signature of env.step(...) method above.
    """
    #s: state; where our actor is
    s = env.reset()
    (pos, cart_v, pole_angle, pole_v) = s
    total_reward = 0
    t = 0
    is_done = False
    
    while t < t_max and not is_done:
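        # look up the action for the current state in the policy.
        # NOTE: s is a 4-element float array here, so policy[s] is not a valid integer index;
        # the observation must first be mapped to an integer state index
        s, reward, is_done, _ = env.step(policy[s])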
        (pos, cart_v, pole_angle, pole_v) = s
        print("s: ",s)
        # accumulating rewards
        total_reward += reward
        t+=1
    #s = env.reset()
    return total_reward

In [11]:
print("generating 10^3 sessions...")
rewards = [sample_reward(env,get_random_policy()) for _ in range(10**3)]
assert all([type(r) in (int, float) for r in rewards]), 'sample_reward must return a single number'
#assert all([0 <= r <= 1 for r in rewards]), 'total rewards should be between 0 and 1 for frozenlake (if solving taxi, delete this line)'
print("Looks good!")

generating 10^3 sessions...


IndexError: arrays used as indices must be of integer (or boolean) type
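
The `IndexError` above arises because `env.reset()` and `env.step()` return a 4-element float array, and `policy[s]` tries to use that array as an index. One possible fix, sketched below, is to bin each component and combine the four bin indices into a single integer. The finite ranges, the bin counts, and the helper name `obs_to_state` are assumptions for illustration, not values taken from the environment. Note that the policy then needs one entry per combined state (the product of the bin counts) rather than the 30 entries used earlier.

```python
import numpy as np

# Assumed finite ranges for (cart position, cart velocity, pole angle, pole angular velocity).
# The velocity limits in particular are illustrative guesses and may need tuning.
obs_low = np.array([-2.4, -3.0, -0.21, -3.0])
obs_high = np.array([2.4, 3.0, 0.21, 3.0])
bins_per_dim = (3, 3, 6, 3)                      # assumed bin counts per dimension
n_discrete_states = int(np.prod(bins_per_dim))   # 3 * 3 * 6 * 3 = 162

# Inner bin edges for each dimension (n - 1 edges separate n bins).
inner_edges = [np.linspace(lo, hi, n + 1)[1:-1]
               for lo, hi, n in zip(obs_low, obs_high, bins_per_dim)]

def obs_to_state(obs):
    """Map a 4-element float observation to one integer in [0, n_discrete_states)."""
    per_dim = tuple(int(np.digitize(x, edges)) for x, edges in zip(obs, inner_edges))
    return int(np.ravel_multi_index(per_dim, bins_per_dim))

example_obs = np.array([0.01, -0.5, 0.02, 4.2])        # made-up observation
state = obs_to_state(example_obs)
policy = np.random.randint(0, 2, n_discrete_states)    # one action (0 or 1) per discrete state
action = policy[state]                                 # now a valid integer lookup
```

With such a helper, the lookup inside the episode loop would become `policy[obs_to_state(s)]`.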

In [124]:
def evaluate(policy, n_times=10):
    """Run several evaluations and average the score the policy gets."""
    # rewards: list with n_times elements, each the total reward returned by sample_reward()
    rewards = [sample_reward(env, policy) for n in range(n_times)]
    #print("rewards: ", rewards)
    return float(np.mean(rewards))

#### Ignoring the random search, jumping right into
# Part II - Genetic Algorithms

In [125]:
def print_policy(policy):
    pass

print("random policy:")
print_policy(get_random_policy())

random policy:
< : d | d : v : p
p : < : p : d : p
v : p : d : p : p
< | < : > | < : ^
< | ^ : d | < : p


In [126]:
def crossover(policy1, policy2, p=0.5):
    """
    for each state, with probability p take action from policy1, else policy2
    """
    # policyx: [0,1,3,2,1,0,3,2,1,0,3,2,0,2,2,1]
    new_pol = []
    for i in range(len(policy1)):
        #choosing between the ith element between pol1 and pol2 with probability p
        new_pol.append(np.random.choice((policy1[i], policy2[i]), p=[p, 1-p]))
        
    return new_pol
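
As an alternative to the per-element loop, the same rule (take each action from `policy1` with probability `p`, otherwise from `policy2`) can be written with array operations. This is a sketch of an equivalent approach, not a replacement for the function above; the name `crossover_vectorized` is hypothetical, and it returns a numpy array rather than a list.

```python
import numpy as np

def crossover_vectorized(policy1, policy2, p=0.5):
    """For each state, take the action from policy1 with probability p, else from policy2."""
    policy1 = np.asarray(policy1)
    policy2 = np.asarray(policy2)
    # one independent coin flip per state
    take_first = np.random.random(len(policy1)) < p
    return np.where(take_first, policy1, policy2)

np.random.seed(0)
child = crossover_vectorized(np.zeros(30, dtype=int), np.ones(30, dtype=int))
```

Because every state gets its own coin flip, the child mixes both parents independently per position, which is what the later "changes each action independently" assertion checks for.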

In [127]:
def mutation(policy, p=0.01):
    """
    for each state, with probability p replace action with random action
    Tip: mutation can be written as crossover with random policy
    """
    # work on a copy so that the caller's policy is not modified in place
    mutated_policy = list(policy)
    #n_actions = env.action_space.n
    # iterate over state indices, not over the action values stored in the policy
    for i in range(len(policy)):
        # with probability p, replace the action for state i with a random one
        if np.random.choice((0,1), p=[1-p, p]):
            mutated_policy[i] = np.random.randint(0, n_actions)
    return mutated_policy
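
The tip in the docstring (mutation as crossover with a random policy) can be sketched as follows. The function name `mutation_via_crossover` and the explicit `n_actions` parameter are assumptions made so that the snippet is self-contained.

```python
import numpy as np

def mutation_via_crossover(policy, n_actions, p=0.01):
    """With probability p per state, replace the action with one from a fresh random policy."""
    policy = np.asarray(policy)
    # a completely random policy of the same length
    random_policy = np.random.randint(0, n_actions, len(policy))
    # per-state coin flip: True means "take the random action"
    use_random = np.random.random(len(policy)) < p
    return np.where(use_random, random_policy, policy)

np.random.seed(0)
original = np.zeros(30, dtype=int)
mutated = mutation_via_crossover(original, n_actions=2, p=0.1)
```

The input array is left untouched, since `np.where` builds a new array.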

In [133]:
np.random.seed(1234)
policies = [crossover(get_random_policy(), get_random_policy()) 
            for i in range(10**2)]

assert all([len(p) == n_states for p in policies]), 'policy length should always be n_states'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == n_actions-1, 'maximal action id should be n_actions-1'

assert any([np.mean(crossover(np.zeros(n_states), np.ones(n_states))) not in (0, 1)
               for _ in range(100)]), "Make sure your crossover changes each action independently"
print("Seems fine!")

Seems fine!


In [134]:
n_epochs = 1000 #default: 100 - how many cycles to make
pool_size = 100 #how many policies to maintain
n_crossovers = 100 #how many crossovers to make on each step
n_mutations = 200 #how many mutations to make on each tick
n_low_scorers = 0 #how many randomly chosen policies from outside the top scorers pass to the next pool

In [135]:
print("initializing...")
#pool = <spawn a list of pool_size random policies>
pool = [get_random_policy() for i in range(pool_size)]
#pool_scores = <evaluate every policy in the pool, return list of scores>
pool_scores = [evaluate(p) for p in pool]

initializing...


In [136]:
assert type(pool) == type(pool_scores) == list
assert len(pool) == len(pool_scores) == pool_size
assert all([type(score) in (float, int) for score in pool_scores])

# Diverse Pool

### As with 4.-moar

In [137]:
#main loop
from tqdm import tqdm
#for epoch in range(n_epochs):
for epoch in tqdm(range(n_epochs)):
    print("Epoch %s:"%epoch)
    
    # 1. Removing duplicates from pool
    # convert each policy to a tuple so that it is hashable and set() can drop duplicates
    uniques_pool = [tuple(policy) for policy in pool]
    # set() keeps only unique elements; convert them back to np arrays
    uniques_pool = [np.asarray(policy) for policy in set(uniques_pool)]
    if len(uniques_pool) != len(pool):
        # We found some duplicates
        print("We found", len(pool) - len(uniques_pool), "duplicated policies at this epoch!")
        pool = uniques_pool
        # the pool is now smaller than pool_size; the selection step at the end
        # of the epoch restores its size, so no refill is needed here
    # Crossovered and mutated policies are not checked for duplicates here;
    # any that appear are removed at the start of the next epoch
    
    # evaluate the policies before crossing them over
    pool_scores = [evaluate(p) for p in pool]
    # select the n_crossovers best-scoring policies to use as crossover parents
    # (a different number of parents could be used instead of n_crossovers)
    selected_indices = np.argsort(pool_scores)[-n_crossovers:]
    
    #crossovered = <crossover random guys from pool, n_crossovers total>
    # one parent comes from the best scorers (selected_indices), the other from the whole pool
    crossovered = [crossover(pool[np.random.choice(selected_indices)], 
                             pool[np.random.choice(len(pool))]) 
                   for c in range(n_crossovers)]
    # from here on it is the same as before: mutate, add everything to the pool,
    # evaluate again and keep the best scores. Repeat for n_epochs.
    #mutated = <add several new policies at random, n_mutations total>
    mutated = [mutation(pool[np.random.choice(len(pool))]) 
               for m in range(n_mutations)]
    
    assert type(crossovered) == type(mutated) == list
    
    #add new policies to the pool
    #pool = <add up old population with crossovers/mutations>
    # the plus sign (+) concatenates lists in Python
    pool = pool + crossovered + mutated
    #pool_scores = <evaluate all policies again>
    pool_scores = [evaluate(p) for p in pool]
    
    # 2. Adding a couple of random low scorers to the final pool
    # select the pool_size-n_low_scorers best policies; n_low_scorers more are added below
    selected_indices = np.argsort(pool_scores)[-pool_size+n_low_scorers:]
    # np.argsort(pool_scores)[:-pool_size+n_low_scorers] contains all indices NOT selected above,
    # so we choose n_low_scorers random indices from there
    low_scorers_indices = np.random.choice(np.argsort(pool_scores)[:-pool_size+n_low_scorers], n_low_scorers)
    # concatenate all indices into one numpy array
    selected_indices = np.concatenate((selected_indices, low_scorers_indices))
    # keep only the selected policies and their scores
    pool = [pool[i] for i in selected_indices]
    pool_scores = [pool_scores[i] for i in selected_indices]

    #print the best policy so far (last in ascending score order)
    print("Best score:", pool_scores[-1])
    print_policy(pool[-1])
    print("")
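
The selection step of the loop above can also be isolated into a small helper, which makes two edge cases easier to handle. This is a sketch under assumptions: the name `select_survivors` is hypothetical, the random low scorers are drawn without repeats, and they are placed before the best scorers so that the last index always points at the best policy (in the loop above they are appended after, so `pool[-1]` is only the best policy while `n_low_scorers` is 0).

```python
import numpy as np

def select_survivors(scores, pool_size, n_low_scorers=0):
    """Return indices of the policies to keep, with the best-scoring policy last."""
    order = np.argsort(scores)                 # ascending: worst ... best
    n_best = pool_size - n_low_scorers
    split = len(order) - n_best                # explicit split point avoids the order[-0:] pitfall
    best = order[split:]
    rest = order[:split]
    if n_low_scorers > 0:
        # pick some of the remaining policies at random, without repeats
        lucky = np.random.choice(rest, size=n_low_scorers, replace=False)
    else:
        lucky = order[:0]                      # empty index array
    return np.concatenate((lucky, best))

np.random.seed(0)
scores = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0]        # made-up scores for six policies
keep = select_survivors(scores, pool_size=3, n_low_scorers=1)
top_only = select_survivors(scores, pool_size=3)
```

Usage in the loop would then be `pool = [pool[i] for i in keep]`, with the same indexing applied to `pool_scores`.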


  0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 0:



  0%|          | 1/1000 [00:24<6:53:36, 24.84s/it]

Best score: -200.0
^ : p | d : > : >
> : v : v : v : p
d : v : > : p : p
d | p : d | < : v
p | d : v | < : ^

Epoch 1:
We found 1 duplicated policies at this epoch!


  0%|          | 2/1000 [00:49<6:52:41, 24.81s/it]

Best score: -380.0
^ : ^ | v : < : d
> : v : < : d : ^
d : > : d : d : >
< | p : < | < : <
^ | < : v | d : v

Epoch 2:


  0%|          | 3/1000 [01:13<6:49:47, 24.66s/it]

Best score: -379.1
< : > | < : v : <
> : p : d : v : v
d : v : p : p : <
d | v : d | < : >
^ | d : p | < : ^

Epoch 3:


  0%|          | 4/1000 [01:38<6:47:01, 24.52s/it]

Best score: -378.2
> : > | ^ : < : >
^ : > : d : v : v
^ : p : < : p : ^
> | < : d | p : >
^ | p : p | d : v

Epoch 4:


  0%|          | 5/1000 [02:02<6:44:09, 24.37s/it]

Best score: -378.2
d : < | p : v : d
> : > : > : p : v
p : p : v : v : <
^ | v : > | > : v
> | ^ : p | < : <

Epoch 5:
We found 2 duplicated policies at this epoch!


  1%|          | 6/1000 [02:26<6:42:15, 24.28s/it]

Best score: -379.1
p : ^ | > : > : ^
v : < : ^ : p : p
< : p : > : v : <
v | v : < | > : v
> | ^ : < | d : p

Epoch 6:


  1%|          | 7/1000 [02:49<6:38:45, 24.09s/it]

Best score: -378.2
v : p | < : v : d
d : ^ : d : p : ^
^ : p : < : > : v
p | < : p | ^ : p
d | p : > | ^ : v

Epoch 7:
We found 1 duplicated policies at this epoch!


  1%|          | 8/1000 [03:13<6:37:17, 24.03s/it]

Best score: -200.0
v : < | v : v : d
< : > : > : ^ : ^
^ : v : > : < : ^
> | p : v | ^ : >
d | > : v | > : d

Epoch 8:


  1%|          | 9/1000 [03:37<6:35:58, 23.97s/it]

Best score: -378.2
d : < | d : v : d
d : ^ : d : p : ^
^ : p : < : > : v
p | < : p | ^ : p
d | p : > | ^ : v

Epoch 9:


  1%|          | 10/1000 [04:01<6:34:00, 23.88s/it]

Best score: -200.0
d : d | ^ : ^ : <
v : v : ^ : ^ : <
^ : p : ^ : ^ : >
v | < : d | ^ : v
d | > : ^ | < : <

Epoch 10:
We found 2 duplicated policies at this epoch!


  1%|          | 11/1000 [04:25<6:33:49, 23.89s/it]

Best score: -200.0
< : d | < : p : d
^ : < : > : p : ^
> : p : < : p : v
> | < : p | d : >
d | p : < | v : ^

Epoch 11:


  1%|          | 12/1000 [04:49<6:34:42, 23.97s/it]

Best score: -379.1
d : < | p : < : d
^ : > : d : ^ : ^
p : p : > : > : ^
> | p : p | < : >
d | > : < | > : v

Epoch 12:
We found 2 duplicated policies at this epoch!


  1%|▏         | 13/1000 [05:13<6:33:48, 23.94s/it]

Best score: -200.0
> : v | ^ : v : <
^ : > : d : p : ^
d : v : d : p : ^
< | < : v | d : >
d | p : p | d : v

Epoch 13:


  1%|▏         | 14/1000 [05:37<6:33:02, 23.92s/it]

Best score: -200.0
p : p | d : < : <
> : < : d : ^ : ^
> : p : < : p : ^
^ | v : v | d : >
^ | p : p | ^ : ^

Epoch 14:
We found 1 duplicated policies at this epoch!


  2%|▏         | 15/1000 [06:00<6:31:08, 23.83s/it]

Best score: -376.4
^ : < | d : v : >
v : > : > : > : v
^ : p : < : v : ^
< | < : v | < : >
^ | > : p | d : d

Epoch 15:


  2%|▏         | 16/1000 [06:24<6:28:33, 23.69s/it]

Best score: -200.0
> : p | < : < : ^
< : > : > : > : v
^ : v : > : p : v
^ | p : > | d : <
d | < : < | < : d

Epoch 16:
We found 1 duplicated policies at this epoch!


  2%|▏         | 17/1000 [06:47<6:27:49, 23.67s/it]

Best score: -200.0
> : ^ | < : < : ^
< : < : > : > : v
^ : v : > : v : p
> | v : > | d : <
d | < : < | < : d

Epoch 17:


  2%|▏         | 18/1000 [07:11<6:26:42, 23.63s/it]

Best score: -200.0
v : v | < : v : >
v : > : > : > : v
^ : p : < : v : ^
< | < : v | < : >
^ | > : p | d : d

Epoch 18:


  2%|▏         | 19/1000 [07:34<6:25:08, 23.56s/it]

Best score: -200.0
p : ^ | d : v : d
^ : > : d : p : ^
p : p : < : > : ^
> | < : d | ^ : >
d | p : p | d : v

Epoch 19:


  2%|▏         | 20/1000 [07:58<6:24:45, 23.56s/it]

Best score: -200.0
^ : > | > : ^ : <
> : > : d : p : v
d : v : d : < : ^
< | < : v | < : >
^ | p : p | d : d

Epoch 20:
We found 1 duplicated policies at this epoch!


  2%|▏         | 21/1000 [08:21<6:24:31, 23.57s/it]

Best score: -200.0
^ : < | v : v : >
v : > : > : > : v
^ : p : < : v : ^
< | < : v | < : >
^ | > : p | d : d

Epoch 21:
We found 1 duplicated policies at this epoch!


  2%|▏         | 22/1000 [08:46<6:29:10, 23.88s/it]

Best score: -377.3
< : d | > : < : ^
< : > : > : > : v
^ : v : > : < : v
> | p : v | d : <
d | < : < | v : d

Epoch 22:
We found 1 duplicated policies at this epoch!


  2%|▏         | 23/1000 [09:10<6:31:26, 24.04s/it]

Best score: -200.0
p : v | v : v : <
> : > : d : > : p
d : v : d : v : ^
< | < : v | d : >
^ | > : p | d : d

Epoch 23:


  2%|▏         | 24/1000 [09:35<6:33:01, 24.16s/it]

Best score: -200.0
> : p | p : < : >
> : > : > : > : v
^ : p : ^ : < : ^
> | < : v | d : <
^ | p : p | v : d

Epoch 24:


  2%|▎         | 25/1000 [09:59<6:33:34, 24.22s/it]

Best score: -200.0
p : p | < : < : ^
< : > : > : > : v
^ : v : > : p : v
^ | p : > | d : <
d | < : < | < : d

Epoch 25:
We found 1 duplicated policies at this epoch!


  3%|▎         | 26/1000 [10:23<6:33:51, 24.26s/it]

Best score: -200.0
> : > | d : < : >
> : > : d : > : p
d : p : > : < : v
> | < : v | d : >
^ | p : < | d : d

Epoch 26:
We found 1 duplicated policies at this epoch!


  3%|▎         | 27/1000 [10:48<6:33:15, 24.25s/it]

Best score: -200.0
< : ^ | v : v : ^
^ : > : > : > : p
> : p : > : v : ^
< | < : > | < : <
d | < : < | > : d

Epoch 27:


  3%|▎         | 28/1000 [11:12<6:33:28, 24.29s/it]

Best score: -200.0
p : ^ | d : < : >
< : > : > : > : v
^ : p : d : < : ^
> | < : v | < : <
^ | < : p | v : v

Epoch 28:


  3%|▎         | 29/1000 [11:36<6:33:11, 24.30s/it]

Best score: -200.0
< : p | d : < : >
> : > : > : > : v
^ : v : > : < : v
> | < : v | d : <
^ | < : < | v : d

Epoch 29:


  3%|▎         | 30/1000 [12:00<6:30:10, 24.13s/it]

Best score: -200.0
d : ^ | ^ : < : ^
> : > : > : > : v
d : p : > : v : v
> | p : v | d : <
d | p : < | d : d

Epoch 30:
We found 2 duplicated policies at this epoch!


  3%|▎         | 31/1000 [12:24<6:27:28, 23.99s/it]

Best score: -200.0
< : < | ^ : v : >
^ : > : > : > : v
> : p : < : < : ^
< | < : v | < : <
d | < : p | > : d

Epoch 31:
We found 1 duplicated policies at this epoch!


  3%|▎         | 32/1000 [12:48<6:26:21, 23.95s/it]

Best score: -200.0
v : p | p : v : ^
> : > : > : > : p
^ : v : < : < : ^
< | < : v | d : <
^ | < : p | v : d

Epoch 32:


  3%|▎         | 33/1000 [13:11<6:25:33, 23.92s/it]

Best score: -200.0
v : < | d : < : ^
< : > : d : > : p
^ : p : > : p : ^
> | < : v | < : >
^ | p : p | v : d

Epoch 33:


  3%|▎         | 34/1000 [13:35<6:23:38, 23.83s/it]

Best score: -200.0
v : > | d : < : >
> : > : d : ^ : v
^ : v : > : < : v
> | < : v | < : <
^ | p : p | d : d

Epoch 34:


  4%|▎         | 35/1000 [13:59<6:22:48, 23.80s/it]

Best score: -200.0
d : v | v : < : ^
< : > : > : p : v
^ : v : < : < : v
> | < : v | d : >
^ | < : p | v : d

Epoch 35:


  4%|▎         | 36/1000 [14:23<6:23:32, 23.87s/it]

Best score: -200.0
p : d | ^ : < : >
> : > : > : ^ : v
^ : p : > : v : ^
< | < : v | < : <
^ | p : p | d : d

Epoch 36:


  4%|▎         | 37/1000 [14:46<6:21:00, 23.74s/it]

Best score: -200.0
> : < | ^ : < : ^
> : > : > : p : v
^ : v : d : < : ^
> | < : v | d : <
^ | < : p | v : d

Epoch 37:


  4%|▍         | 38/1000 [15:10<6:19:36, 23.68s/it]

Best score: -200.0
< : v | v : < : >
^ : > : > : > : v
^ : v : > : p : ^
> | < : v | d : >
^ | > : < | v : d

Epoch 38:


  4%|▍         | 39/1000 [15:33<6:19:10, 23.67s/it]

Best score: -200.0
v : > | ^ : < : ^
< : > : > : p : v
^ : v : < : < : v
> | < : v | d : <
^ | < : p | v : d

Epoch 39:


  4%|▍         | 40/1000 [15:57<6:18:42, 23.67s/it]

Best score: -200.0
v : d | d : < : ^
^ : > : > : > : v
^ : v : d : < : v
> | < : v | d : <
d | < : p | v : d

Epoch 40:


  4%|▍         | 41/1000 [16:21<6:20:22, 23.80s/it]

Best score: -200.0
< : d | < : < : >
< : > : > : p : p
^ : v : > : < : v
> | < : v | d : >
^ | < : p | v : d

Epoch 41:


  4%|▍         | 42/1000 [16:45<6:20:42, 23.84s/it]

Best score: -200.0
> : > | v : < : >
^ : > : > : > : v
^ : v : > : p : ^
> | < : v | d : >
^ | > : < | v : d

Epoch 42:


  4%|▍         | 43/1000 [17:09<6:21:16, 23.90s/it]

Best score: -200.0
v : d | d : v : ^
> : > : > : > : v
^ : v : d : < : v
> | < : v | d : <
d | < : p | v : d

Epoch 43:


  4%|▍         | 44/1000 [17:33<6:19:15, 23.80s/it]

Best score: -200.0
p : v | d : < : ^
^ : > : > : > : v
^ : p : > : < : ^
> | < : v | ^ : <
^ | p : p | d : d

Epoch 44:


  4%|▍         | 45/1000 [17:57<6:19:17, 23.83s/it]

Best score: -200.0
> : p | p : < : >
> : > : > : > : p
^ : v : < : p : ^
> | < : v | d : >
^ | p : < | v : d

Epoch 45:


  5%|▍         | 46/1000 [18:21<6:20:44, 23.95s/it]

Best score: -200.0
^ : v | p : < : >
^ : > : > : > : v
^ : p : > : p : v
> | < : p | < : >
^ | > : < | v : d

Epoch 46:
We found 1 duplicated policies at this epoch!


  5%|▍         | 47/1000 [18:46<6:24:47, 24.23s/it]

Best score: -200.0
> : d | < : < : >
^ : > : > : > : v
^ : v : > : p : ^
> | < : v | d : >
^ | > : < | v : d

Epoch 47:


  5%|▍         | 48/1000 [19:11<6:26:55, 24.39s/it]

Best score: -200.0
< : > | ^ : < : ^
> : > : > : > : v
^ : v : < : < : v
< | < : v | d : <
^ | > : p | v : d

Epoch 48:
We found 1 duplicated policies at this epoch!


  5%|▍         | 49/1000 [19:35<6:27:33, 24.45s/it]

Best score: -200.0
v : ^ | ^ : < : ^
^ : > : > : > : v
^ : v : < : < : ^
> | < : v | d : >
^ | p : < | v : d

Epoch 49:


  5%|▌         | 50/1000 [19:59<6:25:19, 24.34s/it]

Best score: -200.0
d : ^ | v : < : >
^ : > : > : p : v
d : p : > : < : v
< | < : v | < : >
d | > : p | v : d

Epoch 50:
We found 1 duplicated policies at this epoch!


  5%|▌         | 51/1000 [20:24<6:25:12, 24.35s/it]

Best score: -200.0
< : d | p : < : >
> : > : > : p : v
d : v : < : < : v
> | < : v | ^ : <
^ | < : < | v : d

Epoch 51:


  5%|▌         | 52/1000 [20:48<6:23:27, 24.27s/it]

Best score: -200.0
p : > | > : < : >
^ : > : > : > : v
^ : v : > : < : v
> | < : v | d : >
d | > : < | v : d

Epoch 52:
We found 1 duplicated policies at this epoch!


  5%|▌         | 53/1000 [21:12<6:25:13, 24.41s/it]

Best score: -200.0
> : p | > : < : >
^ : > : > : > : v
^ : v : < : < : v
> | < : v | d : >
d | > : p | v : d

Epoch 53:
We found 1 duplicated policies at this epoch!


  5%|▌         | 54/1000 [21:37<6:26:17, 24.50s/it]

Best score: -200.0
> : > | < : v : >
^ : > : > : p : v
^ : p : > : p : v
< | < : v | d : >
d | < : < | v : d

Epoch 54:


  6%|▌         | 55/1000 [22:01<6:24:49, 24.43s/it]

Best score: -200.0
p : > | > : < : >
> : > : > : > : p
^ : v : > : p : ^
> | < : v | d : >
d | > : < | v : d

Epoch 55:
We found 2 duplicated policies at this epoch!


  6%|▌         | 56/1000 [22:26<6:23:25, 24.37s/it]

Best score: -200.0
< : ^ | ^ : < : >
^ : > : > : > : p
^ : v : > : < : v
> | < : v | d : >
d | > : p | v : d

Epoch 56:
We found 1 duplicated policies at this epoch!


  6%|▌         | 57/1000 [22:51<6:26:59, 24.62s/it]

Best score: -200.0
p : > | > : < : >
> : > : > : > : v
^ : v : < : p : v
> | < : v | d : >
^ | > : p | v : d

Epoch 57:
We found 1 duplicated policies at this epoch!


  6%|▌         | 58/1000 [23:15<6:25:56, 24.58s/it]

Best score: -200.0
d : p | p : < : >
> : > : > : > : p
^ : v : < : p : v
> | < : v | d : >
^ | > : < | v : d

Epoch 58:


  6%|▌         | 59/1000 [23:39<6:22:47, 24.41s/it]

Best score: -200.0
> : > | < : v : >
> : > : > : > : v
d : p : < : p : ^
> | < : p | d : <
^ | p : < | d : d

Epoch 59:


  6%|▌         | 60/1000 [24:03<6:18:43, 24.17s/it]

Best score: -200.0
v : > | p : < : >
> : > : > : > : v
^ : v : < : p : v
< | < : v | d : >
^ | > : < | v : d

Epoch 60:


  6%|▌         | 61/1000 [24:27<6:16:43, 24.07s/it]

Best score: -200.0
v : ^ | v : < : >
^ : > : > : > : p
^ : v : > : < : ^
> | < : v | d : >
^ | > : < | v : d

Epoch 61:


KeyboardInterrupt: 

In [None]:
#from gym import wrappers
#env = gym.make('CartPole-v0')
#env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1')
#for i_episode in range(20):
#    observation = env.reset()
#    for t in range(100):
#        env.render()
#        print(observation)
#        action = env.action_space.sample()
#        observation, reward, done, info = env.step(action)
#        if done:
#            print("Episode finished after {} timesteps".format(t+1))
#            break