## \* This notebook is a continuation of *[FrozenLake.ipynb](./frozenlake.ipynb)*
## \* The first part of the code is all the same
## \* Starting *@[moar](#moar)* you'll find whats changed

# moar

The parameters of the genetic algorithm aren't optimal, try to find something better. (size, crossovers and mutations)

Try alternative crossover and mutation strategies

* prioritize crossover for higher-scorers?

# ·  try to select a more diverse pool, not just best scorers?  (we'll do this one here)
* Just tune the f*cking probabilities.

See which combination works best!

### Quick explanation

We'll do the same best scores to crossover as before, but we'll change two things while selecting best scorers:
1. The policies need to be unique.
2. With some probability, we'll select a random policy in the final pool but not into the best scorers. (this didn't worked so well. I just keep the code as experimentation.)
3. mutations will not change the actual policy selected, so we'll have a policy x and its mutated policy y, and keep both of them

Jump to [Diverse Pool, where we change the code to do that (click here)](#Diverse-Pool)

# Preparations
**Similar code as before:**

In [1]:
import gym

#create a single game instance
env = gym.make("FrozenLake-v0")

#start new game
env.reset();

[2017-03-12 20:42:37,853] Making new env: FrozenLake-v0


We'll modify get_random_policy() to return not repeated policies in the pool.

If no pool is passed to this function, it will return any random policy, as before.

**NOTE**: We'll NOT use this in this notebook (maybe never). I find it very ugly to do this in this way. It requires to modify lot of code, so I'll just remove duplicates in the final pool. Maybe it's not the best, but it's easier to understand and the final product will be (probably) very similar.

In [2]:
import numpy as np
n_states = env.observation_space.n
n_actions = env.action_space.n
def get_random_policy(pool=[]):
    """
    Build a numpy array representing agent policy.
    This array must have one element per each of 16 environment states.
    Element must be an integer from 0 to 3, representing action
    to take from that state.
    """
    # randint(0, 4, 16) returns an array of integers between 0 and 3 inclusive
    # (or, in other words, starting from 0 to below 4)
    # and the third therm (16), will be the array size
    rand_pol = np.random.randint(0, n_actions, n_states)
    while rand_pol in pool:
        #it will loop until a NEW random policy appear
        print("rand_pol:", rand_pol)
        print("pool: ", pool)
        rand_pol = np.random.randint(0, n_actions, n_states)
        
    return rand_pol

In [3]:
np.random.seed(1234)
policies = [get_random_policy() for i in range(10**4)]
assert all([len(p) == n_states for p in policies]), 'policy length should always be 16'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == n_actions-1, 'maximal action id should match n_actions-1'
action_probas = np.unique(policies, return_counts=True)[-1] /10**4. /n_states
print("Action frequencies over 10^4 samples:",action_probas)
assert np.allclose(action_probas, [1. / n_actions] * n_actions, atol=0.05), "The policies aren't uniformly random (maybe it's just an extremely bad luck)"
print("Seems fine!")

Action frequencies over 10^4 samples: [ 0.25014375  0.25130625  0.2495375   0.2490125 ]
Seems fine!


### Let's evaluate!
* Implement a simple function that runs one game and returns the total reward

In [4]:
def sample_reward(env, policy, t_max=100):
    """
    Interact with an environment, return sum of all rewards.
    If game doesn't end on t_max (e.g. agent walks into a wall), 
    force end the game and return whatever reward you got so far.
    Tip: see signature of env.step(...) method above.
    """
    #s: state; where our actor is
    s = env.reset()
    total_reward = 0

    for t in range(t_max):
        # p = policy: with probabilities equal to the ones returned by get_random_policy()
        s, reward, is_done, _ = env.step(policy[s])
        # accumulating rewards
        total_reward += reward
    return total_reward

In [5]:
print("generating 10^3 sessions...")
rewards = [sample_reward(env,get_random_policy()) for _ in range(10**3)]
assert all([type(r) in (int, float) for r in rewards]), 'sample_reward must return a single number'
assert all([0 <= r <= 1 for r in rewards]), 'total rewards should be between 0 and 1 for frozenlake (if solving taxi, delete this line)'
print("Looks good!")

generating 10^3 sessions...
Looks good!


In [6]:
def evaluate(policy, n_times=100):
    """Run several evaluations and average the score the policy gets."""
    # rewards: array with n_times (100) elements consisting of the total_rewards returned by sample_reward()
    rewards = [sample_reward(env, policy) for n in range(n_times)]
    return float(np.mean(rewards))

#### Ignoring the random search, jumping right into
# Part II - Genetic Algorithms

In [7]:
def print_policy(policy):
    """a function that displays a policy in a human-readable way."""
    lake = "SFFFFHFHFFFHHFFG"
    assert env.spec.id == "FrozenLake-v0", "this function only works with frozenlake 4x4"

    
    # where to move from each tile (we're a bit unsure if this is accurate)
    arrows = ['>^v<'[a] for a in policy]
    
    #draw arrows above S and F only
    signs = [arrow if tile in "SF" else tile for arrow, tile in zip(arrows, lake)]
    
    for i in range(0, 16, 4):
        print(' '.join(signs[i:i+4]))

print("random policy:")
print_policy(get_random_policy())

random policy:
v > > v
^ H v H
v < v H
H v > G


In [8]:
def crossover(policy1, policy2, p=0.5):
    """
    for each state, with probability p take action from policy1, else policy2
    """
    # policyx: [0,1,3,2,1,0,3,2,1,0,3,2,0,2,2,1]
    new_pol = []
    for i in range(len(policy1)):
        #choosing between the ith element between pol1 and pol2 with probability p
        new_pol.append(np.random.choice((policy1[i], policy2[i]), p=[p, 1-p]))
        
    return new_pol

We'll modify this function to return a copy of the mutated policy and not replace it directly on the selected policy

In [15]:
def mutation(policy, p=0.01):
    """
    for each state, with probability p replace action with random action
    Tip: mutation can be written as crossover with random policy
    """
    # if we modify "policy" directly, we'll change the value of policy. Lists work that way, so we
    # need to use a copy
    mutated_policy = list(policy)
    #n_actions = env.action_space.n
    for a in policy:
        # with 1% probability, we mutate element a from policy
        if np.random.choice((0,1), p=[1-p, p]):
            mutated_policy[a] = np.random.randint(0, n_actions)
    return mutated_policy

In [16]:
np.random.seed(1234)
policies = [crossover(get_random_policy(), get_random_policy()) 
            for i in range(10**4)]

assert all([len(p) == n_states for p in policies]), 'policy length should always be 16'
assert np.min(policies) == 0, 'minimal action id should be 0'
assert np.max(policies) == n_actions-1, 'maximal action id should be n_actions-1'

assert any([np.mean(crossover(np.zeros(n_states), np.ones(n_states))) not in (0, 1)
               for _ in range(100)]), "Make sure your crossover changes each action independently"
print("Seems fine!")

Seems fine!


We'll add n_low_scorers to control how many random low scorers will be added to the final pool.

This was an experimentation. It result as pure shit, so now the default is 0

In [21]:
n_epochs = 20 #default: 100 - how many cycles to make
pool_size = 100 #how many policies to maintain
n_crossovers = 50 #how many crossovers to make on each step
n_mutations = 50 #how many mutations to make on each tick
n_low_scorers = 0 #how many random low-scorer policies, not into the best scorers, can pass to the next pool

In [22]:
print("initializing...")
#pool = <spawn a list of pool_size random policies>
pool = [get_random_policy() for i in range(pool_size)]
#pool_scores = <evaluate every policy in the pool, return list of scores>
pool_scores = [evaluate(p) for p in pool]

initializing...


In [23]:
assert type(pool) == type(pool_scores) == list
assert len(pool) == len(pool_scores) == pool_size
assert all([type(score) in (float, int) for score in pool_scores])

# Diverse Pool

## Changing some code below here.
### crossovered now prioritize higher scores as 3.-moar
### now we'll:
#### 1. Remove duplicates
#### 2. With probability p, select a random policy from the final pool but not one into the best scorers.
#### 3. mutations will not change the actual policy selected, so we'll have a policy x and its mutated policy y, and keep both of them
#### 4. crossovers will be between a high scorer and a random policy from the pool, to prevent crossovers between very similar policies 

In [24]:
#main loop

for epoch in range(n_epochs):
    print("Epoch %s:"%epoch)
    
    # 1. Removing duplicates from pool
    # converting list of np arrays to list of tuples because I couldn't find a better way to remove duplicates
    uniques_pool = [tuple(policy) for policy in pool]
    # set() returns only unique elements. We need to convert them to np arrays again
    uniques_pool = [np.asarray(policy) for policy in set(uniques_pool)]
    if len(uniques_pool) != len(pool):
        # We found some duplicates
        print("We found", len(pool) - len(uniques_pool), "duplicated policies at this epoch!")
        pool = uniques_pool
    # We could check for duplicates on crossovered or mutated,
    #but it's more similar code and I want to finish this one
    
    # evaluation policies before crossovering them:
    pool_scores = [evaluate(p) for p in pool]
    # we'll select the best n_crossovers (50) policies to mix between them
    # we could use another number of best policies instead of n_crossovers (50),
    # but it's late and I don't know what I'm doing
    selected_indices = np.argsort(pool_scores)[-n_crossovers:]
    
    #crossovered = <crossover random guys from pool, n_crossovers total>
    # using selected_indices as a fraction of pool with 50 best scores
    crossovered = [crossover(pool[np.random.choice(selected_indices)], 
                             pool[np.random.choice(len(pool))]) 
                   for c in range(n_crossovers)]
    # from now on it's all the same: mutations, adding all to a pool, evaluating (again) and selecting
    # best scores. Repeat for n_epochs.
    #mutated = <add several new policies at random, n_mutations total>
    mutated = [mutation(pool[np.random.choice(len(pool))]) 
               for m in range(n_mutations)]
    
    assert type(crossovered) == type(mutated) == list
    
    #add new policies to the pool
    #pool = <add up old population with crossovers/mutations>
    #plus sing (+) concatenates lists in python
    pool = pool + crossovered + mutated
    #pool_scores = <evaluate all policies again>
    pool_scores = [evaluate(p) for p in pool]
    
    # 2. Adding a couple of random low scorers to the final pool
    # select pool_size-n_low_scorers best policies. we'll add n_low_scorers later 
    selected_indices = np.argsort(pool_scores)[-pool_size+n_low_scorers:]
    # Now we need to add n_low_scorers to our indices
    # np.argsort(pool_scores)[:-pool_size+n_low_scorers] will contain all indices NOT used abobe
    # so we need to choose n_low_scorers random indices from there
    low_scorers_indices = np.random.choice(np.argsort(pool_scores)[:-pool_size+n_low_scorers], n_low_scorers)
    # now we need to concatenate all indices into one numpy array
    selected_indices = np.concatenate((selected_indices, low_scorers_indices))
    #filling pool only with best values
    pool = [pool[i] for i in selected_indices]
    pool_scores = [pool_scores[i] for i in selected_indices]

    #print the best policy so far (last in ascending score order)
    print("Best score:", pool_scores[-1])
    print_policy(pool[-1])
    print("")

Epoch 0:
Best score: 0.15
> < < <
v H v H
v ^ ^ H
H v ^ G

Epoch 1:
We found 15 duplicated policies at this epoch!
Best score: 0.15
< < > v
< H > H
v v ^ H
H < v G

Epoch 2:
We found 16 duplicated policies at this epoch!
Best score: 0.21
^ < > v
> H > H
v v > H
H v < G

Epoch 3:
We found 23 duplicated policies at this epoch!
Best score: 0.21
< < > >
^ H ^ H
v v > H
H v ^ G

Epoch 4:
We found 15 duplicated policies at this epoch!
Best score: 0.37
^ < > >
> H > H
< ^ > H
H ^ v G

Epoch 5:
We found 27 duplicated policies at this epoch!
Best score: 0.33
> v > <
> H v H
< ^ ^ H
H ^ ^ G

Epoch 6:
We found 18 duplicated policies at this epoch!
Best score: 0.42
^ < > >
> H > H
< ^ > H
H ^ v G

Epoch 7:
We found 16 duplicated policies at this epoch!
Best score: 0.32
^ < > >
> H > H
< ^ > H
H ^ v G

Epoch 8:
We found 15 duplicated policies at this epoch!
Best score: 0.64
> < > <
> H v H
< ^ ^ H
H v ^ G

Epoch 9:
We found 22 duplicated policies at this epoch!
Best score: 0.61
> < > <
> H v H
< ^ 