In [14]:
# Imports as always...
from feature_selection import find_optimal_features
from fitness import fitness_function

import random
import numpy as np

## Which Features are Selected?

First, we can run our algorithm a number of times and record how often each feature appears in the 'best feature selection' it returns. This gives an empirical estimate of the probability that each feature is selected. Because the algorithm is evolutionary, and therefore stochastic, repeated observation is the practical way to estimate this.

In [3]:
counts = [0] * 30
algorithm_runs = 100

for i in range(algorithm_runs):
    selection = find_optimal_features()

    for j in range(30):
        counts[j] += selection[j]

probabilities = [count / algorithm_runs for count in counts]
probabilities

[0.53,
 0.77,
 0.54,
 0.62,
 0.25,
 0.67,
 0.75,
 0.63,
 0.1,
 0.66,
 0.35,
 0.41,
 0.26,
 0.36,
 0.19,
 0.43,
 0.48,
 0.46,
 0.47,
 0.15,
 0.55,
 0.54,
 0.63,
 0.62,
 0.98,
 0.55,
 0.38,
 0.48,
 0.6,
 0.57]

We can see that a few features are rarely chosen and one is almost always chosen, but many are selected roughly half the time, as if at random. Let's show more explicitly how we might interpret the above.

In [4]:
interpretations = []
for prob in probabilities:
    if prob < 0.2:
        interpretations.append('Almost never')
    elif prob < 0.4:
        interpretations.append('Infrequently')
    elif prob <= 0.6:
        interpretations.append('Pretty much at random')
    elif prob < 0.8:
        interpretations.append('Frequently')
    else:
        interpretations.append('Almost always')

interpretations

['Pretty much at random',
 'Frequently',
 'Pretty much at random',
 'Frequently',
 'Infrequently',
 'Frequently',
 'Frequently',
 'Frequently',
 'Almost never',
 'Frequently',
 'Infrequently',
 'Pretty much at random',
 'Infrequently',
 'Infrequently',
 'Almost never',
 'Pretty much at random',
 'Pretty much at random',
 'Pretty much at random',
 'Pretty much at random',
 'Almost never',
 'Pretty much at random',
 'Pretty much at random',
 'Frequently',
 'Frequently',
 'Almost always',
 'Pretty much at random',
 'Infrequently',
 'Pretty much at random',
 'Pretty much at random',
 'Pretty much at random']
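
Since each probability above is estimated from only 100 runs, it may help to check how much of the spread around 0.5 could be sampling noise. The sketch below is not part of the original analysis: it copies the printed probabilities, assumes the runs are independent, and uses the normal approximation to the binomial standard error, sqrt(p(1 - p) / n). The helper `standard_error` and the two-standard-error cutoff are illustrative choices.

```python
import math

# Selection frequencies copied from the output above (100 runs).
probabilities = [0.53, 0.77, 0.54, 0.62, 0.25, 0.67, 0.75, 0.63, 0.1, 0.66,
                 0.35, 0.41, 0.26, 0.36, 0.19, 0.43, 0.48, 0.46, 0.47, 0.15,
                 0.55, 0.54, 0.63, 0.62, 0.98, 0.55, 0.38, 0.48, 0.6, 0.57]
algorithm_runs = 100


def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent runs."""
    return math.sqrt(p * (1 - p) / n)


# Features whose frequency lies within two standard errors of 0.5, i.e. that
# this rough check cannot distinguish from a coin flip.
# ASSUMPTION: runs are independent and the normal approximation is adequate.
near_coin_flip = [
    index for index, p in enumerate(probabilities)
    if abs(p - 0.5) <= 2 * standard_error(p, algorithm_runs)
]

print("Features indistinguishable from 0.5:", near_coin_flip)
```

With 100 runs the standard error near p = 0.5 is about 0.05, so differences of a few percentage points between features should not be over-interpreted.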

This particular dataset is made up of ten measurements, each reported as three features (mean, standard error, and worst), and so we expect lots of inter-correlation. It is perhaps not surprising, then, to see some randomness in the selection: including one feature of a measurement may mean you can get away with leaving out the other two (e.g. selecting the worst may let you leave out the mean and standard error), so our algorithm settles on one of them pretty much at random.

This isn't always the case: for a few features the algorithm makes the same decision (select or don't select) on almost every run, and that may in turn affect how often the features correlated with them are chosen.
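
One way to look at this grouping is to line up the three variants of each measurement side by side. The sketch below reuses the probabilities printed earlier and is only illustrative. It assumes the 30 features are ordered as ten means, then ten standard errors, then ten worst values; the source does not state the column order, so treat that layout as hypothetical.

```python
# Selection frequencies copied from the earlier output (100 runs).
probabilities = [0.53, 0.77, 0.54, 0.62, 0.25, 0.67, 0.75, 0.63, 0.1, 0.66,
                 0.35, 0.41, 0.26, 0.36, 0.19, 0.43, 0.48, 0.46, 0.47, 0.15,
                 0.55, 0.54, 0.63, 0.62, 0.98, 0.55, 0.38, 0.48, 0.6, 0.57]

# ASSUMPTION: columns 0-9 are means, 10-19 standard errors, 20-29 worst values.
means = probabilities[:10]
errors = probabilities[10:20]
worsts = probabilities[20:]

# For each measurement, the sum of its three probabilities is the expected
# number of its variants that end up in a selection.
expected_selected = [m + e + w for m, e, w in zip(means, errors, worsts)]

for index, (m, e, w) in enumerate(zip(means, errors, worsts)):
    print("measurement {}: mean={:.2f} se={:.2f} worst={:.2f} expected={:.2f}".format(
        index, m, e, w, expected_selected[index]))
```

If the ordering assumption holds, an expected count close to one for a measurement would fit the idea that one variant stands in for the other two, although these marginal frequencies alone cannot confirm it.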

## Comparison with Nothing and Randomness

Let's now see how much better our algorithm makes the model over *no feature selection* (i.e. all features) and over *random selection* (which we will check by randomly sampling several combinations of features and averaging the model's performance on each).

In [16]:
# Our selection's fitness...
our_selection = find_optimal_features()
our_fitness = fitness_function(our_selection)
print("Fitness of our algorithm's feature selection:", our_fitness)

Fitness of our algorithm's feature selection: 0.9912280701754386


In [19]:
# Fitness of no selection.
no_selection = [1] * 30
no_fitness = fitness_function(no_selection)
print("Fitness of no feature selection:", no_fitness) 

print("We improve on no feature selection by {}%.".format(
    round(((our_fitness / no_fitness) - 1) * 100, 3)
))

Fitness of no feature selection: 0.956140350877193
We improve on no feature selection by 3.67%.


In [20]:
# Fitness of random selection, averaged over 100 random feature masks.
fitnesses = []
for i in range(100):
    # Use a throwaway name so the inner loop does not shadow the outer `i`.
    random_selection = [random.randint(0, 1) for _ in range(30)]
    fitnesses.append(fitness_function(random_selection))
random_fitness = np.mean(fitnesses)
print("Fitness of random feature selection:", random_fitness)

print("We improve on random feature selection by {}%.".format(
    round(((our_fitness / random_fitness) - 1) * 100, 3)
))

Fitness of random feature selection: 0.9551754385964912
We improve on random feature selection by 3.774%.
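
The percentages printed above are relative improvements, (new / baseline - 1) * 100, which differ from the difference in percentage points. The short sketch below reproduces both from the printed fitness values. The helper names `relative_improvement` and `point_difference` are illustrative and not part of the original notebook.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1) * 100


def point_difference(new, baseline):
    """Absolute difference between two fitness values, in percentage points."""
    return (new - baseline) * 100


# Fitness values copied from the outputs above.
our_fitness = 0.9912280701754386
no_fitness = 0.956140350877193
random_fitness = 0.9551754385964912

for name, baseline in [("no selection", no_fitness), ("random selection", random_fitness)]:
    print("{}: {}% relative, {} percentage points".format(
        name,
        round(relative_improvement(our_fitness, baseline), 3),
        round(point_difference(our_fitness, baseline), 3),
    ))
```

Reporting both figures avoids ambiguity: the 3.67% relative gain over no feature selection corresponds to roughly 3.5 percentage points of fitness.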


In [6]:
# Count how many features fall into each interpretation bucket.
for label in ['Almost never', 'Infrequently', 'Pretty much at random',
              'Frequently', 'Almost always']:
    print(label + ' count:', interpretations.count(label))

Almost never count: 3
Infrequently count: 5
Pretty much at random count: 13
Frequently count: 8
Almost always count: 1
