### Optimizing a scikit-learn model in strange ways

We'll now try to maximize some aspect of a specific machine learning model. I've chosen accuracy.

Requires the Diabetes dataset file (`diabetes.csv`) from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

Author: Raido Everest

In [None]:
import random
from collections import defaultdict
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [None]:
data = pd.read_csv('diabetes.csv')  # Just going to use these like globals. YOLO
X = data.drop('Outcome', axis=1).values
y = data[['Outcome']].values.ravel()

### Common stuff and comments

We've got a few more challenges here: discrete variables in the search space, variables with a limited range (we'd get errors if we somehow ended up testing zero predictors), and a far more expensive fitness calculation than we're used to, since we essentially have to train and test a model to get it.

I'll try some weird kind of discretization for the continuous variables to shrink the search space a bit, and possibly to make memoization of results feasible. So each numeric parameter will have its own minimum, maximum, and step size. Might actually even work okay.

The random forest parameters we'll tune are `max_depth`, `ccp_alpha`, and `min_impurity_decrease`, so the search space has three dimensions.

PS: Absolutely not as flexible as it might first look - it's still written around just this one model.
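
To get a feel for how much the discretization shrinks things, here is a rough sketch that counts the grid points per parameter. It assumes the same `[min, max, decimals]` layout as the `ranges` list defined in the next cell; `grid_size` and `example_ranges` are hypothetical names used only for this illustration.

```python
# Hypothetical helper: number of distinct values a [min, max, decimals] range allows.
def grid_size(start, end, decimals):
    step = 10 ** -decimals
    return round((end - start) / step) + 1

# Same layout as the notebook's ranges: max_depth, ccp_alpha, min_impurity_decrease.
example_ranges = [[2, 20, 0], [0.0, 1.0, 2], [0.0, 0.5, 2]]

sizes = [grid_size(*r) for r in example_ranges]
total = 1
for s in sizes:
    total *= s

print(sizes)  # [19, 101, 51]
print(total)  # 97869
```

That is still far too many models to train exhaustively, but it is a finite grid, so two candidates that round to the same point can share one cached score.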

In [None]:
# Common to all...
# ranges defines the valid input range of each parameter. If a value leaves these bounds, we project it back.
# Each entry is [minimum, maximum, 'step size' as a number of decimal places].
ranges = [[2, 20, 0], [0.0, 1.0, 2], [0.0, 0.5, 2]]

# Creates n parameter sets with given ranges for the discussed numbers.
def create_population(n, ranges):
    models = []
    for _ in range(n):
        models.append([gen_param(*dist) for dist in ranges])
    return models

# Generates a random parameter
def gen_param(start, end, decimals):
    rnd = round(random.random() * (end - start) + start, decimals)
    return int(rnd) if decimals == 0 else rnd

# Creates a model based on generated parameters
def create_model(max_depth, ccp_alpha, min_impurity_decrease):
    return RandomForestClassifier(max_depth=max_depth, ccp_alpha=ccp_alpha, min_impurity_decrease=min_impurity_decrease,
                                 random_state=42)

# Updates memory, returns the best new one it found (each method will use same kind of memory)
# This is, of course, an awful name for the function, for its return values aren't really going to be expected
def update_memory(memory, pop):
    best, bestacc = None, 0.0
    for params in pop:
        # If model hasn't been evaluated, evaluate it.
        a,b,c = params
        if memory[a][b][c] is None:
            model = create_model(a,b,c)
            score = cross_val_score(model, X, y, cv=4).mean()
            memory[a][b][c] = score
            if score > bestacc:
                best, bestacc = params, score
    return best, bestacc

# Given some next generation, fix it to be within the limited search space
def fix_generation(pop, ranges):
    for params in pop:
        for i in range(len(params)):
            low, high, decimals = ranges[i]
            # Rounds each parameter to the wanted step
            value = round(params[i], decimals)
            # Projects it back inside if it is outside the search space
            value = min(max(value, low), high)
            # round(x, 0) returns a float, but max_depth has to stay an int for scikit-learn
            params[i] = int(value) if decimals == 0 else value

def plot(data, title):
    sns.lineplot(data = data)
    plt.xlabel('Generation')
    plt.ylabel('Best value')
    plt.title(title)
    plt.show()

In [None]:
# Example population of model parameters
create_population(4, ranges)

In [None]:
# Any of these can be fed into create_model as create_model(*elem).
create_model(*create_population(1, ranges)[0])

### Differential Evolution
#### Implementation

In [None]:
# ranges  - [min, max, decimals] for each model parameter
# n       - population size
# scaling - factor applied to the difference vector when generating the next generation
# loops   - how many consecutive generations without improvement to allow before stopping
def de(ranges, n=30, scaling=0.5, loops=20):
    # Since some things are discretized now, we may be able to avoid
    # recalculating accuracy by remembering it.
    memory = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: None)))

    # Initial model set
    pop = create_population(n, ranges)
    best, bestacc = update_memory(memory, pop)
    history = [bestacc]
    
    # The main loop...
    loops_since_improvement = 0
    while loops_since_improvement < loops:
        loops_since_improvement += 1
        nextpop = de_next(pop, ranges, n, scaling)
        nextbest, nextbestacc = update_memory(memory, nextpop)
        if nextbestacc > bestacc:
            loops_since_improvement = 0
            best, bestacc = nextbest, nextbestacc
        history.append(bestacc)
        # overwrite pop with better ones from nextpop
        de_updatepop(pop, nextpop, memory)

    return best, bestacc, history

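# Builds the next candidate generation: each member is shifted by the scaled difference of two random members.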
def de_next(pop, ranges, n, scaling):
    nextpop = []
    for i in range(n):
        a, b = random.choice(pop), random.choice(pop)  # Picking two random elements
        nextpop.append([pop[i][j]+scaling*(b[j]-a[j]) for j in range(len(a))])  # Add the scaled ab vector to pop[i]...
        
    fix_generation(nextpop, ranges)  # Solve some problems
    return nextpop

# Replaces each population member with its candidate if the candidate scored higher.
def de_updatepop(pop, nextpop, memory):
    # Now that everything in pop and nextpop are guaranteed to be in memory,
    # this couldn't be easier.
    for i in range(len(pop)):
        a1,b1,c1 = pop[i]
        a2,b2,c2 = nextpop[i]
        if memory[a2][b2][c2] > memory[a1][b1][c1]:
            pop[i] = nextpop[i]

#### Sanity check

It seems to do okay and reaches a reasonably good accuracy in the end. However, the scaling has to be quite high (even above one for the fastest results) for it to really end up somewhere good. High scaling also helps performance: by smashing into the corners of some variable's range, it is a lot more likely to try something we've already computed. It feels strange to do it like this, so I'll test it with several configs later.
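
To see why a high scaling factor tends to push candidates into the corners, here is one mutation step worked by hand. The vectors are made-up illustrative values, not taken from an actual run; the update follows `de_next`, and the repair step is a sketch of rounding and clamping in the spirit of `fix_generation`, with integer-stepped parameters cast back to `int`.

```python
# Illustrative values only (not from a real run).
example_ranges = [[2, 20, 0], [0.0, 1.0, 2], [0.0, 0.5, 2]]
target = [18, 0.10, 0.40]   # plays the role of pop[i]
a = [4, 0.30, 0.05]         # first randomly picked member
b = [12, 0.05, 0.45]        # second randomly picked member
scaling = 1.0

# Mutation: move the target along the scaled difference vector (b - a).
raw = [t + scaling * (bj - aj) for t, aj, bj in zip(target, a, b)]

# Repair: snap to the grid, then clamp into [min, max].
fixed = []
for value, (low, high, decimals) in zip(raw, example_ranges):
    value = round(value, decimals)
    value = min(max(value, low), high)
    fixed.append(int(value) if decimals == 0 else value)

print(raw)    # roughly [26.0, -0.15, 0.8], all outside the allowed ranges
print(fixed)  # [20, 0.0, 0.5]
```

All three coordinates overshoot and get clamped to a boundary, so any other candidate that overshoots the same way lands on the exact same point and is served from memory.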

In [None]:
best, bestval, his = de(ranges, 15, 1.0, 10)
(best, bestval)

#### Accuracy graph over time

We find the best model VERY early, though: after one or two generations it doesn't improve any more. That may be because of the chosen parameters, or it may be the data set.
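
One way to put a number on "very early" is to look for the last generation in which the best value improved. This is a small sketch; `last_improvement` is a hypothetical helper, and the history below is made up for illustration rather than taken from a run.

```python
def last_improvement(history):
    """Index of the last generation whose best value beat the previous one."""
    last = 0
    for gen in range(1, len(history)):
        if history[gen] > history[gen - 1]:
            last = gen
    return last

example_history = [0.74, 0.76, 0.77, 0.77, 0.77, 0.77]  # made-up values
print(last_improvement(example_history))  # 2
```

With a real run, something like `last_improvement(his)` should give the generation after which the curve goes flat.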

In [None]:
plot(his, "Differential Evolution")

#### Measuring time

So we know that most of the run is spent without finding anything better. Let's compare the time this method takes when waiting ~3 loops for an improvement against ~10 like above, to see whether the extra loops evaluate many models we haven't trained yet. (Also freeze the seed, because otherwise things may get weird.)

In [None]:
random.seed(42)
%timeit -r 1 -n 1 de(ranges, 15, 1.0, 3)

In [None]:
random.seed(42)
%timeit -r 1 -n 1 de(ranges, 15, 1.0, 10)

It doesn't seem like adding seven extra loops changed the runtime much, suggesting we do get a lot of collisions.

In [None]:
random.seed(42)
%timeit -r 1 -n 1 de(ranges, 15, 3.0, 3)  # Very high scaling - makes big jumps

Bigger jumps seem to have cut down on processing time somewhat.
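
Timing is only an indirect signal for collisions. A more direct check would be to count how often a requested parameter set was already in memory. Since `de` keeps its `memory` local, the sketch below shows the idea on a cheap stand-in instead; `CountingMemo` and `fake_score` are hypothetical names, and `fake_score` is not the real cross-validated accuracy.

```python
class CountingMemo:
    """Caches results by parameter tuple and counts cache hits and misses."""

    def __init__(self, func):
        self.func = func
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, params):
        key = tuple(params)
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.func(*key)
        return self.cache[key]


def fake_score(max_depth, ccp_alpha, min_impurity_decrease):
    # Stand-in for the expensive fitness; cheap and deterministic.
    return 1.0 - ccp_alpha - min_impurity_decrease + 0.001 * max_depth


score = CountingMemo(fake_score)
candidates = [[20, 0.0, 0.5], [20, 0.0, 0.5], [5, 0.1, 0.2], [20, 0.0, 0.5]]
for c in candidates:
    score(c)

print(score.hits, score.misses)  # 2 2
```

The same kind of counter could be added around the `cross_val_score` call in `update_memory` to report how many models each configuration actually had to train.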