# Note: This was the original attempt using DEAP

## Realigning a Toastmasters District Using a Genetic Algorithm

This notebook takes cleaned district data and uses the DEAP library to realign the district into areas.

The fitness function targets an ideal spread (standard deviation) of club quality scores within each area, minimizes the normalized average distance between the clubs in an area, and keeps each area at 4-6 clubs, with an ideal of 5.

The club quality score is equal parts:
1. A normalized absolute variation from 25 members - minimized,
2. A normalized awards per member (participation) - maximized, and
3. A normalized absolute variation of new members from 20% (retention) - minimized.
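
As a rough illustration of how the three components might combine, the sketch below min-max normalizes each component across a few made-up clubs and averages them with equal weight. The function names, the min-max normalization, the choice that a higher score means a healthier club, and the sample numbers are all assumptions for illustration; the step that actually produced `n_quality` is not shown in this notebook.

```python
def min_max(values):
    """Scale values to the 0-1 range; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def quality_scores(members, awards, new_members):
    """Equal-weight score from three normalized components (assumed: higher is better)."""
    # Component 1: absolute deviation from 25 members (smaller is better)
    size_dev = min_max([abs(m - 25) for m in members])
    # Component 2: awards per member (larger is better)
    participation = min_max([a / m for a, m in zip(awards, members)])
    # Component 3: absolute deviation of the new-member share from 20% (smaller is better)
    retention_dev = min_max([abs(n / m - 0.20) for n, m in zip(new_members, members)])
    # Invert the "minimize" components so that 1.0 is always the best value
    return [
        ((1 - s) + p + (1 - r)) / 3
        for s, p, r in zip(size_dev, participation, retention_dev)
    ]

# Hypothetical clubs, not taken from the district data
scores = quality_scores(members=[25, 12, 40], awards=[10, 3, 8], new_members=[5, 6, 4])
print([round(s, 3) for s in scores])
```

The first hypothetical club sits exactly at 25 members with a 20% new-member share and the highest awards per member, so it ends up with the top score of the three.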

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import contextily as ctx
from itertools import combinations
import random
from deap import creator, base, tools, algorithms

### Prepare Data

In [2]:
# Import data - only quality score needed for algorithm
clubs = pd.read_csv('clubs_to_realign.csv')
clubs.head()

Unnamed: 0,club_no,n_quality
0,5509,0.677709
1,7036,0.695792
2,9682,0.614715
3,584009,0.4759
4,1100434,0.768516


We'll be using the distance matrix we created earlier. Let's import it and set it up.

In [3]:
dist = pd.read_csv('club_distance_matrix.csv')
dist.set_index('club_no', inplace=True)
# Make sure the columns are integers too
dist.columns = dist.columns.astype(int)
dist.head()

club_no,5509,7036,9682,584009,1100434,718,4819,9790,5069647,7575630,...,1783,596735,4700632,5258000,2690,8569,3929213,4822437,5569,1565753
5509,0.0,0.0,0.309451,0.31531,0.0,0.309451,0.338685,0.297754,0.347098,0.311623,...,0.752496,0.737553,0.748308,0.748308,0.748079,0.748308,0.748308,0.749572,0.748308,0.748308
7036,0.0,0.0,0.309451,0.31531,0.0,0.309451,0.338685,0.297754,0.347098,0.311623,...,0.752496,0.737553,0.748308,0.748308,0.748079,0.748308,0.748308,0.749572,0.748308,0.748308
9682,0.360002,0.360002,0.0,0.033326,0.360002,0.0,0.035332,0.024728,0.100703,0.011137,...,0.896155,0.892741,0.894353,0.894353,0.893528,0.894353,0.894353,0.895944,0.894353,0.894353
584009,0.378986,0.378986,0.034432,0.0,0.378986,0.034432,0.047424,0.024879,0.070442,0.045275,...,0.927359,0.924798,0.926084,0.926084,0.925535,0.926084,0.926084,0.927159,0.926084,0.926084
1100434,0.0,0.0,0.309451,0.31531,0.0,0.309451,0.338685,0.297754,0.347098,0.311623,...,0.752496,0.737553,0.748308,0.748308,0.748079,0.748308,0.748308,0.749572,0.748308,0.748308


Since we're optimizing standard deviation, let's consider what the optimum would be. Ideally, each area has a good variety of strong and weak clubs, not all strong or all weak. An area where every club has the same score has a standard deviation of 0. An area split evenly between the weakest (0) and strongest (1) scores has a standard deviation of 0.5. The ideal is something in between.

In [4]:
# This is a hypothetical area as a list of club scores
# ranging from our normalized 0 to 1 scale. 
# This would be a well-balanced area.
ideal = [0.0, 0.25, 0.5, 0.75, 1.0]
# Calculate ideal standard deviation.
std = round(np.std(ideal), 3)
print(f'An ideal distribution: {std}')

An ideal distribution: 0.354
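
The two extremes mentioned above can be checked the same way. This is a small sketch using only the standard library; `statistics.pstdev` is the population standard deviation, which is what `np.std` computes by default. The example score lists are made up.

```python
from statistics import pstdev

all_strong = [1.0, 1.0, 1.0, 1.0]        # every club at the top of the scale
half_half = [0.0, 0.0, 1.0, 1.0]         # two weak clubs, two strong clubs
balanced = [0.0, 0.25, 0.5, 0.75, 1.0]   # the evenly spread area used above

for name, area_scores in [("all strong", all_strong),
                          ("half and half", half_half),
                          ("balanced", balanced)]:
    print(f"{name}: {pstdev(area_scores):.3f}")
```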


### Functions to Group and Format Areas

DEAP's toolbox registers callables under an alias, so any additional formatting needs to be built into those functions beforehand.

Since we want areas of 5 clubs, if the last area would end up with fewer than 4 clubs, we make the last 2-4 areas have 4 clubs each instead. Here we encode a district as a list of areas, each of which is a list of club numbers. `areas_list` takes the clubs dataframe as input; `clubs` is set as the default so that we don't need to pass the parameter when registering the function in the DEAP toolbox. We will also use the grouping function during crossover, so it has a `randomize` parameter that lets us turn shuffling off at that point.

In [43]:
def group_areas(clubs=clubs, randomize=False):
    # clubs is expected to be a list of club numbers when this is called
    if randomize:
        random.shuffle(clubs)
    
    # Check how many clubs would be left in last area
    if len(clubs) % 5 in [0, 4]:
        for i in range(0, len(clubs), 5): 
            yield clubs[i:i + 5]
        
    elif len(clubs) % 5 == 3:
        for i in range(0, len(clubs)-8, 5):
            yield clubs[i:i + 5]
        for i in range(len(clubs)-8, len(clubs), 4):
            yield clubs[i:i + 4]
        
    elif len(clubs) % 5 == 2:
        for i in range(0, len(clubs)-12, 5):
            yield clubs[i:i + 5]
        for i in range(len(clubs)-12, len(clubs), 4):
            yield clubs[i:i + 4]
                
    else: # Remaining cases have 1 left over club
        for i in range(0, len(clubs)-16, 5):
            yield clubs[i:i + 5]
        for i in range(len(clubs)-16, len(clubs), 4):
            yield clubs[i:i + 4]

def areas_list(clubs=clubs):
    return list(group_areas(list(clubs['club_no']), randomize=True))
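
To make the remainder rule easier to check, here is a small sketch that only computes the area sizes for a given number of clubs. `area_sizes` is a hypothetical helper written for this illustration, not part of the notebook's pipeline, and it assumes there are enough clubs for the areas of 4 to exist (at least 16 when one club is left over).

```python
def area_sizes(n_clubs, ideal=5, fallback=4):
    """Return the list of area sizes the grouping rule produces for n_clubs."""
    remainder = n_clubs % ideal
    if remainder in (0, 4):
        # Full areas of 5, plus one area of 4 when the remainder is 4
        sizes = [ideal] * (n_clubs // ideal)
        return sizes + ([fallback] if remainder == 4 else [])
    # Remainder 3 -> 2 areas of 4, remainder 2 -> 3 areas, remainder 1 -> 4 areas
    n_fallback = ideal - remainder
    n_ideal = (n_clubs - n_fallback * fallback) // ideal
    return [ideal] * n_ideal + [fallback] * n_fallback

for n in (100, 99, 98, 97, 96):
    sizes = area_sizes(n)
    print(n, "clubs:", sizes.count(5), "areas of 5 and", sizes.count(4), "areas of 4")
```

Each club that would otherwise be left over is absorbed by shrinking one more area from 5 to 4, which is why the slices in `group_areas` start 8, 12, or 16 clubs from the end.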

Test our areas_list function on a list of our clubs. Notice the last two areas have 4 clubs each.

In [44]:
district1 = areas_list(clubs)
district1

[[6970706, 7384295, 1526701, 6632930, 7575630],
 [6990556, 7036, 6590, 997315, 6861322],
 [1408278, 9354, 4822437, 1171779, 4700632],
 [4793192, 8412, 4858, 4750107, 5859633],
 [3401898, 9790, 7059, 9019, 7031829],
 [1171849, 1412885, 4182, 1176575, 1036600],
 [8552, 7274, 7479409, 4107, 614471],
 [6754191, 3812934, 2690, 4110, 7022029],
 [4786679, 584516, 8952, 6887806, 1581643],
 [9214, 1995527, 1100434, 4095, 4015],
 [9598, 437, 7532701, 7533, 3356972],
 [7479372, 5869106, 4801055, 1463775, 7402713],
 [5736, 1588444, 3074518, 5042512, 630505],
 [3549, 1582927, 4533, 8941, 1158551],
 [2556863, 8631, 3408653, 4446, 607240],
 [6820584, 2923054, 2364, 1291183, 3431353],
 [992874, 6613239, 7306126, 1331602, 6661],
 [2912, 5674, 6975086, 8041, 9469],
 [1595518, 3318, 2808, 8569, 6786],
 [4108, 3859, 6644914, 4819, 3372438],
 [5981, 695532, 7452, 6380, 1176566],
 [9872, 1190, 3063370, 4718634, 3240871],
 [2038660, 5258000, 2189079, 7881, 6523],
 [967, 1165752, 6715494, 6891000, 5918],
 [55

### Build evaluation function
- We will make sure most areas have 5 clubs through the `areas_list` function. 
- Our next criterion is minimizing the distance from the ideal quality distribution.
- The final criterion is minimizing the distance between clubs.
- Write functions for both calculations and call them both in the evaluation function.
- Keep the two functions separate in case we decide to change the weights later.
- Where parameters are constant (i.e. clubs or dist), set them as the default to work with the DEAP library better. 

In [45]:
# Function to calculate quality distribution
# Specifically, we're going to minimize the average distance 
# of the standard deviation from 0.35 which is an approximately 
# ideal quality distribution of 5 clubs

# This function returns a list of quality scores from a list of club numbers
def area_qualities(area, clubs=clubs):
    return [clubs[clubs['club_no'] == club]['n_quality'].iloc[0] for club in area]

# This function returns the average distance from the ideal distribution for the district
def quality_std(district, clubs=clubs):
    return sum([abs(np.std(area_qualities(area, clubs))-0.35) for area in district]) / len(district)

In [46]:
# Function to calculate average distance between all clubs in each area for the district

# This function calculates average distance in one area.
def area_dist(clubs, dist=dist):
    return sum([dist.loc[pair] for pair in list(combinations(clubs, 2))]) / \
            len(list(combinations(clubs, 2)))

# This function calculates average distance of all areas
def district_dist(district, dist=dist):
    return sum([area_dist(area, dist) for area in district]) / len(district)

In [47]:
# Evaluation Function
# This returns a tuple representing the quality distribution and
# average physical distance between clubs
def evaluate_district(district, clubs=clubs, dist=dist):
    return quality_std(district, clubs), district_dist(district, dist)
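
To see what the two objectives look like on something small, here is a sketch of the same two calculations on toy data, using only the standard library. The club numbers, quality scores, and distances are made up, the helper names are hypothetical, and the distances are assumed to be symmetric for simplicity.

```python
from itertools import combinations
from statistics import pstdev

# Hypothetical toy data: club number -> quality score
quality = {1: 0.0, 2: 0.5, 3: 1.0, 4: 0.4, 5: 0.5, 6: 0.6}
# Hypothetical symmetric distances, keyed by unordered pair
distance = {
    frozenset(pair): d
    for pair, d in {
        (1, 2): 0.1, (1, 3): 0.2, (2, 3): 0.3,
        (4, 5): 0.6, (4, 6): 0.9, (5, 6): 0.9,
    }.items()
}

def quality_gap(district, target=0.35):
    """Mean absolute gap between each area's quality std and the target."""
    gaps = [abs(pstdev([quality[c] for c in area]) - target) for area in district]
    return sum(gaps) / len(gaps)

def mean_pair_distance(district):
    """Mean over areas of the average pairwise distance within each area."""
    averages = []
    for area in district:
        pairs = list(combinations(area, 2))
        averages.append(sum(distance[frozenset(p)] for p in pairs) / len(pairs))
    return sum(averages) / len(averages)

toy_district = [[1, 2, 3], [4, 5, 6]]
print(quality_gap(toy_district), mean_pair_distance(toy_district))
```

The first toy area has a wide spread of scores and clubs close together; the second has nearly identical scores and clubs far apart, so it contributes most of both penalties.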

### Build Mating Function
DEAP's built-in crossover operators work on flat sequences, not on grouped (nested) individuals.
- Here we'll implement a variation of the one-point crossover. 
- Offspring 1 carries forward the first cut of parent 1.
- Add any remaining non-duplicating complete areas from parent 2. 
- Fill in the remaining areas with missing clubs sequenced from parent 1.
- Repeat for offspring 2.

Create a helper flattening function.

In [48]:
def flatten(district):
    return [club for area in district for club in area]

# Regroup: list(group_areas(clubs_list))

In [49]:
# Create parents for testing
#d1 = list(group_areas(list(range(99)), randomize=True))
#d2 = list(group_areas(list(range(99)), randomize=True))
district2 = areas_list(clubs)
district2

[[4750107, 4446, 596735, 1588444, 6661],
 [1190, 4822437, 5869106, 1176566, 6654663],
 [4801055, 7327347, 8631, 6786, 2146],
 [9214, 3356972, 7059, 4786679, 6820584],
 [3761504, 3318, 6142, 713, 6590],
 [4182, 5981, 8853, 6887806, 6975086],
 [9682, 5509, 997315, 8363, 7554675],
 [6523, 1565753, 1207, 7463287, 7384295],
 [695532, 1100434, 4721, 9598, 3929213],
 [4700632, 7587, 3812934, 437, 967],
 [1036600, 6990556, 718, 1412885, 4819],
 [7817, 845547, 2556863, 6380, 7022029],
 [9019, 5928, 7031829, 5055, 9469],
 [3401898, 1291183, 2808, 4718634, 8412],
 [7479372, 5569, 5042512, 953233, 2690],
 [9872, 6754191, 3063370, 8983, 2923054],
 [1244830, 7036, 1595518, 4015, 1171779],
 [4095, 1582927, 7306126, 8941, 730163],
 [6071, 5553533, 6970706, 7452, 3395235],
 [5918, 8952, 1165752, 5736, 7532701],
 [5674, 583467, 614471, 1171849, 1581643],
 [6861322, 5069647, 6891000, 8552, 3408653],
 [4154, 3372438, 5112712, 584516, 1331602],
 [7881, 1176575, 5859633, 6715494, 9354],
 [6613239, 4793192, 

In [51]:
# Mating function
def group_crossover(p1, p2):
    # Find the cut-point
    cut = int(len(p1)/2)
    # Make ordered lists of club numbers
    p1_list = flatten(p1)
    p2_list = flatten(p2)
    
    # Initialize offspring
    o1_cut = p1[:cut]
    o2_cut = p2[:cut]
    
    # Add any remaining valid areas from second parent
    p2_rem = [area for area in p2 if not any(club in area for club in flatten(o1_cut))]
    p1_rem = [area for area in p1 if not any(club in area for club in flatten(o2_cut))]

    o1_valid = o1_cut + p2_rem
    o2_valid = o2_cut + p1_rem

    # Add missing clubs
    o1_missing = [club for club in p1_list if club not in flatten(o1_valid)]
    o2_missing = [club for club in p2_list if club not in flatten(o2_valid)]
    
    o1 = o1_valid + list(group_areas(o1_missing))
    o2 = o2_valid + list(group_areas(o2_missing))
    
    return o1, o2
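
The crossover steps are easier to follow on a tiny example. The sketch below builds one offspring from two hypothetical parents with areas of 3 clubs. `chunk` and `toy_crossover` are simplified stand-ins written for this illustration: they use a fixed area size instead of `group_areas`, and they produce only the first offspring.

```python
def chunk(items, size):
    """Split a flat list into consecutive areas of at most `size` clubs."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def toy_crossover(p1, p2, size=3):
    """One offspring: first half of p1, intact areas of p2, then regrouped leftovers."""
    head = p1[: len(p1) // 2]
    used = {club for area in head for club in area}
    # Keep only those areas of p2 that share no club with the inherited head
    intact = [area for area in p2 if used.isdisjoint(area)]
    used |= {club for area in intact for club in area}
    # Clubs not yet placed, in the order they appear in p1
    leftovers = [club for area in p1 for club in area if club not in used]
    return head + intact + chunk(leftovers, size)

parent1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
parent2 = [[1, 4, 7], [2, 5, 8], [9, 11, 12], [3, 6, 10]]
child = toy_crossover(parent1, parent2)
print(child)
```

Here only one area of the second parent survives intact, because the other three each contain a club already inherited from the first parent; the three unplaced clubs are regrouped in first-parent order.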


### Mutation Function
Like the crossover mating function, we need a custom mutation function since I didn't see a swap mutation function.

In [52]:
def swap_mutation(district):
    d_list = flatten(district)

    # randrange excludes the upper bound, so both indices are always valid
    club_index_1 = random.randrange(len(d_list))
    club_index_2 = random.randrange(len(d_list))
    
    club_1 = d_list[club_index_1]
    club_2 = d_list[club_index_2]
    
    d_list[club_index_1] = club_2
    d_list[club_index_2] = club_1
    
    return list(group_areas(d_list))
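
As a point of comparison, here is a self-contained sketch of a swap mutation on a toy district. It is a variation, not the function above: `toy_swap` is a hypothetical name, it always picks two different positions, it rebuilds areas with the same sizes as the input rather than regrouping, and it returns a one-element tuple because DEAP's built-in algorithms unpack the mutation result that way.

```python
import random

def toy_swap(district, rng=random):
    """Swap two randomly chosen clubs and keep the area sizes unchanged."""
    flat = [club for area in district for club in area]
    # sample picks two distinct valid positions, so the swap is never a no-op
    i, j = rng.sample(range(len(flat)), 2)
    flat[i], flat[j] = flat[j], flat[i]
    # Rebuild areas with the same sizes as the input district
    mutated, start = [], 0
    for area in district:
        mutated.append(flat[start:start + len(area)])
        start += len(area)
    # Returned as a 1-tuple, matching the convention of DEAP's mutation operators
    return (mutated,)

before = [[1, 2, 3], [4, 5, 6], [7, 8]]
(after,) = toy_swap(before, random.Random(0))
print(after)
```

Note that the two swapped clubs can fall in the same area, in which case the district's fitness does not change.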


### Algorithm Constants

In [16]:
POPULATION_SIZE = 200
P_CROSSOVER = 0.5  # probability for crossover
P_MUTATION = 0.1   # probability for mutating an individual
MAX_GENERATIONS = 50
HALL_OF_FAME_SIZE = 10

random.seed(42)

### Initialize algorithm objects

In [56]:
# The weights correspond to quality std and average distance
creator.create('FitnessMin', base.Fitness, weights=(-1, -1))
creator.create('Individual', list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()

#toolbox.register('make_district', areas_list, clubs)
#toolbox.register('individual', tools.initRepeat, creator.Individual, toolbox.make_district, n=1)
toolbox.register('individual', tools.initRepeat, creator.Individual, areas_list, n=1)
toolbox.register('population', tools.initRepeat, list, toolbox.individual)

### Initialize Rest of Genetic Algorithm

In [57]:
toolbox.register('evaluate', evaluate_district)
toolbox.register('mate', group_crossover)
toolbox.register('mutate', swap_mutation)
toolbox.register('select', tools.selTournament, tournsize=3)
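
One thing to keep in mind with a two-value fitness and tournament selection: as far as I understand DEAP's `base.Fitness`, individuals are compared on their tuples of weighted values, and Python compares tuples element by element. If that holds, the quality objective decides almost every comparison and distance only breaks ties. The sketch below shows the effect with plain tuples and made-up scores; the weighted sum is just one possible alternative, and its 0.5 weights are arbitrary assumptions.

```python
# Two hypothetical districts: (quality gap, mean distance); lower is better for both
district_a = (0.10, 0.90)   # slightly better quality balance, clubs far apart
district_b = (0.11, 0.20)   # slightly worse quality balance, clubs close together

# Weights of -1 turn minimization into maximization of the negated values
weights = (-1.0, -1.0)
weighted_a = tuple(w * v for w, v in zip(weights, district_a))
weighted_b = tuple(w * v for w, v in zip(weights, district_b))

# Tuples compare element by element, so the first objective decides here
lexicographic_winner = "a" if weighted_a > weighted_b else "b"

# One possible alternative: collapse both objectives into a single weighted sum
def combined(scores, quality_weight=0.5, distance_weight=0.5):
    return quality_weight * scores[0] + distance_weight * scores[1]

sum_winner = "a" if combined(district_a) < combined(district_b) else "b"
print(lexicographic_winner, sum_winner)
```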

### Track evaluation stats

In [58]:
stats = tools.Statistics(lambda ind: ind.fitness.values)
stats.register("min", np.min)
stats.register("avg", np.mean)
hof = tools.HallOfFame(HALL_OF_FAME_SIZE)

In [66]:
#population = toolbox.population(n=POPULATION_SIZE)
population = [areas_list() for _ in range(POPULATION_SIZE)]

In [67]:
population[:2]

[[[1190, 6523, 1165752, 1535564, 2690],
  [3549, 7306126, 8569, 2923054, 4822437],
  [4793192, 5869106, 2912, 1036600, 6590],
  [7402713, 7059, 1408278, 718, 7463287],
  [4154, 3318, 4095, 4015, 3356972],
  [7587, 6071, 2876291, 630505, 9354],
  [6891000, 5918, 3431353, 4108, 7034388],
  [9682, 5569, 3395235, 5258000, 6754191],
  [8169, 6644914, 9598, 4533, 3401898],
  [8952, 5042512, 584516, 4446, 1565753],
  [6380, 992874, 7479409, 9872, 4107],
  [1244830, 7479372, 5069647, 7031829, 5928],
  [8983, 6661, 4721, 3812934, 6820584],
  [730163, 3240871, 7817, 4157985, 6786],
  [3372438, 7384295, 9790, 1171779, 5553533],
  [5509, 614471, 7022029, 4786679, 967],
  [7452, 7881, 5545568, 8853, 6975086],
  [1581643, 6970706, 4700632, 7533, 6654663],
  [953233, 6142, 1412885, 1331602, 1526701],
  [3408653, 8552, 5981, 3859, 4750107],
  [3929213, 6715494, 2556863, 7554675, 997315],
  [1595518, 7274, 8412, 6990556, 4718634],
  [5674, 596735, 2146, 1171849, 1100434],
  [2038660, 4182, 4858, 686132

### Run Algorithm

In [68]:
# perform the Genetic Algorithm flow:
population, logbook = algorithms.eaSimple(population, toolbox, cxpb=P_CROSSOVER, 
                                          mutpb=P_MUTATION, ngen=MAX_GENERATIONS,
                                          stats=stats, halloffame=hof, verbose=True)

# Genetic Algorithm is done - extract statistics:
minFitnessValues, meanFitnessValues = logbook.select("min", "avg")

AttributeError: 'list' object has no attribute 'fitness'
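
The error is consistent with the population being built from plain Python lists: `areas_list()` returns an ordinary list, while `eaSimple` reads a `fitness` attribute on every individual. The sketch below reproduces the difference without DEAP. `ToyFitness` and `ToyIndividual` are hypothetical stand-ins for what `creator.create` generates, not DEAP classes.

```python
class ToyFitness:
    """Stand-in for a DEAP fitness object; it only stores values."""
    def __init__(self):
        self.values = ()

class ToyIndividual(list):
    """A list subclass that carries a fitness attribute, like creator.Individual."""
    def __init__(self, areas):
        super().__init__(areas)
        self.fitness = ToyFitness()

plain = [[1, 2, 3], [4, 5, 6]]     # the kind of object areas_list() returns
wrapped = ToyIndividual(plain)     # same areas, plus a fitness slot

print(hasattr(plain, "fitness"), hasattr(wrapped, "fitness"))

# With DEAP itself, the analogous construction would presumably be:
# population = [creator.Individual(areas_list()) for _ in range(POPULATION_SIZE)]
```

The custom crossover and mutation functions also return plain lists, so their results would likely need the same wrapping before the algorithm could run end to end.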

### Visualize Stats

In [None]:
sns.set_style("whitegrid")
plt.plot(minFitnessValues, color='red')
plt.plot(meanFitnessValues, color='green')
plt.xlabel('Generation')
plt.ylabel('Min / Average Fitness')
plt.title('Min and Average Fitness over Generations')
plt.show()

### Check Best Results

In [None]:
print("Hall of Fame Individuals = ", *hof.items, sep="\n")
print("Best Ever Individual = ", hof.items[0])

### Export Solution(s)

In [None]:
first_district = hof.items[0]
second_district = hof.items[1]
third_district = hof.items[2]