## Watch the [YouTube video](https://www.youtube.com/watch?v=EOaPb9wrgDY) first for more context.

In [None]:
import math
import json
import random
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt

## Setting up the keyboard

### Coordinates

Start by creating a dictionary that stores the coordinates of each key. Keys are numbered top to bottom, then left to right, so on a QWERTY layout "Q" is index 0, "A" is 1, "Z" is 2, "W" is 3, "S" is 4, and so on, for 30 keys in total. The middle and bottom rows are shifted horizontally relative to the top row by `MIDDLE_OFFSET` and `BOTTOM_OFFSET`. Set both offsets to 0 to represent an ortholinear keyboard.

In [None]:
KEY_WIDTH = 94
MIDDLE_OFFSET = 24
BOTTOM_OFFSET = 71
offsets = [0, MIDDLE_OFFSET, BOTTOM_OFFSET]

coords = {}
for i in range(30):
    row = i%3
    column = math.floor(i/3)
    x = column*KEY_WIDTH + offsets[row]
    y = row*KEY_WIDTH
    coords[i] = (x, y)
coords
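
To illustrate the ortholinear case mentioned above, here is a minimal sketch that builds the coordinates twice: once with the staggered offsets used in this notebook and once with both offsets set to 0. The helper name `build_coords` and its parameters are assumptions made for this example and are not part of the original notebook.

```python
KEY_WIDTH = 94

def build_coords(middle_offset, bottom_offset, key_width=KEY_WIDTH, n_keys=30):
    """Return {key index: (x, y)} with keys numbered top to bottom, left to right."""
    offsets = [0, middle_offset, bottom_offset]
    coords = {}
    for i in range(n_keys):
        row = i % 3        # 0 = top, 1 = middle, 2 = bottom
        column = i // 3    # 0..9
        coords[i] = (column * key_width + offsets[row], row * key_width)
    return coords

staggered = build_coords(24, 71)   # offsets used in this notebook
ortholinear = build_coords(0, 0)   # every row starts at the same x position

print(staggered[0], staggered[1], staggered[2])        # (0, 0) (24, 94) (71, 188)
print(ortholinear[0], ortholinear[1], ortholinear[2])  # (0, 0) (0, 94) (0, 188)
```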

### Keys per finger

Define all of the keys that each finger is responsible for typing. This can change depending on how many fingers you want to simulate typing with. For example, assuming 10-finger typing on a QWERTY layout, the pinky finger on the left hand is responsible for typing the "Q", "A", and "Z" keys, so keys 0, 1, and 2 are grouped together. Similarly, the ring finger on the left hand is responsible for typing "W", "S", and "X", so keys 3, 4, and 5 are grouped together. Each index finger covers two columns, and the thumbs are not modeled, which gives eight groups. We define the keys for each finger as elements in a two-dimensional array.

In [None]:
keys_per_finger = [[0,1,2], [3,4,5], [6,7,8], [9,10,11,12,13,14], [15,16,17,18,19,20], [21,22,23], [24,25,26], [27,28,29]]
# The keys per finger for a 2 finger typing would be defined as follows:
# keys_per_finger = [[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14], [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]]
# The keys per finger for a 1 finger typing would be defined as follows:
# keys_per_finger = [[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]]

### Home keys

Each finger that is used to type needs a home key: the default position it starts in and returns to when it is not actively typing. For example, for 10-finger typing on a QWERTY keyboard the home keys are "A", "S", "D", "F", "J", "K", "L", and ";". These correspond to keys 1, 4, 7, 10, 19, 22, 25, and 28. Use a dictionary to associate each key with the home key of the finger that types it.

In [None]:
home_key_pos = [1, 4, 7, 10, 19, 22, 25, 28]
# The home keys for a 2 finger typing would be defined as follows:
# home_key_pos = [7, 22]
# The home key for a 1 finger typing would be defined as follows:
# home_key_pos = [16]
home_keys = {}
for i, keys in enumerate(keys_per_finger):
    for key in keys:
        home_keys[key] = home_key_pos[i]
home_keys
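
The finger groups and the home key list have to stay in sync: one home key per finger, and each home key inside its own finger's group. The following is a sketch of a checked version of the mapping. The function name `build_home_keys` and the error messages are hypothetical and only serve this example.

```python
def build_home_keys(keys_per_finger, home_key_pos):
    """Map every key index to the home key of the finger that types it."""
    if len(keys_per_finger) != len(home_key_pos):
        raise ValueError('need exactly one home key per finger')
    home_keys = {}
    for keys, home in zip(keys_per_finger, home_key_pos):
        if home not in keys:
            raise ValueError(f'home key {home} is not typed by the finger covering {keys}')
        for key in keys:
            home_keys[key] = home
    return home_keys

eight_fingers = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12, 13, 14],
                 [15, 16, 17, 18, 19, 20], [21, 22, 23], [24, 25, 26], [27, 28, 29]]
home_keys = build_home_keys(eight_fingers, [1, 4, 7, 10, 19, 22, 25, 28])
print(home_keys[0], home_keys[12], home_keys[29])  # 1 10 28

# Two-finger typing: each hand covers 15 keys.
two_fingers = [list(range(0, 15)), list(range(15, 30))]
print(build_home_keys(two_fingers, [7, 22])[14])   # 7
```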

### Keyboard object

Each keyboard is represented as a dictionary that maps a character to the index of the key that types it. This function converts a genome (a sequence of 30 characters representing a keyboard) into that dictionary. The genome lists the characters in key-index order, so for the QWERTY layout the genome is "qazwsxedcrfvtgbyhnujmik,ol.p;/". The shifted symbols "<", ">", ":", and "?" are also added to the dictionary, with the same key index as ",", ".", ";", and "/" respectively.

In [None]:
def genome_to_keyboard(genome):
    keyboard = {}
    for i, char in enumerate(genome):
        keyboard[char] = i
        if char == ',':
            keyboard['<'] = i
        elif char == '.':
            keyboard['>'] = i
        elif char == ';':
            keyboard[':'] = i
        elif char == '/':
            keyboard['?'] = i
    return keyboard
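
As a quick usage illustration, the sketch below rewrites the same mapping with a lookup table and inspects a few entries of the QWERTY keyboard. The table name `SHIFTED` is an assumption for this example. The behavior is intended to match the function above.

```python
SHIFTED = {',': '<', '.': '>', ';': ':', '/': '?'}  # unshifted -> shifted symbol

def genome_to_keyboard(genome):
    """Map each character (and its shifted symbol, if any) to its key index."""
    keyboard = {}
    for i, char in enumerate(genome):
        keyboard[char] = i
        if char in SHIFTED:
            keyboard[SHIFTED[char]] = i
    return keyboard

qwerty = genome_to_keyboard('qazwsxedcrfvtgbyhnujmik,ol.p;/')
print(qwerty['q'], qwerty['a'], qwerty['z'])  # 0 1 2
print(qwerty[','], qwerty['<'])               # 23 23
print(len(qwerty))                            # 34 (30 keys + 4 shifted symbols)
```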

## Data Collection

### arXiv.org dataset

The arXiv.org metadata dataset can be downloaded from [Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv). Download the json file and save it in the same directory as this notebook. The following code builds a collection of abstracts. It skips any abstract that contains a character outside `legal_chars`, which filters out unwanted data such as scientific notation and math equations. The original video used only a small subset of 587 abstracts, because the dataset is very large and calculating the distance over the entire dataset would take a very long time.

In [None]:
ARXIV_JSON = 'arxiv-metadata-oai-snapshot.json'
DATA_LIMIT = 586

full_text = ''
legal_chars = 'qazwsxedcrfvtgbyhnujmik,ol.p;:? '

count = 0
with open(ARXIV_JSON) as file:
    for line in file:
        if count > DATA_LIMIT:
            break
        abstract = json.loads(line)['abstract'].replace('\n', ' ').strip().lower()
        if any(char not in legal_chars for char in abstract):
            continue
        full_text += ' ' + abstract
        count += 1
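
To see what the filter keeps without downloading the dataset, here is a small sketch that runs the same test on three made-up records. The abstracts are invented for illustration, and the one-JSON-object-per-line layout is assumed from the way the code above reads the file.

```python
import io
import json

legal_chars = 'qazwsxedcrfvtgbyhnujmik,ol.p;:? '

# Three invented records, one JSON object per line.
sample = io.StringIO('\n'.join(json.dumps({'abstract': text}) for text in [
    'We study keyboard layouts.\nResults follow.',
    'We obtain a 25% speedup with $x^2$ terms.',
    'Is typing distance a useful metric?',
]))

kept = []
for line in sample:
    abstract = json.loads(line)['abstract'].replace('\n', ' ').strip().lower()
    if any(char not in legal_chars for char in abstract):
        continue  # contains digits, math, or other unsupported symbols
    kept.append(abstract)

print(kept)
# ['we study keyboard layouts. results follow.', 'is typing distance a useful metric?']
```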

## Calculate distance

### Calculate distance between two keys

This simple function calculates the Euclidean distance between any two keys, given their coordinates as input parameters.

In [None]:
def distance(first, second):
    return math.hypot(second[0] - first[0], second[1] - first[1])

### Calculate distance for all letter pairings

Calculate the distance for every valid key pairing, meaning every pair of keys typed by the same finger, and save it as a dictionary of dictionaries. The outer dictionary key is the starting key. The inner dictionary key is the ending key, and the inner dictionary value is the distance for that pairing, measured in key widths. For example, the distances from key 0 to keys 0, 1, and 2 will be represented as:
```
{
  0: {
    0: 0.0,
    1: 1.0320793902668004,
    2: 2.1378744155702085
  }
}
```

In [None]:
distances = {i: {} for i in range(30)}
for keys in keys_per_finger:
    for i in keys:
        for j in keys:
            distances[i][j] = distance(coords[i], coords[j]) / KEY_WIDTH
distances
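
The example values above can be reproduced by hand. This sketch recomputes them for the first column only, using the coordinates that the staggered offsets (24 and 71) give for keys 0, 1, and 2.

```python
import math

KEY_WIDTH = 94
# Coordinates of keys 0, 1, and 2 (first column) with the offsets used above.
coords = {0: (0, 0), 1: (24, 94), 2: (71, 188)}

def distance(first, second):
    return math.hypot(second[0] - first[0], second[1] - first[1])

# Dividing by KEY_WIDTH expresses each distance in key widths instead of pixels.
first_column = {j: distance(coords[0], coords[j]) / KEY_WIDTH for j in (0, 1, 2)}
print(first_column)
```

Moving from "Q" to "A" costs slightly more than one key width because of the 24-pixel row offset, and moving from "Q" to "Z" costs slightly more than two.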

### Calculate total distance for a given string

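This function calculates the total distance for any given string on any given keyboard. Spaces are ignored. If the next key is typed by the same finger as the current key, the finger moves directly between the two keys. Otherwise, the next finger starts from its home key. The function raises a `KeyError` if the string contains a character that is not on the keyboard.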

In [None]:
def total_distance(input_string, keyboard):
    input_string = input_string.lower()
    input_string = input_string.replace(' ', '')
    first_char = input_string[0]
    first_pos = keyboard[first_char]
    first_home_key = home_keys[first_pos]
    total_dist = distances[first_home_key][first_pos]
    for i in range(0, len(input_string)-1):
        cur_char = input_string[i]
        next_char = input_string[i+1]
        cur_pos = keyboard[cur_char]
        next_pos = keyboard[next_char]
        if cur_pos in distances and next_pos in distances[cur_pos]:
            total_dist += distances[cur_pos][next_pos]
        else:
            home_key = home_keys[next_pos]
            total_dist += distances[home_key][next_pos]
    return total_dist
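
To make the movement rule concrete, the following sketch traces two short strings on QWERTY and prints each step. It rebuilds the tables from the earlier cells so it can run on its own. The helper name `trace` is hypothetical and is meant to mirror the logic of `total_distance`, not replace it.

```python
import math

KEY_WIDTH = 94
offsets = [0, 24, 71]
coords = {i: ((i // 3) * KEY_WIDTH + offsets[i % 3], (i % 3) * KEY_WIDTH) for i in range(30)}

keys_per_finger = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12, 13, 14],
                   [15, 16, 17, 18, 19, 20], [21, 22, 23], [24, 25, 26], [27, 28, 29]]
home_key_pos = [1, 4, 7, 10, 19, 22, 25, 28]
home_keys = {key: home for keys, home in zip(keys_per_finger, home_key_pos) for key in keys}

# Same-finger distances only, in key widths.
distances = {i: {} for i in range(30)}
for keys in keys_per_finger:
    for i in keys:
        for j in keys:
            dx = coords[j][0] - coords[i][0]
            dy = coords[j][1] - coords[i][1]
            distances[i][j] = math.hypot(dx, dy) / KEY_WIDTH

keyboard = {char: i for i, char in enumerate('qazwsxedcrfvtgbyhnujmik,ol.p;/')}

def trace(text):
    """Print each finger movement and return the total distance."""
    total, prev = 0.0, None
    for char in text.replace(' ', '').lower():
        pos = keyboard[char]
        if prev is not None and pos in distances[prev]:
            start = prev            # same finger: move from the previous key
        else:
            start = home_keys[pos]  # first key or different finger: start at home
        step = distances[start][pos]
        print(f'{char}: key {start} -> key {pos}, {step:.3f}')
        total += step
        prev = pos
    return total

de_total = trace('de')  # "D" is a home key, then the same finger reaches up to "E"
dk_total = trace('dk')  # two home keys on different fingers
print(de_total, dk_total)
```

Under this model "de" costs about 1.03 key widths, while "dk" costs nothing because both characters sit on home keys of different fingers.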

### Test the distance

Calculate the total distance for a test string.

In [None]:
test_string = 'the quick brown fox jumps over the lazy dog.'
qwerty = genome_to_keyboard(list('qazwsxedcrfvtgbyhnujmik,ol.p;/'))
total_distance(test_string, qwerty)

Calculate the total distance for the entire dataset.

In [None]:
total_distance(full_text, qwerty)

## Genetic Algorithm

### Initialize the population

This function will initialize the population for the first generation with random keyboards for a given population size.

In [None]:
def init_population(pop_size):
    keyboard_chars = list('qazwsxedcrfvtgbyhnujmik,ol.p;/')
    population = []
    for i in range(pop_size):
        rand_genome = keyboard_chars[:]
        random.shuffle(rand_genome)
        population.append(rand_genome)
    return population

### Combine two keyboards

This function defines the logic for "mating" two keyboards to create a new keyboard. It picks a random starting index and a random length, then copies that many consecutive keys (wrapping around the end of the keyboard) from the first keyboard into the same positions of the child. It then fills the remaining positions with the keys from the second keyboard that are not yet on the child, in the order they appear. There is also a random chance of mutation, in which two keys on the child keyboard switch places.

In [None]:
def mate(board1, board2, mutation_rate):
    keyboard_size = len(board1)
    idx = random.randint(0, keyboard_size-1)
    length = random.randint(0,keyboard_size-1)
    child = ['_' for i in range(keyboard_size)]
    for i in range(length):
        if idx > keyboard_size-1:
            idx = 0
        child[idx] = board1[idx]
        idx += 1

    child_idx = idx
    while '_' in child:
        if idx > keyboard_size-1:
            idx = 0
        if child_idx > keyboard_size-1:
            child_idx = 0
        char = board2[idx]
        if char in child:
            idx += 1
            continue
        child[child_idx] = board2[idx]
        child_idx += 1
        idx += 1
        
    prob = random.random()
    if prob < mutation_rate:
        point1 = random.randint(0, keyboard_size-1)
        point2 = random.randint(0, keyboard_size-1)
        allele1 = child[point1]
        allele2 = child[point2]
        child[point1] = allele2
        child[point2] = allele1
        
    return child
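
The key property of this crossover is that the child is always a valid keyboard: every character appears exactly once. Below is a condensed sketch of the same idea without the mutation step, written with modular arithmetic instead of explicit wrap-around checks. The name `crossover_sketch` and the `rng` parameter are assumptions for this example.

```python
import random

def crossover_sketch(board1, board2, rng=random):
    """Copy a random wrapped slice from board1, then fill the gaps in board2's order."""
    size = len(board1)
    start = rng.randint(0, size - 1)
    length = rng.randint(0, size - 1)
    child = [None] * size
    # Keys inherited from the first parent keep their positions.
    for step in range(length):
        idx = (start + step) % size
        child[idx] = board1[idx]
    taken = set(child) - {None}
    # Remaining slots are filled, starting right after the slice, with the
    # unused keys of the second parent in the order they appear from there.
    slot = (start + length) % size
    for step in range(size):
        char = board2[(start + length + step) % size]
        if char in taken:
            continue
        child[slot] = char
        slot = (slot + 1) % size
    return child

random.seed(0)
qwerty = list('qazwsxedcrfvtgbyhnujmik,ol.p;/')
other = qwerty[::-1]
child = crossover_sketch(qwerty, other)
print(''.join(child))
print(sorted(child) == sorted(qwerty))  # True: every key appears exactly once
```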

### Evaluate the population

This function evaluates a given population by calculating the total distance for each keyboard in the population. It returns the evaluations as a dictionary that maps each population index to its total distance. It also returns the indices sorted by total distance, from the lowest (best) to the highest.

In [None]:
def get_evals(population):
    evals = {}
    for i, genome in enumerate(population):
        keyboard = genome_to_keyboard(genome)
        dist = total_distance(full_text, keyboard)
        evals[i] = dist
    sorted_evals = [k for k, v in sorted(evals.items(), key=lambda item: item[1])]
    return evals, sorted_evals
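
The sorting step can be checked on a toy example. The distances below are made up and stand in for the totals of three keyboards.

```python
evals = {0: 5120.4, 1: 4870.9, 2: 5033.0}  # made-up distances for three keyboards
sorted_evals = [k for k, v in sorted(evals.items(), key=lambda item: item[1])]

print(sorted_evals)            # [1, 2, 0]
print(evals[sorted_evals[0]])  # 4870.9
```

The first entry of `sorted_evals` is the index of the best keyboard, which is how the training loop looks up the best layout of each generation.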

### Create next generation

This function will create a new generation from the current population. It will directly copy the top 10% best keyboards from the current generation to the next generation. It will then randomly combine keyboards from the top 50% best keyboards to create the remaining population for the next generation.

In [None]:
def new_generation(population, sorted_evals, p_size, mutation_rate):
    new_gen = []
    
    sorted_population = []
    for i in sorted_evals:
        sorted_population.append(population[i])
        
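    elite_count = int(p_size*0.1)
    for i in range(elite_count):
        new_gen.append(sorted_population[i])

    # Fill the remaining slots so the new generation always has p_size keyboards
    for _ in range(p_size - elite_count):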
        p1 = random.choice(sorted_population[:int(p_size*0.5)])
        p2 = random.choice(sorted_population[:int(p_size*0.5)])
        child = mate(p1, p2, mutation_rate)
        new_gen.append(child)
    
    return new_gen
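
For a sense of scale, this sketch shows which slices of a sorted population are carried over and which are eligible as parents. The population of 20 placeholder names is an assumption for illustration.

```python
p_size = 20
sorted_population = [f'kb{i:02d}' for i in range(p_size)]  # stand-ins, best first

elite = sorted_population[:int(p_size * 0.1)]        # copied unchanged
parent_pool = sorted_population[:int(p_size * 0.5)]  # candidates for mating

print(elite)             # ['kb00', 'kb01']
print(len(parent_pool))  # 10
```

Because both parents are drawn independently from the same pool, a keyboard can occasionally be mated with itself.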

### Run the algorithm

The following code runs the genetic algorithm. Adjust the constants `P_SIZE` to change the population size, `GENERATIONS` to change the total number of generations the algorithm will run, and `MUTATION_RATE` to change how often mutations occur during mating. The training data is stored in the `learning` dictionary and saved to a json file once the algorithm is complete. For each generation it contains every keyboard in the population, the best keyboard in the population, the distance of that best keyboard, and the average distance of all the keyboards.

In [None]:
P_SIZE = 100
GENERATIONS = 100
MUTATION_RATE = .1

learning = {
    'generations': {}
}

population = init_population(P_SIZE)

for i in range(GENERATIONS):    
    evals, sorted_evals = get_evals(population)
    sum_evals = 0
    for key in evals:
        sum_evals += evals[key]
    avg_evals = sum_evals/P_SIZE
    learning['generations'][i] = {
        'population': population,
        'best': population[sorted_evals[0]],
        'min': evals[sorted_evals[0]],
        'avg': avg_evals
    }
    print('GEN: {}, AVG: {}, MIN: {}, BEST: {}'.format(i+1, avg_evals, evals[sorted_evals[0]], population[sorted_evals[0]]))
    
    population = new_generation(population, sorted_evals, P_SIZE, MUTATION_RATE)

LEARNING_JSON = 'learning.json'
with open(LEARNING_JSON, 'w') as fp:
    json.dump(learning, fp)
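
One detail to keep in mind when reading the file back: JSON object keys are always strings, so the integer generation numbers used as keys here come back as strings after a round trip. A minimal demonstration with made-up values:

```python
import json

learning = {'generations': {0: {'min': 1.5, 'avg': 2.0}}}  # made-up values
restored = json.loads(json.dumps(learning))

print(list(learning['generations']))  # [0]
print(list(restored['generations']))  # ['0']
```

This is why code that reads `learning.json` has to look up generations with `str(i)` rather than `i`.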

### Visualize the training

This code loads a learning json file and plots the data. It draws one graph of the lowest distance in each generation and another graph of the average distance in each generation.

In [None]:
with open(LEARNING_JSON) as fp:
    learning = json.load(fp)
    
min_dists = []
avg_dists = []
generations = len(learning['generations'])

for i in range(0, generations):
    min_dist = learning['generations'][str(i)]['min']
    avg_dist = learning['generations'][str(i)]['avg']
    min_dists.append(min_dist)
    avg_dists.append(avg_dist)

plt.plot(min_dists, label='Lowest Distance')
plt.xlabel('Generations')
plt.ylabel('Distance')
plt.legend()
plt.show()

plt.plot(avg_dists, label='Average Distance', color='orange')
plt.xlabel('Generations')
plt.ylabel('Distance')
plt.legend()
plt.show()

### Visualizing keyboards

The following code will create an image for any keyboard string to help visualize what layout the keyboard string corresponds to. This isn't needed, but can be helpful when viewing keyboards generated by the genetic algorithm.

In [None]:
kb = learning['generations'][str(GENERATIONS-1)]['best'] # best keyboard found

with Image.open("template.jpg").convert("RGBA") as base:

    # make a blank image for the text, initialized to transparent text color
    txt = Image.new("RGBA", base.size, (255, 255, 255, 0))

    # get a font
    fnt = ImageFont.truetype("SFNSMono.ttf", 40)
    # get a drawing context
    d = ImageDraw.Draw(txt)
    
    x_offsets = [110, 135, 175]
    for i in range(30):
        row = i%3
        column = math.floor(i/3)
        x = column*60 + x_offsets[row]
        y = row*65 + 85
        char_coords = (x, y)
        d.text(char_coords, kb[i], font=fnt, fill=(0, 0, 0, 255))

    out = Image.alpha_composite(base, txt)

    display(out)