<p style="font-size: 18px;">
  This is the accompanying code for the post titled "Evolution in Code: Unraveling Genetic Algorithms with Python"<br>
  You can find it <a href="https://pureai.substack.com/p/evolution-in-code">here</a>.<br>
  Published: August 19, 2023<br>
  <a href="https://pureai.substack.com">https://pureai.substack.com</a>
</p>

Welcome to this Jupyter notebook! If you're new to Python or don't have it installed on your system, don't worry; you can still follow along and explore the code.

Here's a quick guide to getting started:

- Using an Online Platform: You can run this notebook in a web browser using platforms like Google Colab or Binder. These services offer free access to Jupyter notebooks and don't require any installation.
- Installing Python Locally: If you'd prefer to run this notebook on your own machine, you'll need to install Python. A popular distribution for scientific computing is Anaconda, which includes Python, Jupyter, and other useful tools.
  - Download Anaconda from [here](https://www.anaconda.com/download).
  - Follow the installation instructions for your operating system.
  - Launch Jupyter Notebook from Anaconda Navigator or by typing jupyter notebook in your command line or terminal.
- Opening the Notebook: Once you have Jupyter running, navigate to the location of this notebook file (.ipynb) and click on it to open.
- Running the Code: You can run each cell in the notebook by selecting it and pressing Shift + Enter. Feel free to modify the code and experiment with it.
- Need More Help?: If you're new to Python or Jupyter notebooks, you might find these resources helpful:
  - [Python.org's Beginner's Guide](https://docs.python.org/3/tutorial/index.html)
  - [Jupyter Notebook Basics](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20Basics.html)

Happy coding, and enjoy exploring the fascinating world of Genetic Algorithms!

In [1]:
# A small readme
from platform import python_version
print(f"The version of Python used is: {python_version()}")

The version of Python used is: 3.10.12


In [2]:
from pprint import pprint
from typing import List, Tuple, Callable, Dict, Union
from collections import Counter
import random

## Local Search - Genetic Algorithm

_This notebook is based on a programming assignment for Johns Hopkins University's Artificial Intelligence course._

There are some key ideas in the Genetic Algorithm.

First, there is a problem of some kind that either *is* an optimization problem or the solution can be expressed in terms of an optimization problem.
For example, if we wanted to minimize the function

$$f(x) = \sum (x_i - 0.5)^2$$

where $n = 10$.
This *is* an optimization problem. Normally, optimization problems are much, much harder.

![Eggholder](http://www.sfu.ca/~ssurjano/egg.png)!

The function we wish to optimize is often called the **objective function**.
The objective function is closely related to the **fitness** function in the GA.
If we have a **maximization** problem, then we can use the objective function directly as a fitness function.
If we have a **minimization** problem, then we need to convert the objective function into a suitable fitness function, since fitness functions must always mean "more is better".

Second, we need to *encode* candidate solutions using an "alphabet" analogous to G, A, T, C in DNA.
This encoding can be quite abstract.
You saw this in the Self Check.
There a floating point number was encoded as bits, just as in a computer and a sophisticated decoding scheme was then required.

Sometimes, the encoding need not be very complicated at all.
For example, in the real-valued GA, discussed in the Lectures, we could represent 2.73 as....2.73.
This is similarly true for a string matching problem.
We *could* encode "a" as "a", 97, or '01100001'.
And then "hello" would be:

```
["h", "e", "l", "l", "o"]
```

or

```
[104, 101, 108, 108, 111]
```

or

```
0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1
```

In Genetics terminology, this is the **chromosome** of the individual. And if this individual had the **phenotype** "h" for the first character then they would have the **genotype** for "h" (either as "h", 104, or 01101000).

To keep it straight, think **geno**type is **genes** and **pheno**type is **phenomenon**, the actual thing that the genes express.
So while we might encode a number as 10110110 (genotype), the number itself, 182, is what goes into the fitness function.
The environment operates on zebras, not the genes for stripes.

## String Matching

We are going to write a Genetic Algorithm that will solve the problem of matching a target string (at least at the start).
Now, this is kind of silly because in order for this to work, you need to know the target string and if you know the target string, why are you trying to do it?
Well, the problem is *pedagogical*.
It's a fun way of visualizing the GA at work, because as the GA finds better and better candidates, they make more and more sense.

Now, string matching is not *directly* an optimization problem so this falls under the general category of "if we convert the problem into an optimization problem we can solve it with an optimization algorithm" approach to problem solving.
This happens all the time.
We have a problem.
We can't solve it.
We convert it to a problem we *can* solve.
In this case, we're using the GA to solve the optimization part.

And all we need is some sort of measure of the difference between two strings.
We can use that measure as a **loss function**.
A loss function gives us a score tells us how similar two strings are.
The loss function becomes our objective function and we use the GA to minimize it by converting the objective function to a fitness function.
So that's the first step, come up with the loss/objective function.
The only stipulation is that it must calculate the score based on element to element (character to character) comparisons with no global transformations of the candidate or target strings.

And since this is a GA, we need a **genotype**.
The genotype for this problem is a list of "characters" (individual letters aren't special in Python like they are in some other languages):

```
["h", "e", "l", "l", "o"]
```

and the **phenotype** is the resulting string:

```
"hello"
```

In addition to the generic code and problem specific loss function, you'll need to pick parameters for the run.
These parameters include:

1. population size
2. number of generations
3. probability of crossover
4. probability of mutation

Every **ten** (10) generations, we will print out the fitness, genotype, and phenotype of the best individual in the population for the specific generation.
The function returns the best individual *of the entire run*, using the same format.

In [3]:
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

### generate_random_population

Creates a population of random strings. Each string is of length target_length, and the characters are randomly chosen from the given alphabet. The population size will be population_size.

* **population_size** (int): The size of the population to generate.
* **target_length** (int): The length of each individual string in the population.
* **alphabet** (List[str]): A list of characters that can be chosen to form the strings.

**returns** List[List[str]]: A list of population_size strings, each of length target_length, with characters randomly chosen from alphabet.

In [4]:
def generate_random_population(population_size: int, target_length: int, alphabet: List[str]) -> List[List[str]]:
    return [[random.choice(alphabet) for _ in range(target_length)] for _ in range(population_size)]

### unit test: generate_random_population

In [5]:
random.seed(42)
alphabet = ['A', 'B', 'C']
result = generate_random_population(2, 3, alphabet)
expected_result = [['C', 'A', 'A'], ['C', 'B', 'A']]
assert result == expected_result, f"Expected {expected_result} but got {result}"

### get_genotype

get_genotype takes a phenotype (represented as a string) and converts it into a genotype by breaking the string into a list of its individual characters.

Parameters:

* **phenotype** (str): The phenotype represented as a string, to be converted into a genotype.

**returns** List[str]: A list of strings, with each element being a character from the phenotype string.

In [6]:
def get_genotype(phenotype: str) -> List[str]:
    return list(phenotype)

### unit test: get_genotype

In [7]:
result = get_genotype("ABC")
expected_result = ['A', 'B', 'C']
assert result == expected_result, f"Expected {expected_result} but got {result}"

### get_phenotype

get_phenotype takes a genotype (represented as a list of strings) and converts it into a phenotype by concatenating the strings into a single string.

* **genotype** (List[str]): The genotype represented as a list of strings, to be converted into a phenotype.

**returns** str: A concatenated string of the elements in the genotype list.

In [8]:
def get_phenotype(genotype: List[str]) -> str:
    return "".join(genotype)

### unit test: get_phenotype

In [9]:
result = get_phenotype(['A', 'B', 'C'])
expected_result = 'ABC'
assert result == expected_result, f"Expected {expected_result} but got {result}"

### roulette_wheel_selection

roulette_wheel_selection performs roulette wheel selection based on fitness values. Given a list of tuples, where the third element represents the fitness of an individual, it probabilistically selects an individual based on their fitness, with a higher fitness leading to a higher chance of selection.

* **fitness_values** (List[Tuple[object, object, float]]): A list of tuples, where each tuple represents an individual, phenotype, and fitness value, respectively.

**returns** Tuple[object, object, float]: A tuple representing the selected individual, containing its individual representation, phenotype, and fitness value.

In [10]:
def roulette_wheel_selection(fitness_values: List[Tuple[object, object, float]]) -> Tuple[object, object, float]:
    total_fitness = sum(x[2] for x in fitness_values)  # Use the third element in tuple for fitness
    pick = random.uniform(0, total_fitness)
    current = 0
    for individual_tuple in fitness_values:
        current += individual_tuple[2]  # Adding fitness value
        if current > pick:
            return individual_tuple  # Return the tuple (individual, phenotype, fitness)

### unit test: roulette_wheel_selection

In [11]:
random.seed(42)
fitness_values = [('ind1', 'phen1', 10), ('ind2', 'phen2', 20), ('ind3', 'phen3', 30)]
result = roulette_wheel_selection(fitness_values)
expected_result = ('ind3', 'phen3', 30)
assert result == expected_result, f"Expected {expected_result} but got {result}"

### tournament_selection

tournament_selection performs tournament selection on a given population and their corresponding fitness values. It randomly selects tournament_size individuals from the population and returns the one with the lowest or highest fitness, depending on the value of the minimize parameter.

* **population** (List[object]): The population represented as a list of individuals.
* **fitness_values** (List[Tuple[object, object, float]]): A list of tuples, where each tuple represents an individual, phenotype, and fitness value, respectively.
* **tournament_size** (int, optional): The number of individuals to randomly select for the tournament. Defaults to 10.
* **minimize** (bool, optional): If True, the function returns the individual with the lowest fitness; if False, the individual with the highest fitness. Defaults to True.

**returns** Tuple[object, object, float]: A tuple representing the selected individual from the tournament, containing its individual representation, phenotype, and fitness value.

In [12]:
def tournament_selection(population: List[object], fitness_values: List[Tuple[object, object, float]], tournament_size: int = 10, minimize: bool = True) -> Tuple[object, object, float]:
    selected_individuals = random.sample(list(zip(population, fitness_values)), tournament_size)

    # If minimize is True, return the selected individual with the lowest fitness
    if minimize:
        return min(selected_individuals, key=lambda x: x[1][2])[1]
    # Otherwise, return the selected individual with the highest fitness
    else:
        return max(selected_individuals, key=lambda x: x[1][2])[1]

### unit test: tournament_selection

In [13]:
random.seed(40)
population = ['ind1', 'ind2', 'ind3']
fitness_values = [('ind1', 'phen1', 10), ('ind2', 'phen2', 20), ('ind3', 'phen3', 30)]
result = tournament_selection(population, fitness_values, tournament_size=2, minimize=False)
expected_result = ('ind2', 'phen2', 20)
assert result == expected_result, f"Expected {expected_result} but got {result}"

### pick_parents

pick_parents picks parent individuals from a given population based on their fitness values. The selection method can be either "roulette" (roulette wheel selection) or "tournament" (tournament selection). The minimize parameter controls whether to minimize or maximize the fitness value.

* **population** (List[object]): The population represented as a list of individuals.
* **fitness_values** (List[Tuple[object, object, float]]): A list of tuples, where each tuple represents an individual, phenotype, and fitness value, respectively.
* **method** (str, optional): The method for selecting parents, either "roulette" or "tournament". Defaults to "roulette".
* **minimize** (bool, optional): If True, the selection minimizes fitness; if False, it maximizes. Only used with the "tournament" method. Defaults to True.

**returns** Tuple[object, object, float]: A tuple representing the selected parent, containing its individual representation, phenotype, and fitness value.

In [14]:
def pick_parents(population: List[object], fitness_values: List[Tuple[object, object, float]], method: str = "roulette", minimize: bool = True) -> Tuple[object, object, float]:
    if method == "roulette":
        return roulette_wheel_selection(fitness_values)
    elif method == "tournament":
        return tournament_selection(population, fitness_values, minimize=minimize)
    else:
        raise ValueError("Invalid method")

### unit test: pick_parents

In [15]:
random.seed(42)
population = ['ind1', 'ind2', 'ind3']
fitness_values = [('ind1', 'phen1', 10), ('ind2', 'phen2', 20), ('ind3', 'phen3', 30)]
result = pick_parents(population, fitness_values, method="roulette", minimize=False)
expected_result = ('ind3', 'phen3', 30)
assert result == expected_result, f"Expected {expected_result} but got {result}"

### crossover

crossover takes two parent individuals and performs single-point crossover to create two child individuals. It uses the get_phenotype function to convert the children's genotypes to phenotypes and then evaluates their fitness using the provided eval_func function.

* **parent1** (Tuple[object, object, float]): The first parent, represented as a tuple containing the individual's genotype, phenotype, and fitness.
* **parent2** (Tuple[object, object, float]): The second parent, represented in the same way as the first parent.
* **target** (object): A target value used in the evaluation of the fitness function.
* **alphabet** (object): The alphabet used in the phenotype representation, if applicable.
* **eval_func** (Callable): A function used to evaluate the fitness of an individual. It must take the phenotype, target, and alphabet as parameters and return a fitness value.

**returns** Tuple[Tuple[object, object, float], Tuple[object, object, float]]: Two tuples, each representing one of the two child individuals created through crossover. Each tuple contains the child's genotype, phenotype, and fitness.

In [16]:
def crossover(parent1: Tuple[object, object, float], parent2: Tuple[object, object, float], target: object, alphabet: object, eval_func: Callable) -> Tuple[Tuple[object, object, float], Tuple[object, object, float]]:
    crossover_point = random.randint(1, len(parent1[0]) - 1)
    child1_individual = parent1[0][:crossover_point] + parent2[0][crossover_point:]
    child2_individual = parent2[0][:crossover_point] + parent1[0][crossover_point:]

    child1_phenotype = get_phenotype(child1_individual)
    child2_phenotype = get_phenotype(child2_individual)

    child1_fitness = eval_func(child1_phenotype, target, alphabet)
    child2_fitness = eval_func(child2_phenotype, target, alphabet)

    return (child1_individual, child1_phenotype, child1_fitness), (child2_individual, child2_phenotype, child2_fitness)

### unit test: crossover

In [17]:
def eval_func_example(phenotype, target, alphabet):
    return abs(len(phenotype) - target)

random.seed(42)
parent1 = (['A', 'B', 'C'], 'ABC', 3)
parent2 = (['X', 'Y', 'Z'], 'XYZ', 3)
target = 3
alphabet = ['A', 'B', 'C', 'X', 'Y', 'Z']
result = crossover(parent1, parent2, target, alphabet, eval_func_example)
expected_result = ((['A', 'Y', 'Z'], 'AYZ', 0), (['X', 'B', 'C'], 'XBC', 0))
assert result == expected_result, f"Expected {expected_result} but got {result}"

### mutate

The mutate function takes an individual represented as a tuple and applies a point mutation with a given probability to its genotype. The new phenotype is then calculated, and the new fitness is evaluated using the provided eval_func function.

* **individual_tuple** (Tuple[object, object, float]): The individual to be mutated, represented as a tuple containing its genotype, phenotype, and fitness.
* **mutation_probability** (float): The probability of mutation at each gene.
* **target** (object): A target value used in the evaluation of the fitness function.
* **alphabet** (object): The alphabet used in the phenotype representation, if applicable.
* **eval_func** (Callable): A function used to evaluate the fitness of an individual. It must take the phenotype, target, and alphabet as parameters and return a fitness value.

**returns** Tuple[object, object, float]: A tuple representing the mutated individual, containing its genotype, phenotype, and new fitness.

In [18]:
def mutate(individual_tuple: Tuple[object, object, float], mutation_probability: float, target: object, alphabet: object, eval_func: Callable) -> Tuple[object, object, float]:
    individual, _, _ = individual_tuple
    for i in range(len(individual)):
        if random.random() < mutation_probability:
            individual[i] = random.choice(alphabet)

    new_phenotype = get_phenotype(individual)
    new_fitness = eval_func(new_phenotype, target, alphabet)

    return individual, new_phenotype, new_fitness

### unit test: mutate

In [19]:
def eval_func_example(phenotype, target, alphabet):
    return abs(len(phenotype) - target)

random.seed(42)
individual_tuple = (['A', 'B', 'C'], 'ABC', 3)
mutation_probability = 1  # Assuring mutation at every position
target = 3
alphabet = ['X', 'Y', 'Z']
result = mutate(individual_tuple, mutation_probability, target, alphabet, eval_func_example)
expected_result = (['X', 'X', 'Z'], 'XXZ', 0)
assert result == expected_result, f"Expected {expected_result} but got {result}"

### sort_population_by_fitness

sort_population_by_fitness takes a population of individuals, represented as genotypes, and sorts them based on their fitness. It converts the genotypes to phenotypes, calculates their fitness using the provided eval_func function, and returns a sorted list of tuples containing the genotype, phenotype, and fitness. The sort order is determined by the minimized parameter.

* **population** (List[object]): The population to be sorted, represented as a list of individuals' genotypes.
* **target** (str): A target value used in the evaluation of the fitness function.
* **alphabet** (str): The alphabet used in the phenotype representation, if applicable.
* **eval_func** (Callable): A function used to evaluate the fitness of an individual. It must take the phenotype, target, and alphabet as parameters and return a fitness value.
* **minimized** (bool): If True, the population is sorted in ascending order by fitness (i.e., fitness is minimized); if False, it is sorted in descending order (i.e., fitness is maximized).

**returns** List[Tuple[object, object, float]]: A sorted list of tuples, where each tuple represents an individual and contains its genotype, phenotype, and fitness.

In [20]:
def sort_population_by_fitness(population: List[object], target: str, alphabet: str, eval_func: Callable, minimized: bool) -> List[Tuple[object, object, float]]:
    if minimized:
        return sorted([(individual, get_phenotype(individual), eval_func(get_phenotype(individual), target, alphabet)) for individual in population], key=lambda x: x[2])
    else:
        return sorted([(individual, get_phenotype(individual), eval_func(get_phenotype(individual), target, alphabet)) for individual in population], key=lambda x: x[2], reverse=True)

### unit test: sort_population_by_fitness

In [21]:
def lists_contain_same_elements(list1, list2):
    return all(elem in list2 for elem in list1) and all(elem in list1 for elem in list2)

def eval_func_example(phenotype, target, alphabet):
    return abs(len(phenotype) - int(target))

population = [['A', 'B'], ['X', 'Y'], ['C', 'D']]
target = '2'
alphabet = 'ABCDXY'
minimized = True
result = sort_population_by_fitness(population, target, alphabet, eval_func_example, minimized)
expected_result = [(['A', 'B'], 'AB', 0), (['C', 'D'], 'CD', 0), (['X', 'Y'], 'XY', 0)]
assert lists_contain_same_elements(result, expected_result), f"Expected {expected_result} but got {result}"

### genetic_algorithm

The genetic_algorithm function executes a genetic algorithm to find a solution to a given problem, defined by the evaluation function (eval_func). The algorithm evolves a population over several generations through selection, crossover, and mutation. A part of the top individuals (as determined by top_percent) is retained from one generation to the next. The process continues for a specified number of generations or until a satisfactory solution is found. The phenotype of the best solution is returned.

* **eval_func** (Callable): A callable that evaluates the fitness of a phenotype.
* **target** (str): The target string to be matched or a value used in fitness evaluation.
* **population_size** (int): The size of the population.
* **generations** (int): The number of generations to run the algorithm.
* **crossover_probability** (float): The probability of crossover happening between two parents.
* **mutation_probability** (float): The probability of mutation in an individual.
* **selection_method** (str): The method of selection (e.g., "roulette" or "tournament").
* **top_percent** (float): The percentage of the top individuals to retain in the next generation.
* **alphabet** (str): The alphabet used in the generation of the population and possible mutations.
* **minimize** (bool): A flag to determine if the fitness should be minimized or maximized. Defaults to true.

**returns** (str): The best phenotype

In [22]:
def genetic_algorithm(
        eval_func: Callable[[str, str, str], int], 
        target: str, 
        population_size: int, 
        generations: int, 
        crossover_probability: float, 
        mutation_probability: float, 
        selection_method: str, 
        top_percent: float, 
        alphabet: str,
        minimize: bool = True) -> str:
    
    target_length = len(target)
    population = generate_random_population(population_size, target_length, alphabet)

    for generation in range(generations):
        fitness_values = sort_population_by_fitness(population, target, alphabet, eval_func, minimize)

        # Show the progress every 10 generations by printing the lowest fitness for the individual
        if generation % 10 == 0:
            print(f"Generation {generation}: Phenotype: {fitness_values[0][1]}, Fitness: {fitness_values[0][2]}, Genotype: {fitness_values[0][0]}")

        # If the target is minimized and the lowest fitness is 0 or the phenotype equals the target, return the phenotype
        if minimize and fitness_values[0][2] == 0 or not minimize and fitness_values[0][2] == len(target):
            return fitness_values[0][1]

        new_population = []
        top_population = int(population_size * top_percent)

        # keep the top phenotypes the same
        for p in fitness_values[:top_population]:
            new_population.append(p)

        for _ in range(population_size // 2):
            parent1 = pick_parents(population, fitness_values, method=selection_method, minimize=minimize)
            parent2 = pick_parents(population, fitness_values, method=selection_method, minimize=minimize)
            if random.random() < crossover_probability:
                child1, child2 = crossover(parent1, parent2, target, alphabet, eval_func)
            else:
                child1, child2 = parent1, parent2
            child1 = mutate(child1, mutation_probability, target, alphabet, eval_func)
            child2 = mutate(child2, mutation_probability, target, alphabet, eval_func)
            new_population.extend([[c for c in child1], [c for c in child2]])

        # Grab only the individuals from the tuple
        new_population = [x[0] for x in new_population]
        population = new_population

    return fitness_values[0][1]

#### evaluate_journey1

The function evaluate_journey1 takes two strings, current_phenotype and target_phenotype, and an alphabet string. It computes a fitness score that represents how different the two phenotypes are. The phenotypes are processed through a get_genotype function (not defined here) and then compared character by character. The difference in the characters' positions in the given alphabet contributes to the fitness score. The lower the fitness score, the better.
    
* **current_phenotype** (str): A string representing the current phenotype that needs to be evaluated.
* **target_phenotype** (str): A string representing the target phenotype to compare against.
* **alphabet** (str): A string representing the alphabet used to map characters to indices.

**returns** (int): The function returns an integer representing the fitness score. The fitness score is calculated by summing the absolute differences in indices of differing characters between the current genotype and the target genotype. If a character is not found in the given alphabet, the length of the alphabet is added to the fitness score. A lower fitness score indicates that the current phenotype is closer to the target phenotype.

In [23]:
def evaluate_journey1(current_phenotype: str, target_phenotype: str, alphabet: str) -> int:
    if len(current_phenotype) != len(target_phenotype):
        raise ValueError("The current phenotype and the target phenotype must be of the same length.")

    current_genotype = get_genotype(current_phenotype)
    target_genotype = get_genotype(target_phenotype)

    # Create a dictionary to map each character to its index
    char_to_index = {char: index for index, char in enumerate(alphabet)}

    fitness = 0
    for i in range(len(current_genotype)):
        curr_char = current_genotype[i]
        target_char = target_genotype[i]
        
        if curr_char == target_char:
            continue
        elif curr_char in char_to_index:
            # Use the precomputed indices
            fitness += abs(char_to_index[curr_char] - char_to_index[target_char])
        else:
            fitness += len(alphabet)

    return fitness

## Journey 1

The target is the string "i love to read pureai".
The fitness function calculates a fitness score based on element to element comparisons between target versus phenotype.

In [24]:
target1 = "i love to read pureai"

In [25]:
result1 = genetic_algorithm(
    eval_func=evaluate_journey1,
    target=target1, 
    population_size=100, 
    generations=1000, 
    crossover_probability=0.9, 
    mutation_probability=0.05,
    top_percent=0.3, 
    selection_method="tournament", 
    alphabet=ALPHABET)

Generation 0: Phenotype: kooovgqpzzxfvcjqvutkc, Fitness: 127, Genotype: ['k', 'o', 'o', 'o', 'v', 'g', 'q', 'p', 'z', 'z', 'x', 'f', 'v', 'c', 'j', 'q', 'v', 'u', 't', 'k', 'c']
Generation 10: Phenotype: g ooxgztpzrfdbspqqfbi, Fitness: 33, Genotype: ['g', ' ', 'o', 'o', 'x', 'g', 'z', 't', 'p', 'z', 'r', 'f', 'd', 'b', 's', 'p', 'q', 'q', 'f', 'b', 'i']
Generation 20: Phenotype: i oovgztpzrfab pvqdbi, Fitness: 15, Genotype: ['i', ' ', 'o', 'o', 'v', 'g', 'z', 't', 'p', 'z', 'r', 'f', 'a', 'b', ' ', 'p', 'v', 'q', 'd', 'b', 'i']
Generation 30: Phenotype: i movgztpzreae pvrdbi, Fitness: 10, Genotype: ['i', ' ', 'm', 'o', 'v', 'g', 'z', 't', 'p', 'z', 'r', 'e', 'a', 'e', ' ', 'p', 'v', 'r', 'd', 'b', 'i']
Generation 40: Phenotype: i moveztozreae pvrdbi, Fitness: 7, Genotype: ['i', ' ', 'm', 'o', 'v', 'e', 'z', 't', 'o', 'z', 'r', 'e', 'a', 'e', ' ', 'p', 'v', 'r', 'd', 'b', 'i']
Generation 50: Phenotype: i move to reae purdai, Fitness: 3, Genotype: ['i', ' ', 'm', 'o', 'v', 'e', ' ', 't',

In [26]:
pprint(result1, compact=True)

'i love to read pureai'


## Journey 2

The objective here is to delve deeper into the concept of fitness functions. The string we're aiming for is 'iaerup daer ot evol i', which is clearly target #1 but in reverse. Matching this string would be a simple task if that's all we wanted to do. However, our challenge in this situation is to "decode" the string so that the top individual reveals the target in the correct order. To achieve this, you must design a fitness function to gauge how well candidates are progressing towards this aim. The limitation is that you cannot apply any global operations to the target or individuals, and your fitness function must continue to evaluate a single gene against another single gene. Although your solution might not adhere to conventional Pythonic practices and might rely on indexing, that's acceptable.

The individual that stands out as the best in the population will be the one that renders this string forwards.

In [27]:
target2 = "iaerup daer ot evol i"

### evaluate_journey2

The function evaluate_journey2 takes an individual string, a target string, and an alphabet (though the alphabet is not used in the function) and calculates a score that represents how closely the individual string matches the reverse of the target string. The score starts at the length of the target and is decremented by one for each character that matches its corresponding character in the reverse of the target string. A lower score indicates a better match.


* **individual** (str): The individual string that needs to be evaluated against the reverse of the target string.
* **target** (str): The target string against which the individual string is being evaluated. The individual string is compared to the reverse of this string.
* **alphabet** (str): Though included as a parameter, it is not used within the function

**returns** (int): The score, representing how closely the individual string matches the reverse of the target string. A lower score indicates a better match, with a minimum score of 0 if the individual string is the exact reverse of the target string.

In [28]:
def evaluate_journey2(individual: str, target: str, alphabet: str) -> int:
    # Score starts at maximum possible difference
    score = len(target)
    
    # Loop through each character (gene) in the individual string
    for index, character in enumerate(individual):
        # Get the corresponding character position from the end of the target string
        corresponding_index = len(target) - index - 1
        
        # If characters match, decrement the score
        if character == target[corresponding_index]:
            score -= 1

    # Lower score indicates a better match to the forward version of the target string
    return score


In [29]:
result2 = genetic_algorithm(
    eval_func=evaluate_journey2,
    target=target2, 
    population_size=100, 
    generations=1000, 
    crossover_probability=0.9, 
    mutation_probability=0.05, 
    top_percent=0.3,
    selection_method="tournament", 
    alphabet=ALPHABET,
    minimize=True)

Generation 0: Phenotype:  vam essqadsahsgjpza , Fitness: 18, Genotype: [' ', 'v', 'a', 'm', ' ', 'e', 's', 's', 'q', 'a', 'd', 's', 'a', 'h', 's', 'g', 'j', 'p', 'z', 'a', ' ']
Generation 10: Phenotype: i xovb th readkpureas, Fitness: 5, Genotype: ['i', ' ', 'x', 'o', 'v', 'b', ' ', 't', 'h', ' ', 'r', 'e', 'a', 'd', 'k', 'p', 'u', 'r', 'e', 'a', 's']
Generation 20: Phenotype: i lovn th read pureai, Fitness: 2, Genotype: ['i', ' ', 'l', 'o', 'v', 'n', ' ', 't', 'h', ' ', 'r', 'e', 'a', 'd', ' ', 'p', 'u', 'r', 'e', 'a', 'i']


In [30]:
pprint(result2, compact=True)

'i love to read pureai'


## Journey 3

This is a variation on the theme of Journey 2.
The Caeser Cypher replaces each letter of a string with the letter 13 characters down alphabet (rotating from "z" back to "a" as needed).
This is also known as ROT13 (for "rotate 13").
Latin did not have spaces (and the space is not continguous with the letters a-z) so we'll remove them from our alphabet.
Again, the goal is to derive a fitness function that compares a single gene against a single gene, without global transformations.
This fitness function assigns higher scores to individuals that correctly decode the target.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        You may not apply ROT13 to an entire string (either target or candidate) at any time.
        Everything must be a computation of one gene against one gene.
    </p>
</div>

The best individual will express the target *decoded*.

In [31]:
ALPHABET3 = "abcdefghijklmnopqrstuvwxyz"

In [32]:
target3 = "vybirgbernqchernv"

#### evaluate_journey3

The evaluate_journey3 function decrypts a given string individual using the ROT13 algorithm and compares it to a target string target. The function increments a score every time a decrypted character in the individual string matches the corresponding character in the target string. The final score, representing the number of matching characters, is returned.

* **individual** (str): The input string to be decrypted using ROT13 and compared to the target string.
* **target** (str): The target string to which the decrypted individual string is compared.
* **alphabet** (str): This parameter is actually unused in the code, and its purpose is not clear from the given code snippet.

**returns** (int): The final score, representing the number of characters in the decrypted individual string that match the corresponding characters in the target string. The higher the score, the better the match.

In [33]:
def evaluate_journey3(individual: str, target: str, alphabet: str) -> int:
    score = 0
    
    # Loop through each character (gene) in the individual string
    for index, character in enumerate(individual):
        # Convert the character to its ASCII code
        ascii_code = ord(character)
        
        # Decrypt the letter using ROT13 (shift by 13 places)
        # The ASCII code for 'a' is 97 and for 'z' is 122
        if 'a' <= character <= 'z':
            decrypted_ascii = (ascii_code - ord('a') + 13) % 26 + ord('a')
        else:
            decrypted_ascii = ascii_code
        
        # Convert the ASCII code back to a character
        decrypted_char = chr(decrypted_ascii)
        
        # If decrypted character matches the target character, increment the score
        if decrypted_char == target[index]:
            score += 1

    # Higher score indicates a better match to the decrypted version of the target string
    return score


In [34]:
result3 = genetic_algorithm(
    eval_func=evaluate_journey3,
    target=target3, 
    population_size=100, 
    generations=1000, 
    crossover_probability=0.9, 
    mutation_probability=0.05, 
    selection_method="tournament",
    top_percent=0.3,
    alphabet=ALPHABET3,
    minimize=False)

Generation 0: Phenotype: xjopdttubmvaxrcte, Fitness: 3, Genotype: ['x', 'j', 'o', 'p', 'd', 't', 't', 'u', 'b', 'm', 'v', 'a', 'x', 'r', 'c', 't', 'e']
Generation 10: Phenotype: oloveaozyaopuream, Fitness: 11, Genotype: ['o', 'l', 'o', 'v', 'e', 'a', 'o', 'z', 'y', 'a', 'o', 'p', 'u', 'r', 'e', 'a', 'm']
Generation 20: Phenotype: ilovetozyaipuream, Fitness: 13, Genotype: ['i', 'l', 'o', 'v', 'e', 't', 'o', 'z', 'y', 'a', 'i', 'p', 'u', 'r', 'e', 'a', 'm']
Generation 30: Phenotype: ilovetozeadpurean, Fitness: 15, Genotype: ['i', 'l', 'o', 'v', 'e', 't', 'o', 'z', 'e', 'a', 'd', 'p', 'u', 'r', 'e', 'a', 'n']


In [35]:
pprint(result3, compact=True)

'ilovetoreadpureai'


## Further Thinking

In Journey 3, we assumed we knew what the shift was in ROT-13.
What if we didn't?
Describe how you might solve that problem including a description of the solution encoding (chromosome and interpretation) and fitness function. Assume we can add spaces into the message.

The problem does become more interesting if we look at trying to solve it without knowing the exact shift like we did in problem 3 (ROT-13). 

**Encoding**

The solution encoding could be handled as follows:

For the Chromosome: A chromosome will represent a possible solution and it will consist of two parts:
- The shift value (k) - an integer value representing how many places each letter in the string should be shifted.
- The decoded message - a string that is obtained by applying the shift value to each character in the encoded message.

Interpretation of the Chromosome: The chromosome is interpreted as the decoded message with a particular shift value.

The chromosome could be represented as shown below. It should be noted that if we're working with a standard English alphabet and a space then we have 27 possible characters (0 to 26 indices).

chromosome = {
    'shift_value': <integer between 0 and 26>,
    'deciphered_text': <sequence of characters>
}

Here, `shift_value` is an integer representing how many positions each letter in the cipher text is shifted. `deciphered_text` is a string representing the text obtained by decrypting the cipher text with the given `shift_value`.

When initializing the population, the goal is to create chromosomes with random values for shift_value (between 0 and 26) and random characters for the `deciphered_text`. Note that the initial `deciphered_text` can be generated by applying the `shift_value` to the original encrypted text. The initial population is expected to contain mostly incorrect deciphered texts.

Here is a more concrete example of a single chromosome from the initial population:

{
    'shift_value': 5,
    'deciphered_text': 'afmmp'
}

What is very interesting to note in this problem, that since the `shift_value` is not known at the start of the problem, it is possible to get jibberish text, as we see in the example above. The genetic algorithm that is used in this problem, if used correctly, should evolve to have chromosomes with the correct `shift_value` and the correct `deciphered_text`. The chromosome where the `deciphered_text` matches that actual human-readable text would have a high fitness score, like:

{
    'shift_value': 2,
    'deciphered_text': 'hello'
}

It should be noted that the chromosome can be encoded as follows: [5, 'a', 'f', 'm', 'm', 'p']. The first location of the list/array is always the `shift_value`. The phenotype (string representation) can then be represented as: "5:afmmp" and the string can be split on the colon to obtain the `shift_value` as well as the `deciphered_text`.

**Fitness Function**

The fitness function is very important to help guide the genetic algorithm towards better solutions. The goal of our fitness function should be to evaluate how close the decrypted text is to being a meaningful English word or words, or how closely it matches the target string, if provided. The fitness function could use something like a dictionary look-up to see how many words are actually found in the target language. The more words found in the dictionary would lead to a higher fitness score. For each character present in either the genotype or phenotype (whatever the fitness function requires), the `shift_value` that is part of the chromosome allows the function to decrypt the character. The score is then computed on a character-by-character basis compared to the target text (assuming it's given). This is similar to what is done in problem 3.

It should also be noted that in addition to the dictionary look-up approach, other statistical methods could be used. These can be things like measuring letter frequencies and/or bigram/trigram frequencies (see https://en.wikipedia.org/wiki/Bigram or https://en.wikipedia.org/wiki/Trigram for more details).

**Genetic Algorithm**

There could be further updates to the main genetic algorithm, if required. For the crossover step, it could involve combining the `shift_value` from one parent with the `deciphered_text` from another parent. For mutation, the `shift_value` and/or `deciphered_text` in the chromosome could be altered. Selection is also a very important aspect in the genetic algorithm, as it decides which chromosomes are chosen for reproduction. As we havee seen, chromosomes with higher fitness (or lower fitness depending upon how the function is defined) have a higher chance of being selected. Additionally, it's important to mention that crossover and mutation should be performed in such a way that valid chromosomes are produced (e.g. the `shift_value` should always be within the valid range).

The genetic algorithm should also contain a termination condition which can consist of hitting the max number of defined generations, the min or max allowed fitness score, or even a time limit if one chooses to define it.

I also think it's worth noting that it isn't guaranteed to find the globle optima when using a genetic algorithm. Genetic algorithms are stochastic and no two runs of the algorithm may produce the exact same result. 