# Decode Encoded String Using A Dictionary
In this computer assignment, we implement a genetic algorithm (GA) to find the key for decoding an encoded string. Answers to the assignment questions are written in **bold**.

## Libraries
1. `re` is imported to remove non-alphabetic characters from both the dictionary and the encoded string.

2. `time` is imported to measure how long solving the problem takes.

In [4]:
import re
import time
import string
import random
import operator
import numpy as np
from IPython.display import Markdown, display
from string import ascii_lowercase, ascii_uppercase, whitespace, punctuation

## Data
Two files are given: a dictionary, "global_text.txt", and an encoded string, "encoded_text.txt". Both contain alphabetic and non-alphabetic characters, so the non-alphabetic characters must be removed before the files are used.
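
As a minimal sketch of this cleaning step, the snippet below pulls lowercase alphabetic words out of raw text with a regular expression. The helper name `extract_words` is an assumption made for illustration; it is not part of the assignment code.

```python
import re

def extract_words(text):
    """Return the lowercase alphabetic words found in `text` (hypothetical helper)."""
    # Every maximal run of letters counts as one word; digits and punctuation split words.
    return re.findall(r"[a-z]+", text.lower())

sample = "Hello, World! It's 2024."
print(extract_words(sample))  # ['hello', 'world', 'it', 's']
```

Note that an apostrophe splits a word in two, so "It's" becomes "it" and "s".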

## Decoder
`Decoder` is in charge of cleaning and decoding the encoded text, using methods such as `fitting_fcn()`, `crossover()` and `mutation()`.

### 1. `prepare_dictionary()`
This method removes all non-alphabetic characters from the dictionary file, splits the remaining text into words, and stores the words in a dictionary keyed by word length.
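
The grouping by length could look like the sketch below. It uses `collections.defaultdict`, which is a substitution for illustration, and `group_by_length` is a hypothetical name.

```python
from collections import defaultdict

def group_by_length(words):
    """Map each word length to the set of words of that length (hypothetical helper)."""
    groups = defaultdict(set)
    for word in words:
        groups[len(word)].add(word)
    return dict(groups)

groups = group_by_length(["the", "cat", "sat", "on", "the", "mat"])
print(groups[2])                       # {'on'}
print("cat" in groups.get(3, set()))   # True
print("cat" in groups.get(7, set()))   # False
```

Looking up a candidate word then only touches the set for its own length, and `.get()` keeps the lookup safe for lengths that never occur in the dictionary.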

### 2. `generate_population()`
It generates a random population of chromosomes of a given size. **A chromosome is a one-to-one mapping between alphabetic characters.**
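
To make the idea of a chromosome concrete, here is a small sketch that builds one random one-to-one mapping and applies it to a string with `str.maketrans`. The function name `random_chromosome` and the seeded generator are assumptions made for illustration.

```python
import random
from string import ascii_lowercase

def random_chromosome(rng):
    """Return a random one-to-one mapping from lowercase letters to lowercase letters."""
    shuffled = rng.sample(ascii_lowercase, len(ascii_lowercase))
    return dict(zip(ascii_lowercase, shuffled))

rng = random.Random(0)  # seeded only so the sketch is repeatable
chromosome = random_chromosome(rng)

# Every letter appears exactly once among the values, so the mapping is a permutation.
print(sorted(chromosome.values()) == list(ascii_lowercase))  # True

# A chromosome is applied to text through a translation table.
table = str.maketrans(chromosome)
print("abc".translate(table) == chromosome["a"] + chromosome["b"] + chromosome["c"])  # True
```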

### 3. `fitting_fcn()`
To implement a GA solution, it's necessary to evaluate the goodness of each chromosome. If we call the decoded text d and the dictionary dict, then the fitting value of a chromosome is the sum of the lengths of the words in the decoded text that exist in the dictionary.
$$fitting\_value \triangleq \sum\limits_{word \in d \ \& \ dict } length(word)$$
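
A small worked example of this score, under assumed toy data: the words below were "encoded" by shifting each letter one place forward, so the correct key shifts each letter one place back. The function `fitness` and the toy dictionary are hypothetical.

```python
from string import ascii_lowercase

def fitness(chromosome, encoded_words, dictionary_words):
    """Sum the lengths of the encoded words whose decoding is a dictionary word."""
    table = str.maketrans(chromosome)
    return sum(len(w) for w in encoded_words if w.translate(table) in dictionary_words)

dictionary_words = {"cat", "sat", "dog"}
encoded_words = ["dbu", "tbu", "yza"]  # "cat", "sat", "xyz" shifted forward by one

# Correct key: every letter maps to the previous letter ('a' wraps around to 'z').
shift_back = {c: ascii_lowercase[(i - 1) % 26] for i, c in enumerate(ascii_lowercase)}
identity = {c: c for c in ascii_lowercase}

print(fitness(shift_back, encoded_words, dictionary_words))  # 6
print(fitness(identity, encoded_words, dictionary_words))    # 0
```

With the correct key, "cat" and "sat" are found (3 + 3 = 6), while "xyz" is not a dictionary word and contributes nothing.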

### 4. `crossover()`
In this method, two given chromosomes (parent1 and parent2) are combined to create two new chromosomes (child1 and child2). After a random crossover index r is chosen, child1 takes positions [0, r) from parent1 and positions [r, 26) from parent2, while child2 takes positions [0, r) from parent2 and positions [r, 26) from parent1. Each child keeps a set of the characters it has already used so that no character is repeated; positions that would repeat a character are filled with the remaining unused characters.
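
The repair step is easiest to see on a short alphabet. The sketch below builds a single child on a six-letter alphabet; `one_point_crossover` is a hypothetical name, and unlike the class method it fills leftover positions in alphabetical order so the result is deterministic.

```python
def one_point_crossover(parent1, parent2, point, alphabet):
    """Child takes parent1's genes before `point` and parent2's genes from `point` on."""
    child, used = {}, set()
    for c in alphabet[:point]:
        child[c] = parent1[c]
        used.add(parent1[c])
    for c in alphabet[point:]:
        # Copy from parent2 only if that value is not already taken.
        if parent2[c] not in used:
            child[c] = parent2[c]
            used.add(parent2[c])
    # Repair: fill the skipped positions with the values nobody used yet.
    unused = [v for v in alphabet if v not in used]
    for c in alphabet:
        if c not in child:
            child[c] = unused.pop(0)
    return child

alphabet = "abcdef"
p1 = dict(zip(alphabet, "badcfe"))
p2 = dict(zip(alphabet, "fedcba"))
child = one_point_crossover(p1, p2, 3, alphabet)
print("".join(child[c] for c in alphabet))  # badcef
```

Here the genes for `e` and `f` in parent2 (`b` and `a`) are already used by the first half, so those two positions are repaired with the unused values `e` and `f`.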

### 5. `mutation()`
Mutation swaps the values of two genes (characters) in a chromosome. The purpose of mutation in GAs is to introduce diversity into the sampled population. **Mutation operators are used in an attempt to avoid local optima by preventing the chromosomes in the population from becoming too similar to each other, because too much similarity slows or even stops convergence to the global optimum. Therefore, using `crossover()` without `mutation()` is not enough to reach a global optimum, especially when the first generation has a small population.**

**In other words, `crossover()` is efficient at the beginning because it combines chromosomes to improve their fitting value, but in the final steps we may converge to a local optimum because of the limited diversity of the initial population. `mutation()` plays the key role here: it changes chromosomes randomly to prevent convergence to a local optimum.**
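
A swap mutation keeps the chromosome a valid one-to-one mapping, which the sketch below illustrates. `swap_mutation` is a hypothetical name, and unlike the class method it returns a copy instead of changing the chromosome in place.

```python
import random
from string import ascii_lowercase

def swap_mutation(chromosome, rate, rng):
    """With probability `rate`, swap the values of two randomly chosen genes."""
    mutated = dict(chromosome)
    if rng.random() < rate:
        a, b = rng.sample(sorted(mutated), 2)
        mutated[a], mutated[b] = mutated[b], mutated[a]
    return mutated

rng = random.Random(0)  # seeded only so the sketch is repeatable
identity = {c: c for c in ascii_lowercase}
mutated = swap_mutation(identity, 1.0, rng)

changed = [c for c in ascii_lowercase if mutated[c] != identity[c]]
print(len(changed))                                       # 2
print(sorted(mutated.values()) == list(ascii_lowercase))  # True
```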

### 6. `generate_next_generation()`
After evaluating the fitting value of each chromosome in the last generation, chromosomes are selected pairwise, and `crossover()` and `mutation()` are applied to each pair.
The higher a chromosome's fitting value, the higher its chance of selection. In other words, the chance of selecting a chromosome is:

$$p \triangleq \frac{fitting\_fcn(chromosome)}{\sum\limits_{C \in all \ chromosomes } fitting\_fcn(C)}$$

**This selection policy not only helps us build the next generation from the best chromosomes, but its randomness also reduces selection bias and similarity among chromosomes.**
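
The selection probability can be sketched with the standard library alone. The names below are hypothetical, `random.choices` stands in for `np.random.choice`, and the uniform fallback for an all-zero generation is an assumption added for this sketch.

```python
import random

def selection_probabilities(fitnesses):
    """Turn fitting values into selection probabilities that sum to 1."""
    total = sum(fitnesses)
    if total == 0:
        # Assumption: if nothing scores, every chromosome is equally likely.
        return [1 / len(fitnesses)] * len(fitnesses)
    return [f / total for f in fitnesses]

def select_parents(population, fitnesses, rng):
    """Draw two parents (with replacement) in proportion to their fitting values."""
    return rng.choices(population, weights=selection_probabilities(fitnesses), k=2)

print(selection_probabilities([6, 3, 1, 0]))  # [0.6, 0.3, 0.1, 0.0]

rng = random.Random(0)
print(select_parents(["A", "B"], [5, 0], rng))  # ['A', 'A']
```

Because the draw is with replacement, the same chromosome can be picked as both parents; a chromosome with a fitting value of 0 is never picked while any other chromosome scores above 0.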

### 7. `merge_two_generation()`
The chromosomes with the highest fitting values across the last generation and the new generation are selected for the next iteration. **It's important to keep the size of each generation constant. Although a growing population increases variation, generating each new generation would take more time than before.**
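
This merge step amounts to sorting both generations together and truncating. A minimal sketch follows, with toy "chromosomes" (strings) and a toy fitness (count of the letter `a`), both chosen only for illustration; `keep_fittest` is a hypothetical name.

```python
def keep_fittest(old_generation, new_generation, size, fitness):
    """Merge two generations and keep the `size` individuals with the highest fitness."""
    merged = old_generation + new_generation
    return sorted(merged, key=fitness, reverse=True)[:size]

old = ["aaa", "ab", "b"]
new = ["aaaa", "a", ""]
survivors = keep_fittest(old, new, 3, fitness=lambda s: s.count("a"))
print(survivors)  # ['aaaa', 'aaa', 'ab']
```

The population size stays at 3 no matter how many children were generated, and the best individual found so far is never lost.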

### 8. `decode()`
It's the main method of our genetic algorithm. It generates new generations in a loop until all words of the decoded string exist in the dictionary. At the end it stores the key in `self.key` and returns the decoded string.
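
The stopping rule can be sketched independently of the GA details. In the sketch below, `run_until`, its callbacks, and the `max_generations` cap are all assumptions for illustration; the class method itself loops without a cap.

```python
def run_until(target, step, score, state, max_generations=1000):
    """Apply `step` repeatedly until `score(state)` reaches `target` or the cap is hit."""
    for generation in range(max_generations):
        if score(state) >= target:
            return state, generation
        state = step(state)
    return state, max_generations

# The target is the total number of letters: every word must be a dictionary word.
encoded_words = ["dbu", "tbu"]
target = sum(len(w) for w in encoded_words)
print(target)  # 6

# Toy run: the "state" is just a number that improves by 1 per generation.
print(run_until(target, step=lambda s: s + 1, score=lambda s: s, state=0))  # (6, 6)
```

A cap like this is one way to guard against a run that never reaches the target, for example when the text contains a word that is missing from the dictionary.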

In [10]:
class Decoder:
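    def prepare_dictionary(self, filename):
        # Read the dictionary file passed by the caller and close it afterwards.
        with open(filename) as f:
            raw_dictionary_txt = f.read()
        dictionary_words = re.findall(r"[a-zA-Z]+", raw_dictionary_txt.lower())
        # Group words by length so a lookup only checks words of the same length.
        dictionary = {len(w) : set() for w in dictionary_words}
        for w in dictionary_words:
            dictionary[len(w)].add(w)
        return dictionary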
    def generate_population(self, population_size):
        population = []
        for _ in range(population_size):
            shuffle_alphabet = random.sample(ascii_lowercase, len(ascii_lowercase))
            population.append({list(ascii_lowercase)[i]: shuffle_alphabet[i] for i in range(len(ascii_lowercase))})
        return population
    def __init__(self, encoded_text, population_size = 500, dictionary_filename = "global_text.txt"):
        self.encoded_text = encoded_text
        self.encoded_text_words = re.findall(r"[a-zA-Z]+", self.encoded_text.lower())
        self.dictionary = self.prepare_dictionary(dictionary_filename)
        self.population = self.generate_population(population_size)
        self.key = None
    def fitting_fcn(self, chromosomes):
        scores = [0 for _ in range(len(chromosomes))]
        for i, chromosome in enumerate(chromosomes):
            key = str.maketrans(chromosome)
            for w in self.encoded_text_words:
                # .get() avoids a KeyError when no dictionary word has this length.
                if w.translate(key) in self.dictionary.get(len(w), ()):
                    scores[i] += len(w)
        return scores
    def crossover(self, parent1, parent2):
        child1 = {}
        child2 = {}
        cross_over_point = random.choice(ascii_lowercase)
        used_chars1 = set()
        used_chars2 = set()
        for c in ascii_lowercase:
            if c < cross_over_point:
                child1[c] = parent1[c]
                used_chars1.add(parent1[c])
            else:
                child2[c] = parent1[c]
                used_chars2.add(parent1[c])
        for c in ascii_lowercase:
            if (c < cross_over_point) and (not parent2[c] in used_chars2):
                child2[c] = parent2[c]
                used_chars2.add(parent2[c])
            elif (c >= cross_over_point) and (not parent2[c] in used_chars1):
                child1[c] = parent2[c]
                used_chars1.add(parent2[c])
        remain_chars1 = set(ascii_lowercase).difference(used_chars1)
        remain_chars2 = set(ascii_lowercase).difference(used_chars2)
        for c in ascii_lowercase:
            if not c in child1:
                child1[c] = remain_chars1.pop()
            if not c in child2:
                child2[c] = remain_chars2.pop()
        return child1, child2
    def mutation(self, chromosome, mutation_rate):
        if random.random() < mutation_rate:
            char1, char2 = random.sample(ascii_lowercase, 2)
            chromosome[char1], chromosome[char2] = chromosome[char2], chromosome[char1]
        return chromosome
    def generate_next_generation(self, mutation_rate):
        next_generation = []
        population_size = len(self.population)
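        fitnesses = np.array(self.fitting_fcn(self.population), dtype=float)
        total_fitness = np.sum(fitnesses)
        # Use uniform selection (p=None) if every chromosome scores 0, to avoid dividing by zero.
        probabilities = fitnesses / total_fitness if total_fitness > 0 else None
        for i in range(population_size):
            parent_index1, parent_index2 = np.random.choice(population_size, 2, p = probabilities)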
            child1, child2 = self.crossover(self.population[parent_index1], self.population[parent_index2])
            child1 = self.mutation(child1, mutation_rate)
            child2 = self.mutation(child2, mutation_rate)
            next_generation.append(child1)
            next_generation.append(child2)
        return next_generation
    def merge_two_generation(self, generation1, generation2, generation_size):
        total_generations = generation1 + generation2
        fitnesses = np.array(self.fitting_fcn(total_generations))
        sorted_indices = sorted(range(len(fitnesses)), key=lambda k: fitnesses[k], reverse=True)
        return list(operator.itemgetter(*sorted_indices)(total_generations)[0:generation_size])
    def prepare_key_table(self):
        # Build a Markdown table: encoded letters in the header row, decoded letters below.
        keys_string = "|"
        divider_string = "\n|"
        values_string = "\n|"
        for key, value in sorted(self.key.items(), key=lambda item: item[0]):
            keys_string += key + '|'
            divider_string += "---|"
            values_string += value + '|'
        return keys_string + divider_string + values_string
    def decode(self, mutation_rate = 0.7):
        fitnesses = np.array(self.fitting_fcn(self.population))
        while np.max(fitnesses) != len(''.join(self.encoded_text_words)):
            new_generation = self.generate_next_generation(mutation_rate)
            self.population = self.merge_two_generation(self.population, new_generation, len(self.population))
            fitnesses = np.array(self.fitting_fcn(self.population))
        lowercase_key = self.population[np.argmax(fitnesses)]
        self.key = lowercase_key
        # Extend the key to uppercase letters so the original capitalization is preserved.
        uppercase_key = {x.upper(): y.upper() for x, y in lowercase_key.items()}
        uppercase_key.update(lowercase_key)
        key = str.maketrans(uppercase_key)
        return self.encoded_text.translate(key)

## Results
The genetic algorithm has two hyperparameters: *mutation rate* and *population size*.
**In this problem, the hyperparameters are tuned manually. There is a trade-off between variation within a generation and computational speed. Increasing the population size gives a more diverse population but costs speed. Mutation gives a chance to escape a local optimum, but if the rate is set too high, the search turns into a primitive random search.**
We used: $$mutation \ rate = 0.7,\qquad population \ size = 512.$$

In [11]:
encoded_text = open("encoded_text.txt").read()
d = Decoder(encoded_text, 512)
start_time = time.time()
decoded_text = d.decode()
display(Markdown("### Time: \n" +  str(time.time() - start_time) + " seconds"))
display(Markdown("### Key:\n" + d.prepare_key_table()))
display(Markdown("### Decoded text:\n" + decoded_text))

### Time: 
65.22407627105713 seconds

### Key:
|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|o|r|s|f|w|m|b|t|i|z|g|h|k|n|v|e|l|p|d|j|c|u|y|q|a|x|

### Decoded text:
This response originally fell into a bit bucket.  I'm reposting it
just so Bill doesn't think I'm ignoring him.

In article <C4w5pv.JxD@darkside.osrhe.uoknor.edu> bil@okcforum.osrhe.edu (Bill Conner) writes:
>Jim Perry (perry@dsinc.com) wrote:
>
>[Some stuff about Biblical morality, though Bill's quote of me had little
> to do with what he goes on to say]

Bill,

I'm sorry to have been busy lately and only just be getting around to
this.

Apparently you have some fundamental confusions about atheism; I think
many of these are well addressed in the famous FAQ.  Your generalisms
are then misplaced -- atheism needn't imply materialism, or the lack
of an absolute moral system.  However, I do tend to materialism and
don't believe in absolute morality, so I'll answer your questions.

>How then can an atheist judge value? 

An atheist judges value in the same way that a theist does: according
to a personal understanding of morality.  That I don't believe in an
absolute one doesn't mean that I don't have one.  I'm just explicit,
as in the line of postings you followed up, that when I express
judgment on a moral issue I am basing my judgment on my own code
rather than claiming that it is in some absolute sense good or bad.
My moral code is not particular different from that of others around
me, be they Christians, Muslims, or atheists.  So when I say that I
object to genocide, I'm not expressing anything particularly out of
line with what my society holds.

If your were to ask why I think morality exists and has the form it
does, my answer would be mechanistic to your taste -- that a moral
code is a prerequisite for a functioning society, and that humanity
probably evolved morality as we know it as part of the evolution of
our ability to exist in large societies, thereby achieving
considerable survival advantages.  You'd probably say that God just
made the rules.  Neither of us can convince the other, but we share a
common understanding about many moral issues.  You think you get it
from your religion, I think I get it (and you get it) from early
childhood teaching.

>That you don't like what God told people to do says nothing about God
>or God's commands, it says only that there was an electrical event in your
>nervous system that created an emotional state that your mind coupled
>with a pre-existing thought-set to form that reaction. 

I think you've been reading the wrong sort of comic books, but in
prying through the gobbledygook I basically agree with what you're
saying.  I do believe that my mental reactions to stimuli such as "God
commanded the genocide of the Canaanites" is mechanistic, but of
course I think that's true of you as well.  My reaction has little to
do with whether God exists or even with whether I think he does, but
if a god existed who commanded genocide, I could not consider him
good, which is supposedly an attribute of God.

>All of this being so, you have excluded
>yourself from any discussion of values, right, wrong, goood, evil,
>etc. and cannot participate. Your opinion about the Bible can have no
>weight whatsoever.

Hmm.  Yes, I think some heavy FAQ-reading would do you some good.  I
have as much place discussing values etc. as any other person.  In
fact, I can actually accomplish something in such a discussion, by
framing the questions in terms of reason: for instance, it is clear
that in an environment where neighboring tribes periodically attempt
to wipe each other out based on imagined divine commands, then the
quality of life will be generally poor, so a system that fosters
coexistence is superior, if quality of life is an agreed goal.  An
absolutist, on the other hand, can only thump those portions of a
Bible they happen to agree with, and say "this is good", even if the
act in question is unequivocally bad by the standards of everyone in
the discussion.  The attempt to define someone or a group of people as
"excluded from discussion", such that they "cannot participate", and
their opinions given "no weight whatsoever" is the lowest form or
reasoning (ad hominem/poisoning the well), and presumably the resort
of someone who can't rationally defend their own ideas of right,
wrong, and the Bible.
-- 
Jim Perry   perry@dsinc.com   Decision Support, Inc., Matthews NC
These are my opinions.  For a nominal fee, they can be yours.

-- 
Jim Perry   perry@dsinc.com   Decision Support, Inc., Matthews NC
These are my opinions.  For a nominal fee, they can be yours.
