How to apply evolution strategies to Nim:
- One possible way is to define a **set of rules**, each covering some subset of game situations and each with an associated weight.
  The weights are optimized by the evolution strategy and define the probability of applying the linked rule.
  Here we're translating the game into a set of rules (that need to be defined) and a set of weights to optimize.
  To learn the weights we can use a "simulation" of the game, where we play against a random player.
  The fitness function is the percentage of games won against the random player -> we're not evaluating the single move but the whole game. To avoid plateaus in the fitness we need an opponent that is neither too bad nor too good.
The genome here is everything the algorithm needs to play (and win) the game. The phenotype is the set of rules and the weights associated with them.
We can add to the phenotype (rules) particular cases like "if we have an odd number of stones and the opponent has an even number of stones we can win" -> this is a particular case of the nimSum rule.
In this case we have an optimal player that can show us the maximum that our algorithm can achieve.
We can also define a "special" rule that uses the nimSum to achieve particular moves (e.g. to win from a certain state). It could be interesting to see if our algorithm converges and finds this rule by itself.
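
The weighted-rules idea above can be sketched as follows. This is only a minimal illustration under several assumptions that are not in the notes: normal play (whoever takes the last object wins), a starting position of `(1, 3, 5, 7)`, three hypothetical rules, and a simple (1+lambda) loop with Gaussian mutation. All function names are made up for the example.

```python
import random
from functools import reduce
from operator import xor

def nim_sum(rows):
    """XOR of all row sizes; zero means the player to move loses under optimal play."""
    return reduce(xor, rows, 0)

# Hypothetical rules: each maps a state (list of row sizes) to a move (row, count).
def take_one_from_largest(rows):
    r = max(range(len(rows)), key=lambda i: rows[i])
    return r, 1

def empty_largest(rows):
    r = max(range(len(rows)), key=lambda i: rows[i])
    return r, rows[r]

def nim_sum_move(rows):
    """Move to a zero nim-sum position if one exists, otherwise take one object."""
    s = nim_sum(rows)
    for i, n in enumerate(rows):
        if n ^ s < n:
            return i, n - (n ^ s)
    return take_one_from_largest(rows)

RULES = [take_one_from_largest, empty_largest, nim_sum_move]

def random_move(rows, rng):
    r = rng.choice([i for i, n in enumerate(rows) if n > 0])
    return r, rng.randint(1, rows[r])

def play(weights, rng, start=(1, 3, 5, 7)):
    """Play one game against the random player; True if the weighted player wins."""
    rows, our_turn = list(start), True
    while True:
        if our_turn:
            # The weights define the probability of applying each rule.
            r, k = rng.choices(RULES, weights=weights)[0](rows)
        else:
            r, k = random_move(rows, rng)
        rows[r] -= k
        if sum(rows) == 0:
            return our_turn  # assumption: taking the last object wins
        our_turn = not our_turn

def fitness(weights, rng, games=200):
    """Fraction of whole games won (noisy, since both players are stochastic)."""
    return sum(play(weights, rng) for _ in range(games)) / games

def evolve(generations=20, lam=5, sigma=0.3, seed=0):
    """(1+lambda) loop: keep the parent unless the best child is at least as good."""
    rng = random.Random(seed)
    parent = [1.0] * len(RULES)
    best = fitness(parent, rng)
    for _ in range(generations):
        children = [[max(1e-6, w + rng.gauss(0, sigma)) for w in parent]
                    for _ in range(lam)]
        f, child = max(((fitness(c, rng), c) for c in children), key=lambda t: t[0])
        if f >= best:
            parent, best = child, f
    return parent, best

if __name__ == "__main__":
    print(evolve())
```

If the search works we would expect the weight of `nim_sum_move` to grow relative to the others, but since the fitness is noisy this is not guaranteed on every run.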

### Causality and 
We're working on the genotype and are able to evaluate only the fitness, that's our only way to "see" our changes and choose the way to follow.
The "locality principle", also called (strong or weak) causality: if a small change in the genotype produces a small change in the fitness that's strong causality, if a small change in the genotype produces a big change in the fitness that's weak causality.
We can use the locality principle to choose the mutation rate: when small genotype changes already produce big fitness changes (weak causality) we should use a small mutation rate, when they produce only small fitness changes (strong causality) we can afford a bigger mutation rate.
Infeasible solution: a solution that cannot be evaluated (for example a solution that has a different encoding from the one we're using). We decide which solutions are infeasible when we define the fitness and the encoding (for example a badly designed fitness for set-covering is incapable of evaluating non-covering solutions).

Try to find one case where we can use ES - Evolution Strategies - and not SA - Simulated Annealing - (and vice versa).
Evolution strategies work **only with FP numbers** and not with integers (a solution is to map all the numbers to int - but there are problems with fitness). (? - check if it's true)
Evolution strategies are good when we have a **big number of parameters** to optimize (for example 1000 parameters) and we can't use gradient descent (because we don't have an explicit, differentiable function to optimize).

John Holland : "Adaptation in Natural and Artificial Systems" -> the book that started the field of evolutionary computation.
John Holland "Royal road" : the idea is that we can define a set of rules that are good for a particular problem and use them to solve it. The problem is that we need to define the rules and that's not always easy.
Problem of one-cut crossover: if a solution has a really good gene, there is a high probability that nearby genes (not so optimal) are selected along with the good gene -> it's called "genetic hitchhiking" (see Wikipedia).
introns : non-coding genetic material  

**Genetic Programming** -> "Genetic Programming: On the Programming of Computers by Means of Natural Selection" (John Koza) -> we can use genetic programming to evolve programs (for example to play a game). The problem is that we need to define the language and its grammar (and that's not always easy).
Now it is used mainly in research and in ML-like applications -> symbolic regression (we have a set of data and we want to find a function that fits the data). It has now been almost taken over by neural networks.
In traditional Genetic Programming:
 - the problem space (phenotype) is the formulas 
 - the fitness is the % of cases where the formula succeeds
 - the solution space (genotypes) is the *parse tree* representation of the formula
Usually we're able to tell the type of algorithm just by looking at the type of genotype used
  - tree -> genetic programming
  - bit string -> genetic algorithm
  - Floating point -> evolution strategy
We can have different parse tree representations, with all values of the same type (e.g. all floating point in a math formula) or not; we can also represent cyclic operations with a parse tree (while loop). A parse tree can be written as an S-expression like (- (* A B) C), i.e. prefix notation, also called Polish Notation (calculators instead use the postfix form, Reverse Polish Notation).
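
As a small illustration of the parse tree / S-expression correspondence, here is a sketch that stores a tree as nested Python tuples, evaluates it, and prints it in prefix form. The tuple encoding and the names `OPS`, `evaluate` and `to_sexpr` are assumptions made for this example only.

```python
import operator

# Hypothetical function set: operator symbol -> binary function.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node, env):
    """Recursively evaluate a parse tree; strings are variables looked up in env."""
    if isinstance(node, tuple):
        op, *args = node
        return OPS[op](*(evaluate(a, env) for a in args))
    if isinstance(node, str):
        return env[node]
    return node  # numeric constant

def to_sexpr(node):
    """Render the tree as an S-expression (prefix / Polish notation)."""
    if isinstance(node, tuple):
        return "(" + " ".join(to_sexpr(n) for n in node) + ")"
    return str(node)

tree = ("-", ("*", "A", "B"), "C")
print(to_sexpr(tree))                             # (- (* A B) C)
print(evaluate(tree, {"A": 2, "B": 3, "C": 4}))   # 2
```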

**Main difference between Genetic Programming and Evolution Strategy :**
 - The size of the genotype is fixed in Evolution Strategy while it is not in Genetic Programming (variable number of loci)
 - GP has a non-linear structure while ES works only on FP values

In GP we can generate offspring by
 - Mutation : a random change in the tree
 - Recombination : swapping two sub-trees
Note that in the original Genetic Programming we have a (really) low mutation probability.
After a mutation, in GP we can have an offspring really different from the parent (and it can also be much bigger); we could use a (1+lambda) strategy that also preserves the parent.
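
The two operators can be sketched on trees stored as nested tuples `(op, child, child, ...)`. This is one possible encoding, not necessarily the one used in the course; the helper names (`paths`, `get`, `replace`) and the uniform choice of the cut point are assumptions.

```python
import random

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) of every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the sub-tree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, new):
    """Return a copy of tree where the sub-tree at path is replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(a, b, rng):
    """Recombination: swap one randomly chosen sub-tree of a with one of b."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))

def mutate(tree, rng, new_subtree):
    """Mutation: replace a random sub-tree with a freshly generated one."""
    p = rng.choice(list(paths(tree)))
    return replace(tree, p, new_subtree(rng))

rng = random.Random(0)
a = ("+", "x", ("*", "y", 2))
b = ("-", ("+", "x", 1), "y")
print(crossover(a, b, rng))
print(mutate(a, rng, lambda r: r.choice(["x", "y", 1, 2])))
```

Because the swapped sub-trees can have very different sizes, a child can be much bigger (or much smaller) than either parent, which is what makes the operators so disruptive.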
Traditionally in GP we have a really large population (500K / 600K) and a purely generational approach to *survival selection*.
Selective pressure : how much more likely the better individuals are to be selected -> if we increase the population we decrease the selective pressure -> split the population in two groups (champions vs others) and take 80% from the first and 20% from the second for the parent selection
Initialization, given the max_depth "D" we have different methodologies:
 - Full (all trees built with max_depth)
 - Grow
 - 
NB all the genetic material is created only during the initialization given that we don't use mutation  
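
A rough sketch of the Full and Grow initializations listed above, again on nested-tuple trees. The notes do not define Grow; the version here follows the usual description (at each node pick either a function or a terminal until max_depth is reached). The function set, the terminal set and `p_terminal` are assumptions for the example.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2}   # hypothetical function set: name -> arity
TERMINALS = ["x", 1.0, 2.0]            # hypothetical terminal set

def full(depth, rng):
    """Full: every branch reaches exactly the maximum depth."""
    if depth == 0:
        return rng.choice(TERMINALS)
    op = rng.choice(list(FUNCTIONS))
    return (op, *(full(depth - 1, rng) for _ in range(FUNCTIONS[op])))

def grow(depth, rng, p_terminal=0.3):
    """Grow: a branch may stop early by picking a terminal before max depth."""
    if depth == 0 or rng.random() < p_terminal:
        return rng.choice(TERMINALS)
    op = rng.choice(list(FUNCTIONS))
    return (op, *(grow(depth - 1, rng, p_terminal) for _ in range(FUNCTIONS[op])))

def depth_of(tree):
    """Depth of a tree; a single terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth_of(c) for c in tree[1:])

rng = random.Random(42)
print(full(2, rng))
print(grow(4, rng))
```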
"Schema Theorem" -> 
Main problem : *destructive operations* like swaps etc. mean that we need a lot of offspring to be sure to have something that resembles the parent and not risk losing material
A solution is to swap only "similar" sub-trees between two individuals, but how do we define "similar"?

Encapsulation : at a certain time we can try to find some "structure" repeated in the current population and then "freeze" it for some time; during that time we're not able to modify that structure, simplifying the recombination work and hopefully protecting some "useful" parts of our population

Problems with GP :
 - Bloat : since offspring can be bigger than the parent, our individuals can end up "bloated" with giant structures in their genome. We can avoid that with operators that detect these situations and destroy the bloated individuals. We can also favor "small" individuals and/or implement a penalty for bigger individuals. Another possibility is to use a "fitness hole", where 10% of the time we compare the size of the individuals rather than the fitness (and choose the smaller)
 - Introns
 - Efficiency -> incredibly low efficiency (we can use parallelization)
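
The "fitness hole" idea for bloat can be sketched as a binary tournament. The function name and signature are hypothetical; `fitness` and `size` are whatever callables the surrounding algorithm provides, and the 10% default comes from the note above.

```python
import random

def fitness_hole_tournament(a, b, fitness, size, rng, hole=0.1):
    """With probability `hole` the smaller individual wins, otherwise the fitter one."""
    if rng.random() < hole:
        return min(a, b, key=size)
    return max(a, b, key=fitness)

# Toy usage: individuals are strings, fitness counts "a", size is the length.
rng = random.Random(0)
big, small = "aaaaabbbbbbbbbb", "aaab"
wins_small = sum(
    fitness_hole_tournament(big, small, lambda s: s.count("a"), len, rng) == small
    for _ in range(1000)
)
print(wins_small)  # roughly 10% of the tournaments go to the smaller individual
```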

Genetic improvement : we can use GP to improve a program (for example a program that plays a game)

**epoch** (or era) : usually a group of generations where we change something in the algorithm (for example the mutation rate)
parent selection in ES : **pure random selection** because we don't want selective pressure at this step -> the selective pressure is then applied in the offspring selection and in general
In ES the ratio between the number of offspring and the population size at each generation is a factor in the overall selective pressure 
On the other hand a "very hard" parent selection is associated with a hill-climbing-like behavior (we're not exploring the space but we're trying to find the best solution in the current region of the space)

In general in EA we have two different selections :
    - Parent Selection : we're selecting the parents for the next generation
    - Survival Selection : we're selecting the individuals that will survive to the next generation
We saw three types of evolutionary algorithm:
    - Genetic Algorithm : *parent selection* is stochastic and based on **fitness** (better individuals are more likely to be selected); *survival selection* is based on **fitness** (we're selecting the best individuals) and is purely deterministic.
    - Evolution Strategy : *parent selection* is purely random and *survival selection* is deterministic and based on **fitness** (we're selecting the best individuals). The ratio between \mu and \lambda is a factor in the selective pressure.
    - Genetic Programming : same as GA, *parent selection* is stochastic and based on **fitness**, *survival selection* is again purely deterministic

ES and GA share a series of key concepts but with different names:

| ES | GA |
|---|---|
| `+` (plus strategy) | Steady State |
| `,` (comma strategy) | Generational |
| Strong/Weak causality | Locality |

In ES we call \mu the population size and \lambda the number of offspring generated at each generation
In a (\mu + \lambda) strategy we put together the parents and the offspring and then we select the best \mu individuals (purely deterministic)
In a (\mu , \lambda) strategy we select the best \mu individuals from the offspring only (purely deterministic)
Steady state in ES (not mapped to anything in GA) is a very peculiar case, (\mu + 1), where we create one new individual and then discard only the worst individual from the population
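
The two survival schemes differ only in the pool the best \mu are taken from. A minimal sketch, assuming individuals are stored in plain lists and `fitness` is a callable to maximize (function names are made up here):

```python
def plus_selection(parents, offspring, fitness, mu):
    """(mu + lambda): the best mu among parents and offspring together."""
    return sorted(parents + offspring, key=fitness, reverse=True)[:mu]

def comma_selection(offspring, fitness, mu):
    """(mu , lambda): the best mu among the offspring only; parents are discarded."""
    if len(offspring) < mu:
        raise ValueError("comma selection needs at least mu offspring")
    return sorted(offspring, key=fitness, reverse=True)[:mu]

# Toy usage: individuals are numbers and the fitness is the value itself.
parents, offspring = [9, 5], [7, 6, 1]
print(plus_selection(parents, offspring, lambda x: x, mu=2))   # [9, 7]
print(comma_selection(offspring, lambda x: x, mu=2))           # [7, 6]
```

With the comma strategy the good parent `9` is lost, which is the price paid for not letting old individuals dominate the population forever. The (\mu + 1) steady-state case is `plus_selection` called with a single offspring.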

Steady State (in GA) - we put together the parent and the offspring and then we apply survival selection
Generational (in GA) - we select the best individuals from the offspring

We can hybridize different algorithms (for example GP and ES): we start from a GP structure and then at some point we "freeze" the GP procedure and start to "tweak" the structure in an ES way (for example with mutations)

If the tournament size = 1 we have zero selective pressure; with size = 2 we have the same selective pressure as *linearized fitness roulette*. In general, if we increase the tournament size we increase the selective pressure. With infinite size we're always selecting the best individual. Usually we use tournament size = 2 or 3, nothing above 4/5.
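
A quick sketch showing how the tournament size drives the selective pressure. Contestants are drawn without replacement here, which is an assumption (drawing with replacement is also common); the population and fitness are toy values.

```python
import random

def tournament(population, fitness, k, rng):
    """Pick k random individuals and return the fittest of them."""
    return max(rng.sample(population, k), key=fitness)

rng = random.Random(0)
population = list(range(100))          # toy individuals: fitness is the value itself

def mean_selected(k, trials=2000):
    """Average fitness of the selected individual over many tournaments."""
    return sum(tournament(population, lambda x: x, k, rng) for _ in range(trials)) / trials

for k in (1, 2, 3, 5):
    print(k, round(mean_selected(k), 1))
```

With k = 1 the selection is uniform (no pressure), and the average fitness of the winner rises as k grows.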

In GP, to manage selective pressure, we split the population in two (best 50% and worst 50%) and then select 80% from the first and 20% from the second to form a tournament

## Evolutionary Programming
The idea of Fogel is to study "intelligence" (i.e. the ability to do smart things). A way to obtain an "intelligent" algorithm is to try to simulate what is going to happen and then act accordingly
If we're able to create a machine able to "foresee" something then that machine can be considered intelligent. For example an SVM can be trained to foresee a sequence of numbers. 
The first evolutionary programming applications were neural nets for predicting future values and for weighting game moves (Blondie24)