# Evolution Strategies
### (1+1)-ES, (1+\lambda)-ES, (1,\lambda)-ES
These variants change the pool of candidates used to find the best solution:
- (1+1)-ES is a first-improvement algorithm that only keeps the best solution between the current one and the new one, where tweak() is a Gaussian mutation
- (1+\lambda)-ES generates \lambda new solutions with the same tweak() and keeps the best among them and the current solution
- (1,\lambda)-ES is a steepest-step algorithm where \lambda is the sample size and tweak() is performed using a Gaussian mutation like in (1+1)
    The current solution is *always* replaced by the best solution found in the sample, even if it's worse. Using a bigger sample size \lambda, the algorithm is more likely to find a better solution (or at least one "similar" to the current, lowering the risk of losing the current state), but it's also more expensive.
Causality vs weak causality:
 - Causality: The evaluation function is "smooth" and our sampling will return "coherent" results (small changes in the solution give small changes in the evaluation), driving us to the best solution
 - Weak causality: The evaluation function is more "erratic" and our sampling can return very different results even with small changes
(1,\lambda) generates all the \lambda solutions from the starting solution.
The normal distribution is a good choice for the mutation, but how do we select the step-size \sigma? We can use a random guess, rely on experience, or meta-optimize it.
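
The two extreme variants described above can be sketched as follows. This is a minimal illustration for a minimization problem, assuming a fixed \sigma; the names `tweak`, `one_plus_one_es`, `one_comma_lambda_es` and the `sphere` test function are hypothetical and not part of the notes.

```python
import random

def tweak(x, sigma):
    """Gaussian mutation: add N(0, sigma) noise to every coordinate."""
    return [xi + random.gauss(0.0, sigma) for xi in x]

def one_plus_one_es(f, x, sigma=0.1, iterations=1000):
    """(1+1)-ES: the child replaces the parent only if it is not worse."""
    fx = f(x)
    for _ in range(iterations):
        child = tweak(x, sigma)
        fc = f(child)
        if fc <= fx:
            x, fx = child, fc
    return x, fx

def one_comma_lambda_es(f, x, sigma=0.1, lam=10, iterations=1000):
    """(1,lambda)-ES: the parent is always replaced by the best of lam children."""
    for _ in range(iterations):
        children = [tweak(x, sigma) for _ in range(lam)]
        x = min(children, key=f)  # may be worse than the old parent
    return x, f(x)

def sphere(x):
    """Smooth test function (assumed for illustration), minimum at the origin."""
    return sum(xi * xi for xi in x)

random.seed(0)
print(one_plus_one_es(sphere, [2.0, -3.0]))
print(one_comma_lambda_es(sphere, [2.0, -3.0]))
```
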
### Dynamic Strategies
We have the correct balance between exploration and exploitation when the number of successful mutations is *around 1/5* of the total mutations. We can use this to dynamically change the balance by tweaking the step-size \sigma: if fewer than 1/5 of the mutations succeed we diminish \sigma (more exploitation), if more than 1/5 succeed we increase it (more exploration). It also takes some time to see the results of a change, so we need to keep \sigma fixed for some iterations (called an *era*).
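
A possible sketch of the 1/5 rule on top of a (1+1)-ES, for minimization. The era length, the number of eras and the scaling factor 0.85 are assumed values chosen for illustration, and `one_fifth_es` is a hypothetical name.

```python
import random

def one_fifth_es(f, x, sigma=1.0, era=20, eras=50, factor=0.85):
    """(1+1)-ES whose sigma is updated once per era using the 1/5 rule."""
    fx = f(x)
    for _ in range(eras):
        successes = 0
        for _ in range(era):              # sigma stays fixed during the era
            child = [xi + random.gauss(0.0, sigma) for xi in x]
            fc = f(child)
            if fc < fx:                   # successful mutation
                x, fx = child, fc
                successes += 1
        rate = successes / era
        if rate > 1 / 5:
            sigma /= factor               # too many successes: explore more
        elif rate < 1 / 5:
            sigma *= factor               # too few successes: exploit more
    return x, fx, sigma

random.seed(0)
print(one_fifth_es(lambda v: sum(vi * vi for vi in v), [5.0, 5.0]))
```
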
*Endogenous Parameter* : a parameter that the algorithm is able to set by itself, without user intervention
*Problem Space* : the parameters that the user is required to set for the problem
Our goal is to find the best value for \sigma. We can adopt *self-adaptation* to treat \sigma as an endogenous parameter, by putting \sigma into the list of problem parameters and letting the algorithm find the best value for it.
First, we need to mutate \sigma and the parameters in *different* steps. We call learning rate \tau the step-size used for \sigma, given by \tau = 1/\sqrt{n}, where n is the number of parameters. We can use the following formula to mutate \sigma:
\sigma' = \sigma * e^{\tau * N(0,1)}
where N(0,1) is a random number from a normal distribution with mean 0 and variance 1.
v_i' = N(v_i, \sigma')
Here we are computing the new sigma and then using it to compute the new values
Remember to always mutate sigma *before* the parameters, otherwise the parameters will be mutated with the old sigma.
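
The two mutation steps can be sketched like this, assuming a single \sigma shared by all the parameters; `self_adaptive_mutation` is a hypothetical name.

```python
import math
import random

def self_adaptive_mutation(v, sigma):
    """Mutate sigma first, then use the new sigma to mutate the parameters."""
    n = len(v)
    tau = 1.0 / math.sqrt(n)                                    # learning rate
    new_sigma = sigma * math.exp(tau * random.gauss(0.0, 1.0))  # sigma'
    new_v = [random.gauss(vi, new_sigma) for vi in v]           # v_i' = N(v_i, sigma')
    return new_v, new_sigma

random.seed(0)
print(self_adaptive_mutation([0.0, 0.0, 0.0, 0.0], 1.0))
```

Since \sigma is multiplied by an exponential, it always stays positive.
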
### Self-Adaptation with double learning rate
Another strategy is to have two different learning rates, one "global" learning rate and another "coordinate-wise" (axis-specific) learning rate. In the usual formulation every axis i has its own step-size \sigma_i: the global learning rate scales all the step-sizes together, while the coordinate-wise learning rate perturbs each \sigma_i on its own, and the parameters are then mutated with their own \sigma_i. This lets us handle multidimensional parameters whose axes have different "importance" and/or variance.
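
A minimal sketch of one common textbook form of this scheme. It assumes one step-size per axis and the learning rates 1/\sqrt{2n} (global) and 1/\sqrt{2\sqrt{n}} (coordinate-wise); these values and the function name are assumptions, not taken from the notes.

```python
import math
import random

def mutate_with_two_learning_rates(v, sigmas):
    """Mutate one step-size per axis, then mutate the parameters with them."""
    n = len(v)
    tau_global = 1.0 / math.sqrt(2.0 * n)             # assumed default
    tau_local = 1.0 / math.sqrt(2.0 * math.sqrt(n))   # assumed default
    shared = tau_global * random.gauss(0.0, 1.0)      # one draw shared by all axes
    new_sigmas = [
        s * math.exp(shared + tau_local * random.gauss(0.0, 1.0))  # per-axis draw
        for s in sigmas
    ]
    new_v = [random.gauss(vi, s) for vi, s in zip(v, new_sigmas)]
    return new_v, new_sigmas

random.seed(0)
print(mutate_with_two_learning_rates([0.0, 0.0, 0.0], [1.0, 0.1, 10.0]))
```
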
### Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
We also add k parameters to the problem (\alpha_1 ... \alpha_k) that are used to compute the Covariance Matrix C, given by
c_{ii} = \sigma_i^2
c_{ij} = \sigma_i * \sigma_j * \rho_{ij}
where \rho_{ij} is the correlation between the parameters i and j.
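
The sketch below only shows how C can be built from the step-sizes and the correlations, and how a correlated mutation can be sampled from it. It is not the full CMA-ES update rule. The function names are hypothetical, and the sampling through a Cholesky factor is an assumed implementation choice.

```python
import math
import random

def covariance_matrix(sigmas, rho):
    """c_ii = sigma_i^2, c_ij = sigma_i * sigma_j * rho_ij."""
    n = len(sigmas)
    return [[sigmas[i] ** 2 if i == j else sigmas[i] * sigmas[j] * rho[i][j]
             for j in range(n)] for i in range(n)]

def cholesky(c):
    """Lower-triangular L with L * L^T = C (C must be positive definite)."""
    n = len(c)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(c[i][i] - s)
            else:
                low[i][j] = (c[i][j] - s) / low[j][j]
    return low

def correlated_mutation(v, c):
    """Add a sample from N(0, C) to v, using v' = v + L z with z ~ N(0, I)."""
    low = cholesky(c)
    z = [random.gauss(0.0, 1.0) for _ in v]
    return [vi + sum(low[i][k] * z[k] for k in range(i + 1))
            for i, vi in enumerate(v)]

sigmas = [1.0, 2.0]
rho = [[1.0, 0.5], [0.5, 1.0]]
C = covariance_matrix(sigmas, rho)
print(C)  # [[1.0, 1.0], [1.0, 4.0]]
print(correlated_mutation([0.0, 0.0], C))
```
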



# Evolutionary Computation
Evolutionary algorithms are those capable of evolving a population of solutions to a problem, using the principles of natural selection. The population is initialized with random solutions, then the algorithm iterates over the population, selecting the best solutions and using them to generate new solutions. The new solutions are then evaluated and the best ones are selected again. The process is repeated until a stop condition is met.

The basic idea of evolution is the *accumulation* of tiny variations over time. These variations can be obtained through *mutation* and *selection*.
    - variations : mostly random changes in the solution, random process
    - selection : the best solutions are selected to generate new solutions, deterministic process
In evolution the changes are not designed but only evaluated: it's not a random process but a *blind* process that gets its "material" from randomness.
Evolution is not an optimization process: it does not have a goal, nor strength, nor intelligence.
In practice, Evolutionary Algorithms are population-based metaheuristics.
Candidate solution -> Individual
Set of candidate solutions -> Population
Ability to solve our problem -> Fitness
Sequence of steps -> Generation
- Individual : Encodes a *potential solution* for the problem, for example a list of numbers, a string, a tree, a graph, a set of rules, a neural network, etc.
- Parent Selection : we select some individuals and use them as "parents" for the future generation, their "genome" is passed down to new individuals that are re-inserted into the population
- Evaluation and selection : we evaluate our current population based on some heuristics

### Standard Evolutionary Computing Strategies
The first strategies were based on "random" modifications of the candidate solution, which were then evaluated and selected.
Nowadays modern ES algorithms also include recombination and other strategies.
(\mu / \rho, \lambda)-ES
    where \mu is the population size, \lambda is the offspring size (number of new states to be evaluated) and \rho is the number of parents used to generate each new state.
    (1+1) has \lambda = 1 and is the "random" changes strategy, taking only the best solution between the current and the new one
    (1+\lambda) is the steepest step: we take \lambda new states (+ the current) and keep the best among those
    (1,\lambda) is the same but discarding the current solution and taking the best among the new ones
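
A minimal sketch of the general scheme for minimization, assuming averaging recombination of the \rho parents followed by a Gaussian mutation with fixed \sigma. The name `mu_rho_lambda_es`, the `plus` flag and the `sphere` test function are hypothetical.

```python
import random

def mu_rho_lambda_es(f, population, rho=2, lam=20, sigma=0.1,
                     plus=False, generations=100):
    """(mu/rho, lambda)-ES, or (mu/rho + lambda)-ES when plus=True."""
    mu = len(population)                  # comma version needs lam >= mu
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            parents = random.sample(population, rho)
            # Average the rho parents gene by gene, then mutate.
            child = [sum(genes) / rho + random.gauss(0.0, sigma)
                     for genes in zip(*parents)]
            offspring.append(child)
        # "+" keeps the old population in the pool, "," discards it.
        pool = offspring + population if plus else offspring
        population = sorted(pool, key=f)[:mu]
    return population

def sphere(x):
    """Smooth test function (assumed for illustration), minimum at the origin."""
    return sum(xi * xi for xi in x)

random.seed(0)
start = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(5)]
print(mu_rho_lambda_es(sphere, start, plus=True)[0])
```
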
#### Recombination Strategies
*Uniform crossover* : starting from two parents, for each position we randomly take the value from one parent or from the other, generating a new individual, for example:
    P_1 = <v1,v2,...,vn>
    P_2 = <w1,w2,...,wn>
    C_1 = <v1,w2,v3,w4,...,vn>
*Averaging crossover* : we take the average of the two parents and use it as the new individual.
    P_1 = <v1,v2,...,vn>
    P_2 = <w1,w2,...,wn>
    C_1 = <(v1+w1)/2,(v2+w2)/2,...,(vn+wn)/2>
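
Both recombination strategies fit in a few lines. This is a sketch assuming the parents are lists of the same length; the function names are hypothetical.

```python
import random

def uniform_crossover(p1, p2):
    """For each position, take the value from one of the two parents at random."""
    return [random.choice(pair) for pair in zip(p1, p2)]

def averaging_crossover(p1, p2):
    """Each value of the child is the average of the parents' values."""
    return [(a + b) / 2 for a, b in zip(p1, p2)]

print(uniform_crossover(["v1", "v2", "v3"], ["w1", "w2", "w3"]))
print(averaging_crossover([1, 2, 3], [3, 4, 5]))  # [2.0, 3.0, 4.0]
```
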

### Genetic Algorithm
It is the best known evolutionary algorithm, and is able to solve a wide range of problems, like financial prediction, scheduling, data mining, etc.
It's a *population-based* metaheuristic that uses the principles of *natural selection* (an oversimplification of the real biological processes) to evolve a population of solutions to a problem.
The population is initialized with random solutions, then the algorithm iterates over the population, selecting the best solutions and using them to generate new solutions. 
The new solutions are then evaluated and the best ones are selected again. 
The process is repeated until a stop condition is met.
Usually we can have a random parent selection, or a deterministic one (like the best 10% of the population), and the same can be applied to the survival selection. But remember that a bit of randomness is required to obtain good results.
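
The loop described above can be sketched as follows. Every concrete choice here is an assumption made for illustration: bit-string individuals, parents picked as the better of two random individuals, one-point crossover, bit-flip mutation, and the best `elite` individuals surviving unchanged.

```python
import random

def genetic_algorithm(fitness, length=20, pop_size=30, generations=50,
                      mutation_rate=0.05, elite=2):
    """Minimal GA on bit strings (maximization)."""
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_population = ranked[:elite]          # deterministic survival of the best
        while len(next_population) < pop_size:
            # Random parent selection, biased towards fitter individuals.
            p1 = max(random.sample(population, 2), key=fitness)
            p2 = max(random.sample(population, 2), key=fitness)
            cut = random.randint(1, length - 1)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)

random.seed(1)
best = genetic_algorithm(sum)   # fitness = number of ones in the string
print(best, sum(best))
```
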

Classical encodings are *binary strings*, but we can also use other encodings like real numbers, permutations, integers, graphs, etc.
We need to ensure that there is a unique mapping between an individual and a solution (only one fitness for each possible state), and that the solution is feasible. An infeasible solution is one that cannot be mapped into our solution space (for example a negative number in a binary string).

- Individual : is one point in our search space
- Chromosome : is the encoding of the individual (a particular characteristic of the individual)
- Genotype : is the set of chromosomes and the encoding (usually same as Genome)
- Phenotype : is the solution that we obtain from the particular genotype
- Gene : the smallest element that I'm able to tweak using the mutation (genetic operation) over the genotype (for example Right/Left in a maze)
- Locus : the position of the gene in the genotype

### Genetic Operators
They work at the level of the genotype, ignoring everything else:
- Mutation : change a gene in the genotype
- Crossover : combine two genotypes to generate a new one

Survival selection : using the fitness function we select the best individuals, which survive and can be used as parents for the next generation; done this way it is *fully deterministic*.
Parent selection : we select the parents to be used to generate the new individuals, and can be deterministic or random. 
A possible selection is the *roulette wheel selection* where we assign a probability to each individual based on their fitness, and then we select the parents using a random number generator. 
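
A sketch of roulette wheel selection, assuming non-negative fitness values that are not all zero; `roulette_wheel_selection` is a hypothetical name.

```python
import random

def roulette_wheel_selection(population, fitnesses):
    """Pick one individual with probability fitness / total fitness."""
    total = sum(fitnesses)
    pick = random.uniform(0.0, total)     # where the wheel stops
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick < running:
            return individual
    return population[-1]                 # guard against floating-point round-off

random.seed(0)
picks = [roulette_wheel_selection(["a", "b"], [1.0, 3.0]) for _ in range(10000)]
print(picks.count("b") / len(picks))      # close to 3 / (1 + 3)
```

The standard library call `random.choices(population, weights=fitnesses)` performs the same kind of weighted draw.
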

Selective pressure : how much more likely the champion is to be chosen compared to the other individuals. Very high selective pressure (when the best has a much higher chance of being chosen) can lead to premature convergence, while very low selective pressure can lead to very slow convergence.
For the roulette wheel we can use a fitness function based on several "abilities" of the individual (can move, able to take off, able to land, etc...) and then use a weighted sum to compute the fitness.

An alternative is the *tournament selection* where we pick k individuals at random from the population and then select the best one among them. The chance of being selected depends on how the individual ranks against the others rather than on the raw fitness value, and it is demonstrated to be equivalent to a linearized Roulette Wheel but much easier to implement.
In this case the selective pressure grows with the number of individuals in the tournament, which is usually set to 2 or 3 (with an infinitely large tournament we're sure to select only the best individual). We can also use a decimal number for the tournament size: for example 2.8 means that we select 2 individuals with 20% chance and 3 individuals with 80% chance.
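
A sketch of tournament selection with a possibly fractional size, assuming the contestants are drawn without replacement and higher fitness is better; `tournament_selection` is a hypothetical name.

```python
import math
import random

def tournament_selection(population, fitness, size=2.0):
    """Return the best of k random contestants; a size of 2.8 means
    k = 2 with 20% chance and k = 3 with 80% chance."""
    k = math.floor(size)
    if random.random() < size - k:        # fractional part = chance of one extra
        k += 1
    k = max(1, min(k, len(population)))   # keep k valid for random.sample
    contestants = random.sample(population, k)
    return max(contestants, key=fitness)

random.seed(0)
population = list(range(10))              # here the fitness is the value itself
print(tournament_selection(population, fitness=lambda x: x, size=2.8))
```

With `size` equal to the population size every individual enters the tournament, so the best one is always returned.
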

With low selective pressure we have a more *explorative* approach (even weaker individuals will have a chance) while with high selective pressure we have a more *exploitative* approach (select only the best).