# Design paradigm

### 1. Episode length & epoch average reward

- Define an episode step length.
- If accuracy increases in the new generation, increase the epoch by 1.
- If accuracy decreases in the new generation, increase the episode length by the previously defined step length.
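
The two update rules above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Schedule` class, its field names, and the behavior when accuracy is unchanged are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """Hypothetical container for the training schedule."""
    episode_length: int  # current number of steps per episode
    step_size: int       # the predefined amount by which an episode grows
    epoch: int = 0


def update_schedule(schedule, prev_accuracy, new_accuracy):
    """Apply the rules above after a new generation has been evaluated."""
    if new_accuracy > prev_accuracy:
        # Accuracy improved: advance the epoch counter.
        schedule.epoch += 1
    elif new_accuracy < prev_accuracy:
        # Accuracy dropped: lengthen the episode by the predefined size.
        schedule.episode_length += schedule.step_size
    # Equal accuracy is not covered by the notes; this sketch changes nothing.
    return schedule
```

For instance, starting from `Schedule(episode_length=100, step_size=50)`, an improvement bumps `epoch` to 1, and a later regression raises `episode_length` to 150.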


### 2. Incremental Genetic

## Improving Genetic Algorithms with Experience Replay Buffers

To improve genetic algorithms, we use an experience replay buffer that stores the states in which the model took good or bad actions, together with those actions. This lets us analyze and optimize the model's performance based on the rewards those actions received.

### Concept Overview

1. **State-Action Pair Storage**:
   - We categorize experiences into two distributions based on the rewards:
     - **Positive Rewards (Green Distribution)**: States where the model took actions that led to high rewards.
     - **Negative Rewards (Red Distribution)**: States where the model took actions that resulted in low or negative rewards.

2. **Population Creation**:
   - Before evaluating the fitness of a new population, we replay the stored states and their corresponding actions from the buffer. The similarity between each candidate model's actions and the actions stored for those states guides the evolution of the population.
   - In high-reward states, candidates are favored for matching the stored action; in low-reward states, they are favored for differing from it.

3. **Reward Limits**:
   - We establish upper and lower reward limits, starting at zero or the normal reward received during the first step. As the model achieves higher rewards, we push the limits accordingly.
   - By maintaining a buffer at half size, we can eliminate states that overlap, thus pushing the two distributions apart. This helps the model learn from a diverse set of experiences, promoting better generalization.

4. **Variance Reduction**:
   - Over the course of training, the variance of both distributions decreases. The endpoints of the distributions stay the same, but the mass near each mean shifts outward as less desirable experiences are removed from the good buffer and better experiences are retained in the bad buffer.
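
The four points above can be sketched in code. The following is one possible reading, not a definitive implementation: `SplitReplayBuffer`, `replay_score`, the trimming policy (drop the experiences closest to the opposite distribution), and the use of exact action equality as the similarity measure (which presumes discrete actions) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SplitReplayBuffer:
    """Hypothetical buffer that keeps good and bad experiences apart."""
    capacity: int
    upper: float = 0.0  # rewards >= upper go to the good (green) buffer
    lower: float = 0.0  # rewards <= lower go to the bad (red) buffer
    good: list = field(default_factory=list)
    bad: list = field(default_factory=list)

    def add(self, state, action, reward):
        if reward >= self.upper:
            self.good.append((state, action, reward))
        elif reward <= self.lower:
            self.bad.append((state, action, reward))
        # Rewards strictly between the limits are dropped (assumption).
        self._trim()

    def _trim(self):
        # Each side holds at most half the capacity. Dropping the entries
        # nearest the other distribution pushes the reward limits apart.
        half = self.capacity // 2
        if len(self.good) > half:
            self.good.sort(key=lambda e: e[2], reverse=True)
            del self.good[half:]
            self.upper = self.good[-1][2]
        if len(self.bad) > half:
            self.bad.sort(key=lambda e: e[2])
            del self.bad[half:]
            self.lower = self.bad[-1][2]


def replay_score(policy, buffer):
    """Fraction of buffered states where the policy matches a good action
    or avoids a bad one. `policy` maps a state to a discrete action."""
    hits = sum(policy(s) == a for s, a, _ in buffer.good)
    hits += sum(policy(s) != a for s, a, _ in buffer.bad)
    total = len(buffer.good) + len(buffer.bad)
    return hits / total if total else 0.0
```

A score like this could be used to rank or pre-filter candidates before the full fitness evaluation. For continuous actions, a distance-based similarity would have to replace the equality check.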

### Mathematical Representation

- $X_g \sim \mathcal{N}(\mu_g, \sigma_g^2)$: The Gaussian distribution representing the positive experiences, where $\mu_g$ is the mean reward from good experiences and $\sigma_g^2$ is the variance ($\sigma_g$ is the standard deviation).
- $X_r \sim \mathcal{N}(\mu_r, \sigma_r^2)$: The Gaussian distribution representing the negative experiences, where $\mu_r$ is the mean reward from bad experiences and $\sigma_r^2$ is the variance ($\sigma_r$ is the standard deviation).

The overlap between these distributions signifies the states where the model could be confused or exhibit inconsistent performance. Our goal is to maximize the separation between these distributions over time.
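
The text does not fix a specific overlap measure. One common choice for two Gaussians is the Bhattacharyya coefficient, which is 1 for identical distributions and approaches 0 as they separate. The sketch below uses it purely as an illustrative stand-in; the function name is hypothetical.

```python
import math


def bhattacharyya_coefficient(mu_g, sigma_g, mu_r, sigma_r):
    """Overlap in (0, 1] between N(mu_g, sigma_g^2) and N(mu_r, sigma_r^2).

    sigma_g and sigma_r are standard deviations and must be positive.
    """
    var_sum = sigma_g ** 2 + sigma_r ** 2
    # Closed-form Bhattacharyya distance between two univariate Gaussians.
    distance = (
        0.25 * (mu_g - mu_r) ** 2 / var_sum
        + 0.5 * math.log(var_sum / (2.0 * sigma_g * sigma_r))
    )
    return math.exp(-distance)
```

Tracking this value across generations gives a single number for how well the two buffers are being separated: it should fall as the means move apart and the variances shrink.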

### Visualizing the Distributions

We will visualize the three phases of training by plotting the two Gaussian distributions in Python using Matplotlib. The first phase will show overlapping distributions with higher variances, while the third phase will depict reduced variances and increased separation.
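
A minimal plotting sketch follows. The means and standard deviations in `PHASES` are made-up illustrative values, not measured results, and the function names are hypothetical. Matplotlib is imported inside `plot_phases` so that the density helper can be used without it.

```python
import math

# Hypothetical (title, mu_r, sigma_r, mu_g, sigma_g) per phase.
# Illustrative values only: variances shrink and means move apart.
PHASES = [
    ("Phase 1", -1.0, 2.0, 1.0, 2.0),
    ("Phase 2", -2.0, 1.5, 2.0, 1.5),
    ("Phase 3", -3.0, 1.0, 3.0, 1.0),
]


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x (sigma is the standard deviation)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def plot_phases(phases=PHASES, x_min=-8.0, x_max=8.0, n=400):
    """Draw one panel per phase with the red and green reward densities."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    xs = [x_min + i * (x_max - x_min) / (n - 1) for i in range(n)]
    fig, axes = plt.subplots(
        1, len(phases), figsize=(4 * len(phases), 3), sharey=True, squeeze=False
    )
    axes = axes[0]  # squeeze=False returns a 2D array; take the single row
    for ax, (title, mu_r, sigma_r, mu_g, sigma_g) in zip(axes, phases):
        ax.plot(xs, [gaussian_pdf(x, mu_r, sigma_r) for x in xs],
                color="red", label="negative rewards")
        ax.plot(xs, [gaussian_pdf(x, mu_g, sigma_g) for x in xs],
                color="green", label="positive rewards")
        ax.set_title(title)
        ax.set_xlabel("reward")
    axes[0].set_ylabel("density")
    axes[0].legend()
    fig.tight_layout()
    return fig
```

Calling `plot_phases().savefig("phases.png")` writes the three panels to a file. In practice the parameters would be estimated from the rewards stored in each buffer rather than hard-coded.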

### Conclusion

By visualizing the changes in the Gaussian distributions over the phases of training, we can better understand how the experience replay buffer influences the model's learning process. This approach emphasizes the importance of differentiating between good and bad experiences, allowing the model to adapt and improve its decision-making capabilities over time.