# Method of finite differences
The first thing to understand is that ARS is a shallow learning algorithm in which we only approximate gradients, which differs considerably from how we'd calculate gradients in traditional gradient descent.

**Before we'd say that:**  
$\hspace{2cm} \Delta w = -\alpha \dfrac{\partial \epsilon}{\partial w_{ij}}$
$\hspace{5.3cm}$ *Where epsilon = error function, alpha = learning rate and w = weight*  
  
  
**But now:**  
$\hspace{2cm} \Delta w \approx -\alpha \dfrac{
                                                \epsilon(w_{ij} + pert_{ij}) - \epsilon(w_{ij})
                                             }{
                                                 pert_{ij}
                                             }$
$\hspace{2cm}$ Where $pert_{ij}$ = the perturbation applied to the weight connecting the $i^{th}$ neuron to the $j^{th}$ neuron.  

**as**  
$\hspace{2cm} f^\prime(x) = \lim\limits_{h\to 0} \dfrac{f(x + h) - f(x)}{h} \approx \dfrac{f(x + h) - f(x)}{h}$ for small $h$  
  
So as you can see, the gradient of the error function is now estimated through the incremental ratio shown above, where we want the perturbations $(pert_{ij})$ to approach 0.  
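
To make the incremental ratio concrete, here is a minimal sketch in plain Python. The helper name `forward_difference` and the test function `square` are made up for illustration; they are not part of ARS itself. It only shows that the ratio approaches the true derivative as the step `h` shrinks.

```python
def forward_difference(f, x, h=1e-6):
    """Approximate f'(x) with the incremental ratio (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


def square(x):
    """Toy function whose exact derivative is 2 * x."""
    return x * x


# The exact derivative of x**2 at x = 3 is 6.
# A large step gives a rough estimate, a small step a much closer one.
estimate_coarse = forward_difference(square, 3.0, h=0.1)
estimate_fine = forward_difference(square, 3.0, h=1e-6)

print(estimate_coarse, estimate_fine)
```

Note that in floating-point arithmetic `h` cannot be made arbitrarily small, because rounding error in `f(x + h) - f(x)` eventually dominates.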
  
Now suppose we had a perceptron with:  
- 3 input values
- 2 output values
- Therefore 3 x 2 = 6 weights, one for each of the 6 synaptic connections

**Our matrix of weights will then be shaped as:**  

$\begin{bmatrix}
    w_{1,1} & w_{1,2} \\
    w_{2,1} & w_{2,2} \\
    w_{3,1} & w_{3,2} \\
\end{bmatrix}$  
  
**And our matrix of positively perturbed weights will be:**  

$\begin{bmatrix}
    w_{1,1} + \sigma p & w_{1,2} + \sigma p \\
    w_{2,1} + \sigma p & w_{2,2} + \sigma p \\
    w_{3,1} + \sigma p & w_{3,2} + \sigma p \\
\end{bmatrix}$  
  
**With negatively perturbed weights being:**  

$\begin{bmatrix}
    w_{1,1} - \sigma p & w_{1,2} - \sigma p \\
    w_{2,1} - \sigma p & w_{2,2} - \sigma p \\
    w_{3,1} - \sigma p & w_{3,2} - \sigma p \\
\end{bmatrix}$  

*p is a random number between 0 and 1 that perturbs the weights (a separate value is drawn for each weight), and sigma is the exploration noise.*   
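
The matrices above can be sketched in plain Python as follows. This is only an illustration under a few assumptions: the helper names `sample_perturbation` and `perturb` are hypothetical, the weights start at zero, and `sigma = 0.03` is an arbitrary illustrative value. Following the text, each `p` is drawn uniformly between 0 and 1, and the same perturbation matrix is used for the positive and the negative version.

```python
import random


def sample_perturbation(n_rows, n_cols, rng):
    """Draw one random value p in [0, 1) per weight (hypothetical helper)."""
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


def perturb(weights, delta, sigma, sign):
    """Return weights + sign * sigma * delta, computed element by element."""
    return [
        [w + sign * sigma * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weights, delta)
    ]


rng = random.Random(0)                    # fixed seed so the sketch is reproducible
weights = [[0.0, 0.0] for _ in range(3)]  # 3 inputs x 2 outputs (assumed start at 0)
sigma = 0.03                              # exploration noise (illustrative value)

delta = sample_perturbation(3, 2, rng)
weights_pos = perturb(weights, delta, sigma, +1)  # w + sigma * p
weights_neg = perturb(weights, delta, sigma, -1)  # w - sigma * p
```

Because both matrices come from the same `delta`, they sit symmetrically on either side of the unperturbed weights.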
  
>"Parameter noise lets us teach agents tasks much more rapidly than with other approaches. After learning for 20 episodes on the HalfCheetah Gym environment (shown above), the policy achieves a score of around 3,000, whereas a policy trained with traditional action noise only achieves around 1,500." - https://blog.openai.com/better-exploration-with-parameter-noise/

In this program we'll generate 16 instances of positively and negatively perturbed weights. For generalisation purposes we label them a to p, as p is the $16^{th}$ letter of the alphabet. Each matrix of weights on the AI will have its own sample of episodes, which will be averaged out at the end of training.

## Updating weights with method of finite differences
$w = w_{prev} + \alpha ((Reward_{a-pos} - Reward_{a-neg}) \times \delta{_a} \\
              \hspace{1.8cm} + (Reward_{b-pos} - Reward_{b-neg}) \times \delta{_b} \\
              \hspace{1.8cm} + (Reward_{c-pos} - Reward_{c-neg}) \times \delta{_c} \\
              \hspace{1.8cm} \dots \\
              \hspace{1.8cm} + (Reward_{p-pos} - Reward_{p-neg}) \times \delta{_p})$
              
$\delta{_x}$ holds the small values that are added to or subtracted from the weights for perturbation $x$: it's the perturbation matrix.

$\alpha$ is the learning rate divided by the number of perturbations.
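
The update rule above can be sketched in plain Python like this. The function name `ars_update` and the toy rewards and perturbation matrices are made up for illustration, and the sketch follows only the formula in this section, with $\alpha$ taken as the learning rate divided by the number of perturbations. It uses two perturbations instead of 16 to keep the numbers easy to check by hand.

```python
def ars_update(weights, rollouts, learning_rate):
    """Apply w = w_prev + alpha * sum((reward_pos - reward_neg) * delta).

    rollouts is a list of (reward_pos, reward_neg, delta) triples, where
    delta is the perturbation matrix used for that pair of rewards.
    """
    alpha = learning_rate / len(rollouts)    # learning rate / number of perturbations
    new_weights = [row[:] for row in weights]  # copy, so the input is left untouched
    for reward_pos, reward_neg, delta in rollouts:
        coeff = alpha * (reward_pos - reward_neg)
        for i, d_row in enumerate(delta):
            for j, d in enumerate(d_row):
                new_weights[i][j] += coeff * d
    return new_weights


w_prev = [[0.0, 0.0] for _ in range(3)]  # 3 inputs x 2 outputs

# Toy perturbation matrices and rewards (illustrative values only).
delta_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
delta_b = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
rollouts = [
    (10.0, 4.0, delta_a),  # the positive direction of a did better
    (2.0, 5.0, delta_b),   # the negative direction of b did better
]

w_new = ars_update(w_prev, rollouts, learning_rate=0.02)
```

Here the weights move along `delta_a` (its positive version earned more reward) and against `delta_b` (its negative version earned more reward).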

So we can think of the first term (leaving out $\alpha$) like this:  
$w = w_{prev} + ((Reward_{a-pos} \times \delta{_a}) - (Reward_{a-neg} \times \delta{_a}))$

Expanding out the expression, we can think of each $Reward_{x-pos}$ or $Reward_{x-neg}$ as a coefficient on the perturbation matrix $\delta{_x}$. This weighting is what keeps the update to $w_{prev}$ from collapsing to zero, because if it was the case that:

$w = w_{prev} + ((a_{pos} - a_{neg}) \times \delta{_a})$

We'd have an issue whenever $a_{pos}$ and $a_{neg}$ are equal (for example, if both directions were given the same fixed coefficient): the two would cancel out and the weights would not move. By doing it the way initially portrayed above and looking at the rewards obtained under the different perturbations, we can move the new weights in the direction of the better reward, as $((Reward_{a-pos} \times \delta{_a}) - (Reward_{a-neg} \times \delta{_a}))$ is the perturbation matrix scaled by the reward difference, giving a value with both magnitude and direction.