## ARS

Augmented Random Search (ARS) can train up to 15 times faster than other reinforcement learning algorithms while reaching higher rewards in specific applications. That's a huge difference!

One reason ARS can be so much faster is that, unlike many reinforcement learning algorithms that use deep neural networks with many hidden layers, it uses a perceptron: a single layer of weights that maps states directly to actions. There are far fewer weights to adjust and learn, yet ARS still manages to get higher rewards in specific applications!

So, higher rewards AND faster training time!

### Method of Finite Differences
The goal of reinforcement learning is to get an agent to learn how to perform a task. Deep reinforcement learning algorithms use gradient descent to optimize their weights and reduce a cost function. ARS instead uses the method of finite differences to adjust its weights.

#### The Process
To update the weights effectively, the agent takes a random matrix of tiny values and adds it to the weights. It then takes the exact same matrix and subtracts it from the weights, which gives a mirrored pair of weight configurations. It repeats this for many random matrices, and the end result is an agent trying to perform the task with many slightly different sets of weights.

![Title](../docs/images/ARS/MethodOfFiniteDifference.png)
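
The perturbation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's training code: the shapes, the number of directions, and the names `candidates`, `plus` and `minus` are assumptions made for the example, while `noise` plays the same role as `hp.noise` further down.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.zeros((2, 3))   # current weights: 2 outputs, 3 inputs (assumed sizes)
noise = 0.03               # scale of the perturbation, same role as hp.noise
nb_directions = 4          # number of random directions to try (assumed)

# One random matrix per direction, each with the same shape as theta
deltas = [rng.standard_normal(theta.shape) for _ in range(nb_directions)]

# Each direction yields a mirrored pair of candidate weight matrices
candidates = [(theta + noise * d, theta - noise * d) for d in deltas]

for plus, minus in candidates:
    # The pair is symmetric around theta, so its average is theta itself
    assert np.allclose((plus + minus) / 2, theta)
```

Each pair is then tried in the environment, and the two rewards it earns are what the update rule compares.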

We get a reward from the environment for each configuration of weights, and some rewards are higher than others. ARS then adjusts the weights according to the configurations that gave the best rewards. The higher the reward, the more the weights are adjusted in that direction. The lower the reward, the less they are adjusted.

The following equation computes this adjustment.

![Title](../docs/images/ARS/MethodOfFiniteDifference_Equation.png)

This picture shows 4 different weight configurations. The coefficient of each one is the difference between the reward of its positive configuration and the reward of its negative configuration. The greater the difference between the two rewards, the bigger the coefficient will be for that specific configuration of weights, meaning it will influence the weights more. If the negative configuration earned the higher reward, the coefficient is negative and the weights move in the opposite direction.
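
As a small worked example, here is the same update computed by hand for two directions. The rewards and directions are made-up numbers chosen for illustration. The update line mirrors the `update` method defined later in this notebook, which also divides by the number of directions and by the standard deviation of the rewards.

```python
import numpy as np

theta = np.zeros((1, 2))
learning_rate = 0.02

# Hypothetical directions and the episode rewards earned with +delta and -delta
deltas = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]
r_pos = [10.0, 4.0]
r_neg = [2.0, 5.0]

# Standard deviation of all collected rewards, used to scale the step
sigma_r = np.array(r_pos + r_neg).std()

step = np.zeros_like(theta)
for rp, rn, d in zip(r_pos, r_neg, deltas):
    step += (rp - rn) * d   # the coefficient is the reward difference

theta += learning_rate / (len(deltas) * sigma_r) * step
print(step, theta)
```

The first direction has a reward difference of 8 and the second a difference of -1, so the first weight moves up eight times as far as the second weight moves down.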

One additional modification to the above equation is that the researchers discard low-reward directions immediately. Only the top k directions, ranked by the higher of their two rewards, are used in the update. Intuitively this makes sense because why would we keep pursuing and experimenting with weights that give a low reward? By taking them out, we save time and computational power!
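
A minimal sketch of that top-k selection, using made-up rewards for four directions. It follows the same ranking idea as the `train` function later in this notebook (score each direction by `max(r_pos, r_neg)`), but the numbers and the choice of k here are assumptions.

```python
positive_rewards = [10.0, 4.0, 1.0, 7.0]   # hypothetical rewards for +delta
negative_rewards = [2.0, 5.0, 0.5, 9.0]    # hypothetical rewards for -delta
nb_best_directions = 2                      # k, assumed for this example

# Score each direction by the better of its two rewards
scores = [max(rp, rn) for rp, rn in zip(positive_rewards, negative_rewards)]

# Indices of the directions, best first, truncated to the top k
order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
order = order[:nb_best_directions]
print(order)
```

Note that with the hyperparameters used in this notebook, `nb_best_directions` equals `nb_directions` (both are 16), so no direction is actually dropped unless you lower `nb_best_directions`.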

ARS also does things a little bit differently than other algorithms by exploring policy spaces instead of action spaces. Basically, this means that **instead of analyzing the rewards it gets after each action, it analyzes the reward after a series of actions** to determine if that set of actions led to a higher reward.
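
To make this concrete, the sketch below scores a set of weights by the total reward of a whole episode rather than by any single action. `ToyEnv` is a hypothetical stand-in written for this example (it is not a Gym environment and its `step` returns only three values), and `episode_return` is an illustrative helper name.

```python
import numpy as np

class ToyEnv:
    """Hypothetical 1-D environment: the closer the state is to 0, the better."""

    def reset(self):
        self.state = np.array([1.0])
        self.t = 0
        return self.state

    def step(self, action):
        self.state = self.state + action
        self.t += 1
        reward = -abs(float(self.state[0]))
        done = self.t >= 5          # every episode lasts 5 steps
        return self.state, reward, done

def episode_return(env, theta):
    # Run one full episode with fixed weights and add up all the rewards
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = theta.dot(state)   # linear policy, as in the Policy class below
        state, reward, done = env.step(action)
        total += reward
    return total

good = episode_return(ToyEnv(), np.array([[-0.5]]))  # pulls the state toward 0
bad = episode_return(ToyEnv(), np.array([[0.5]]))    # pushes the state away
print(good, bad)
```

Only the two totals are compared, so the weights that steer the state toward 0 win even though no individual action was ever judged on its own.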

#### To sum up the main ideas:
1. ARS uses a perceptron instead of a deep neural network.
2. ARS adds tiny random values to the weights, and also subtracts those same values, to figure out which direction helps the agent get a bigger reward.
3. The bigger the reward from a specific weight configuration, the bigger its influence on the adjustment of the weights.  
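
The three ideas above can be put together in a toy end-to-end loop. This is a rough sketch under strong assumptions: instead of running episodes in an environment, the "reward" is a made-up function that is highest when the weights match a hypothetical `target`, and the hyperparameter values are chosen for the example. The small constant added to `sigma_r` is a safeguard added here and is not part of the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([[1.0, -2.0, 0.5]])   # hypothetical "ideal" weights

def reward(theta):
    # Toy stand-in for an episode's total reward: higher when theta is near target
    return -np.sum((theta - target) ** 2)

theta = np.zeros_like(target)           # idea 1: a single layer of weights
learning_rate, noise, nb_directions, nb_best = 0.02, 0.03, 8, 4

for _ in range(300):
    # Idea 2: try each random direction with a plus sign and a minus sign
    deltas = [rng.standard_normal(theta.shape) for _ in range(nb_directions)]
    r_pos = [reward(theta + noise * d) for d in deltas]
    r_neg = [reward(theta - noise * d) for d in deltas]
    sigma_r = np.std(r_pos + r_neg) + 1e-8   # avoid dividing by zero

    # Keep only the best directions, ranked by the higher of their two rewards
    best = sorted(range(nb_directions),
                  key=lambda k: max(r_pos[k], r_neg[k]), reverse=True)[:nb_best]

    # Idea 3: a bigger reward difference means a bigger influence on the update
    step = sum((r_pos[k] - r_neg[k]) * deltas[k] for k in best)
    theta += learning_rate / (nb_best * sigma_r) * step

print(theta, reward(theta))
```

After the loop, `theta` should sit much closer to `target` than the all-zero starting weights did, without any gradient ever being computed.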

All in all, this is an incredible reinforcement learning algorithm and the results are amazing!

## Importing libraries

In [None]:
# Importing the libraries
import os
import numpy as np
import gym
from gym import wrappers
import pybullet_envs

## Setting Configuration

In [None]:
# Setting the Hyper Parameters
class Hp():
    '''
    class to set Hyper Parameters
    '''
    def __init__(self):
        self.nb_steps = 1000
        self.episode_length = 1000
        self.learning_rate = 0.02
        self.nb_directions = 16
        self.nb_best_directions = 16
        assert self.nb_best_directions <= self.nb_directions
        self.noise = 0.03
        self.seed = 1
        self.env_name = 'HalfCheetahBulletEnv-v0'

## Normalizing the states

In [None]:
# Normalizing the states
class Normalizer():
    '''
    class to normalize the state
    '''
    def __init__(self, nb_inputs):
        self.n = np.zeros(nb_inputs)
        self.mean = np.zeros(nb_inputs)
        self.mean_diff = np.zeros(nb_inputs)
        self.var = np.zeros(nb_inputs)
    
    def observe(self, x):
        self.n += 1.
        last_mean = self.mean.copy()
        self.mean += (x - self.mean) / self.n
        self.mean_diff += (x - last_mean) * (x - self.mean)
        self.var = (self.mean_diff / self.n).clip(min = 1e-2)
    
    def normalize(self, inputs):
        obs_mean = self.mean
        obs_std = np.sqrt(self.var)
        return (inputs - obs_mean) / obs_std
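
The running-statistics update in `observe` can be checked against NumPy on a batch of made-up observations. The sketch below repeats the same arithmetic as the class above in standalone form, so that it runs on its own. The sample data and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # hypothetical states

n = 0
mean = np.zeros(3)
mean_diff = np.zeros(3)
for x in samples:
    # Same incremental update as Normalizer.observe
    n += 1
    last_mean = mean.copy()
    mean += (x - mean) / n
    mean_diff += (x - last_mean) * (x - mean)
var = (mean_diff / n).clip(min=1e-2)

# The running values match the batch mean and (population) variance
assert np.allclose(mean, samples.mean(axis=0))
assert np.allclose(var, samples.var(axis=0))
```

The point of updating incrementally is that the agent never has to store past states. The clip at `1e-2` keeps the standard deviation away from zero, so `normalize` does not divide by a value close to 0 for inputs that barely change.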

## Building the AI

In [None]:
# Building the AI
class Policy():
    
    def __init__(self, input_size, output_size):
        self.theta = np.zeros((output_size, input_size))
    
    def evaluate(self, input, delta = None, direction = None):
        if direction is None:
            return self.theta.dot(input)
        elif direction == "positive":
            return (self.theta + hp.noise*delta).dot(input)
        else:
            return (self.theta - hp.noise*delta).dot(input)
    
    def sample_deltas(self):
        return [np.random.randn(*self.theta.shape) for _ in range(hp.nb_directions)]
    
    def update(self, rollouts, sigma_r):
        '''
        Implements the finite-difference update equation shown above: each direction is
        weighted by the difference between its positive and negative rewards, so the
        directions that changed the reward the most move the weights the most.
        '''
        step = np.zeros(self.theta.shape)
        for r_pos, r_neg, d in rollouts:
            step += (r_pos - r_neg) * d
        self.theta += hp.learning_rate / (hp.nb_best_directions * sigma_r) * step


## Additional Functions

In [None]:
# Exploring the policy on one specific direction and over one episode
def explore(env, normalizer, policy, direction = None, delta = None):
    state = env.reset()
    done = False
    num_plays = 0.
    sum_rewards = 0
    while not done and num_plays < hp.episode_length:
        normalizer.observe(state)
        state = normalizer.normalize(state)
        action = policy.evaluate(state, delta, direction)
        state, reward, done, _ = env.step(action)
        reward = max(min(reward, 1), -1)
        sum_rewards += reward
        num_plays += 1
    return sum_rewards

# Training the AI
def train(env, policy, normalizer, hp):
    
    for step in range(hp.nb_steps):
        
        # Initializing the perturbations deltas and the positive/negative rewards
        deltas = policy.sample_deltas()
        positive_rewards = [0] * hp.nb_directions
        negative_rewards = [0] * hp.nb_directions
        
        # Getting the positive rewards in the positive directions
        for k in range(hp.nb_directions):
            positive_rewards[k] = explore(env, normalizer, policy, direction = "positive", delta = deltas[k])
        
        # Getting the negative rewards in the negative/opposite directions
        for k in range(hp.nb_directions):
            negative_rewards[k] = explore(env, normalizer, policy, direction = "negative", delta = deltas[k])
        
        # The two loops above run one episode for each positive and each negative
        # configuration of the weights to find out which gives the higher reward.
        # The only difference between them is the direction argument: "positive"
        # evaluates the policy with theta + noise * delta, and "negative" evaluates
        # it with theta - noise * delta.
        
        # Gathering all the positive/negative rewards to compute the standard deviation of these rewards
        all_rewards = np.array(positive_rewards + negative_rewards)
        sigma_r = all_rewards.std()
        
        # Sorting the rollouts by the max(r_pos, r_neg) and selecting the best directions
        scores = {k:max(r_pos, r_neg) for k,(r_pos,r_neg) in enumerate(zip(positive_rewards, negative_rewards))}
        order = sorted(scores.keys(), key = lambda x:scores[x], reverse = True)[:hp.nb_best_directions]
        rollouts = [(positive_rewards[k], negative_rewards[k], deltas[k]) for k in order]
        
        # Updating our policy
        policy.update(rollouts, sigma_r)
        
        # Printing the final reward of the policy after the update
        reward_evaluation = explore(env, normalizer, policy)
        print('Step:', step, 'Reward:', reward_evaluation)

# Creating a directory if it does not already exist and returning its path
def mkdir(base, name):
    path = os.path.join(base, name)
    if not os.path.exists(path):
        os.makedirs(path)
    return path


## Main Function

In [None]:
work_dir = mkdir('exp', 'brs')
monitor_dir = mkdir(work_dir, 'monitor')

hp = Hp()
np.random.seed(hp.seed)
env = gym.make(hp.env_name)
env = wrappers.Monitor(env, monitor_dir, force = True)
nb_inputs = env.observation_space.shape[0]
nb_outputs = env.action_space.shape[0]
policy = Policy(nb_inputs, nb_outputs)
normalizer = Normalizer(nb_inputs)
train(env, policy, normalizer, hp)