# Direct Policy Search: a reinforcement learning algorithm that trains a model to bring a square in an image to its centre


Author: Pooja BELURE

In [1]:
import sys
import gym
import numpy as np

## 1. Discrete Action Spaces

We implemented a direct policy search algorithm using black-box optimization algorithms (evolutionary computation with CMA-ES, or differential evolution via `scipy.optimize`). The approach is model-free: the agent never builds a model of the environment dynamics and only observes the outcome of running a policy.
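
To make the idea of black-box optimization concrete, here is a minimal sketch of an evolution strategy written with the standard library only. It is a deliberately simplified stand-in, not CMA-ES: there is no covariance adaptation, and the step size follows a fixed decay chosen purely for illustration. The names `evolution_strategy` and `sphere` are hypothetical and do not appear elsewhere in this notebook.

```python
import random

def evolution_strategy(objective, x0, sigma=0.5, popsize=8, n_parents=2,
                       iterations=60, seed=0):
    """Minimise `objective` with a basic (mu, lambda) evolution strategy."""
    rng = random.Random(seed)
    mean = list(x0)
    best_x, best_f = list(x0), objective(x0)
    for _ in range(iterations):
        # Ask: sample candidate parameter vectors around the current mean.
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(popsize)]
        # Evaluate: the optimizer only ever sees one scalar loss per candidate.
        scored = sorted((objective(c), c) for c in candidates)
        # Tell: move the mean to the average of the best candidates.
        parents = [c for _, c in scored[:n_parents]]
        mean = [sum(vals) / n_parents for vals in zip(*parents)]
        if scored[0][0] < best_f:
            best_f, best_x = scored[0]
        sigma *= 0.95  # fixed decay (assumption; CMA-ES adapts sigma itself)
    return best_x, best_f

def sphere(x):
    """Toy loss with its minimum at (1, 1, 1, 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)

best_x, best_f = evolution_strategy(sphere, [0.0, 0.0, 0.0, 0.0])
print(best_x, best_f)
```

In direct policy search, `objective` would run one or more episodes with the candidate policy parameters and return a loss, which is why no gradient or model of the environment is required.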

To train the agent efficiently, we must:
* Define the search space (the policy space we are willing to search)
* Define the objective function used to assess a policy (episode-based or step-based)
* Optimize the objective with a black-box optimizer (CMA-ES: https://pypi.org/project/cma/ ; differential evolution: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html)
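
The first two steps can be sketched on a toy problem. The task below (move a point to the centre of a line with 21 cells) is hypothetical and is not the environment used in this notebook; `policy`, `episode_loss` and `objective` are illustrative names. The search space is a vector of four weights, and the objective is episode-based: it scores a whole rollout rather than individual steps.

```python
CENTRE, SIZE, MAX_STEPS = 10, 21, 50  # hypothetical toy task

def policy(theta, position):
    """Search space: theta in R^4, read row-major as a 2x2 matrix (features x actions)."""
    features = [(position - CENTRE) / CENTRE, 1.0]  # normalised offset and a bias term
    scores = [features[0] * theta[a] + features[1] * theta[2 + a] for a in range(2)]
    return 0 if scores[0] >= scores[1] else 1  # 0 = move left, 1 = move right

def episode_loss(theta, start):
    """Episode-based objective: number of steps needed to reach the centre."""
    position = start
    for step in range(MAX_STEPS):
        if position == CENTRE:
            return step
        move = -1 if policy(theta, position) == 0 else 1
        position = min(SIZE - 1, max(0, position + move))
    return MAX_STEPS  # the centre was never reached within the step budget

def objective(theta, starts=(0, 5, 15, 20)):
    """Average the episode loss over several start positions."""
    return sum(episode_loss(theta, s) for s in starts) / len(starts)

print(objective([1.0, -1.0, 0.0, 0.0]))   # moves towards the centre: 7.5
print(objective([-1.0, 1.0, 0.0, 0.0]))   # moves away from the centre: 50.0
```

A function such as `objective` is what would be handed to the black-box optimizer in the third step. Averaging over several start positions makes the loss less dependent on a single lucky episode.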

Complete the `Agent` class:
1. `train` method: for optimizing the objective function to get optimal policy
2. `act` method: use optimal policy to output action for each state


In [2]:
import cma

class Agent:
    def __init__(self):
        """
        Init a new agent.
        """
        pass

    def train(self):
        """
        Learn your policy.

        Possible action: [0, 1, 2, 3]
        
        """
        # 1- Define state features [position, velocity]
        # 2- Define search space (to define a policy) -> policy: state (R^2) -> action
        # 3- Define objective function (for policy evaluation)
        # 4- Optimize the objective function
        # 5- Save optimal policy

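        # Example of an episode-based objective: the loss is the number of
        # steps the policy needs before the episode ends.
        def objective_function(W):
            steps = 0
            env = gym.make("Pendulum-v0")
            env.seed(np.random.randint(1000))
            state = env.reset()
            done = False
            while not done:
                # Linear policy: index of the largest entry of state * W
                action = np.argmax(np.dot(state, W.reshape(2, 2)))
                state, reward, done, info = env.step(action)
                if reward == 1:
                    print("------------>", state, action)
                steps += 1
            env.close()
            return steps  # loss: fewer steps is better

        # Minimize the loss with CMA-ES, starting from zero weights (sigma0 = 0.5)
        self.policy_opt, _ = cma.fmin2(objective_function, np.zeros(4), 0.5, restarts=5)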

        
    def act(self, state):
        """
        Acts given an observation of the environment (using learned policy).

        Takes as argument an observation of the current state, and
        returns the chosen action.
        See environment documentation: https://github.com/openai/gym/wiki/MountainCar-v0
        Possible action: [0, 1, 2, 3]
        
        """
        return np.argmax(np.dot(state,self.policy_opt.reshape(2,2)))
        #return np.random.choice([0, 1, 2, 3])
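
A note on how the action rule in `act` produces four actions from a 2x2 weight matrix. The sketch below assumes the observation is a single number, as the integer states printed during training suggest; the weight vector `w` is hypothetical. With a scalar state, `np.dot` simply scales the matrix, and `np.argmax` returns an index into the flattened result, so the action lies in `[0, 1, 2, 3]`.

```python
import numpy as np

def act(state, w):
    # Same rule as Agent.act: scale the 2x2 weights by the scalar state and
    # return the flat index (0..3) of the largest entry.
    return int(np.argmax(np.dot(state, w.reshape(2, 2))))

w = np.array([0.2, -1.0, 0.7, 0.1])  # hypothetical weights, not a trained policy

actions = [act(s, w) for s in (627, 659, 691)]
print(actions)      # [2, 2, 2]: the largest weight wins for every positive state
print(act(-5, w))   # 1: a negative state flips the ordering, so the smallest weight wins
```

Under this assumption, multiplying by a positive state never changes which entry is largest, so the policy picks the same action for every positive observation. A state-dependent policy would likely need additional features, for example a bias term or separate position features.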

In [3]:
agent = Agent()
agent.train()

(4_w,8)-aCMA-ES (mu_w=2.6,w_1=52%) in dimension 4 (seed=158166, Sun Jan 31 17:11:18 2021)
target reached
------------> 628 0
Iterat #Fevals   function value  axis ratio  sigma  min&max std  t[m:s]
    1      8 4.000000000000000e+00 1.0e+00 4.40e-01  4e-01  4e-01 0:00.2
    2     16 2.000000000000000e+02 1.4e+00 5.34e-01  4e-01  7e-01 0:00.2
target reached
------------> 659 3
    3     24 9.000000000000000e+00 1.8e+00 5.62e-01  4e-01  7e-01 0:00.3
target reached
------------> 659 3
target reached
------------> 627 3
target reached
------------> 629 1
target reached
------------> 628 0
target reached
------------> 661 3
target reached
------------> 691 1
   10     80 2.000000000000000e+02 3.7e+00 6.87e-01  5e-01  1e+00 0:00.5
termination on tolflatfitness=1 (Sun Jan 31 17:11:18 2021)
final/bestever f-value = 2.000000e+02 1.000000e+00
incumbent solution: [-1.2750588875592592, 0.03337001739146628, -0.4202057415283743, -0.5167662946012762]
std deviation: [0.6939788568252988, 0.5190511098706

        geno-pheno transformation introduced based on the
        current covariance matrix with condition 1.1e+12 -> 1.0e+00,
        injected solutions become "invalid" in this iteration (class=CMAEvolutionStrategy method=alleviate_conditioning iteration=412)


target reached
------------> 629 3
target reached
------------> 691 3
target reached
------------> 659 3
target reached
------------> 627 3
target reached
------------> 693 3
target reached
------------> 627 3
target reached
------------> 659 3
target reached
------------> 659 3
target reached
------------> 659 3
target reached
------------> 691 3
target reached
------------> 627 3
target reached
------------> 627 3
target reached
------------> 691 3
target reached
------------> 627 3
target reached
------------> 661 3
target reached
------------> 659 3
target reached
------------> 691 3
target reached
------------> 627 3
target reached
------------> 627 3
target reached
------------> 661 3
target reached
------------> 691 3
target reached
------------> 691 3
target reached
------------> 660 3
target reached
------------> 691 3
target reached
------------> 659 3
target reached
------------> 691 3
target reached
------------> 691 3
target reached
------------> 691 3
target reached
-----

### Testing

Run simulation to test your trained agent.

In [4]:
niter = 5000

In [19]:
env = gym.make("Pendulum-v0").env
env.seed(np.random.randint(1, 1000))
env.reset()
try:
    for i in range(1, niter + 1):
        sys.stdout.flush()
        action = agent.act(env.state)
        state, reward, done, info = env.step(action)
        print("state", state, "reward", reward, "action", action, "niter", i)

        # update the visualization
        env.render()
        print("here1")

        # check whether the state lies in the target range
        if 561 <= state <= 759:
            print("\rTop reached at t = {}".format(i))
            break
        elif i == niter:
            print("\rFailed to reach the top")
finally:
    print("close env")
    

state 740 reward 0 action 2 niter 1
x,y 115 20
here1
Top reached at t = 
close env


In [10]:
env.close()