# [M2-AI-Univ. Paris Saclay] Direct Policy Search

In this practical, you will put into practice what you have just learnt
about direct policy search.


In this project, you are asked to solve the classic Mountain Car problem (https://gym.openai.com/envs/MountainCar-v0/). For more details about the action and observation spaces, please refer to the OpenAI
documentation here: https://github.com/openai/gym/wiki/MountainCar-v0

In [None]:
import sys
import gym
import numpy as np

## 1. Discrete Action Spaces

You are expected to implement a direct policy search algorithm using black-box optimization algorithms (evolutionary computation: CMA-ES; differential evolution: scipy.optimize). This is a model-free setting: the optimizer only sees the outcome of running a policy in the environment.
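
As a sketch of what is being optimized (the symbols $\theta$, $\pi_\theta$, $J$, $r_t$ and $T$ are notation assumed here, not taken from the course slides), direct policy search treats the expected return of a parameterized policy as a black-box function of its parameters:

$$
\theta^{*} = \arg\max_{\theta} J(\theta), \qquad J(\theta) = \mathbb{E}\left[\sum_{t=1}^{T} r_t \;\middle|\; \pi_\theta\right]
$$

In MountainCar-v0 the reward is $r_t = -1$ at every step, so $J(\theta) = -\mathbb{E}[T]$ and the quantity to minimize is the expected episode length:

$$
\mathcal{L}(\theta) = -J(\theta) = \mathbb{E}[T]
$$

An optimizer such as CMA-ES only needs (noisy) evaluations of $\mathcal{L}(\theta)$, for example the length of a single episode; no gradient or model of the dynamics is required.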

In order to train your agent efficiently, you must (ref. page 58; Michèle's slides):
* Define your search space (the policy space you are willing to search in)
* Define your objective function, which assesses a policy (episode-based or step-based)
* Optimize the objective using a black-box optimizer (CMA-ES: use https://pypi.org/project/cma/ ; differential evolution: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html)
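
The first two steps above can be sketched without `gym`, assuming a simplified re-implementation of the Mountain Car dynamics in plain Python. The function names (`step`, `policy`, `episode_cost`) are hypothetical, the state is used unscaled, and the physical constants are written from memory of the environment, so treat this as an illustration rather than a replacement for the real environment.

```python
import math
import random


def step(position, velocity, action):
    """One simplified Mountain Car transition; action is 0 (left), 1 (idle) or 2 (right)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position = max(-1.2, min(0.6, position + velocity))
    if position == -1.2 and velocity < 0:
        velocity = 0.0  # the car stops when it hits the left wall
    return position, velocity


def policy(weights, position, velocity):
    """Search space: 6 weights read as a 2x3 matrix (features x actions)."""
    scores = [weights[a] * position + weights[3 + a] * velocity for a in range(3)]
    # Greedy choice; ties resolve to the lowest action index
    return max(range(3), key=lambda a: scores[a])


def episode_cost(weights, seed=0, max_steps=200):
    """Episode-based objective: number of steps until the goal (lower is better)."""
    rng = random.Random(seed)
    position, velocity = rng.uniform(-0.6, -0.4), 0.0
    for t in range(1, max_steps + 1):
        action = policy(weights, position, velocity)
        position, velocity = step(position, velocity, action)
        if position >= 0.5:
            return t
    return max_steps  # goal not reached within the step budget


# Hand-written weights that push in the direction of the current velocity
energy_pumping = [0, 0, 0, -1, 0, 1]
print(episode_cost(energy_pumping, max_steps=1000))
# All-zero weights always select action 0 (push left)
print(episode_cost([0.0] * 6))
```

Any black-box optimizer can then be pointed at `episode_cost`, since it maps a flat weight vector to a single scalar.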

Complete the `Agent` class:
1. `train` method: optimizes the objective function to obtain the optimal policy
2. `act` method: uses the optimal policy to output an action for each state
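
To illustrate the split between `train` and `act`, here is a minimal sketch that swaps CMA-ES for a much simpler (1+1) evolution strategy with a fixed step size. The class name `HillClimbAgent`, its parameters, and the toy quadratic objective are all hypothetical; in the practical, the objective would run episodes in the environment instead.

```python
import random


class HillClimbAgent:
    """Hypothetical agent: train() stores the best parameters, act() uses them."""

    def __init__(self, dim, sigma=0.5, seed=0):
        self.rng = random.Random(seed)
        self.sigma = sigma              # fixed mutation step size (not adapted)
        self.policy_opt = [0.0] * dim   # flat weight vector

    def train(self, objective, iterations=200):
        """(1+1)-ES: keep a single parent, accept a mutant if it is not worse."""
        best = list(self.policy_opt)
        best_cost = objective(best)
        for _ in range(iterations):
            candidate = [w + self.rng.gauss(0.0, self.sigma) for w in best]
            cost = objective(candidate)
            if cost <= best_cost:
                best, best_cost = candidate, cost
        self.policy_opt = best
        return best_cost

    def act(self, state):
        """Greedy action from linear scores (weights laid out features x actions)."""
        n_actions = len(self.policy_opt) // len(state)
        scores = [
            sum(s * self.policy_opt[i * n_actions + a] for i, s in enumerate(state))
            for a in range(n_actions)
        ]
        return max(range(n_actions), key=lambda a: scores[a])


def toy_objective(weights):
    """Stand-in for an episode-based loss: squared distance to the all-ones vector."""
    return sum((w - 1.0) ** 2 for w in weights)


agent_sketch = HillClimbAgent(dim=6)
final_cost = agent_sketch.train(toy_objective)
print(final_cost, agent_sketch.act([-0.5, 0.01]))
```

Unlike CMA-ES, this loop adapts neither the step size nor a covariance matrix, so it is only meant to show where the optimizer plugs into the agent.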


In [None]:
## Your import ?
import cma
import sklearn
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler


class Agent:
    def __init__(self):
        """
        Init a new agent.
        """
        
        self.scaler = StandardScaler()

        init_samples = np.array([[np.random.uniform(-1.2, 0.6), np.random.uniform(-0.07, 0.07)] for _ in range(10000)])
        self.scaler.fit(init_samples)
        
    def preprocessing(self, state):
        """
        Returns the featurized representation for a state.
        """
        return self.scaler.transform([state])[0]
    

    def train(self):
        """
        Learn your policy.

        Possible action: [0, 1, 2]
        Range observation (tuple):
            - position: [-1.2, 0.6]
            - velocity: [-0.07, 0.07]
        """
        # 1- Define state features
        # 2- Define search space (to define a policy)
        # 3- Define objective function (for policy evaluation)
        # 4- Optimize the objective function
        # 5- Save optimal policy

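        # Example: episode-based objective with a linear policy (2 features x 3 actions)
        def objective_function(W):
            total = 0
            env = gym.make("MountainCar-v0")
            env.seed(np.random.randint(1000))
            state = self.preprocessing(env.reset())
            done = False
            while not done:
                # Greedy action: index of the highest linear score
                action = np.argmax(np.dot(state, W.reshape(2, 3)))
                state, reward, done, info = env.step(action)
                state = self.preprocessing(state)
                total += -1  # same as the environment reward: -1 per step
            env.close()
            return -total  # loss = episode length (to be minimized)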
        
        self.policy_opt, _ = cma.fmin2(objective_function, np.zeros(6), 0.5, restarts = 6)
        
        
    def act(self, state):
        """
        Acts given an observation of the environment (using learned policy).

        Takes as argument an observation of the current state, and
        returns the chosen action.
        See environment documentation: https://github.com/openai/gym/wiki/MountainCar-v0
        Possible action: [0, 1, 2]
        Range observation (tuple):
            - position: [-1.2, 0.6]
            - velocity: [-0.07, 0.07]
        """
        state = self.preprocessing(state)
        return np.argmax(np.dot(state,self.policy_opt.reshape(2,3)))

In [None]:
agent = Agent()
agent.train()

(4_w,9)-aCMA-ES (mu_w=2.8,w_1=49%) in dimension 6 (seed=657423, Wed Dec  9 21:55:14 2020)
Iterat #Fevals   function value  axis ratio  sigma  min&max std  t[m:s]
    1      9 2.000000000000000e+02 1.0e+00 4.35e-01  4e-01  4e-01 0:00.3
termination on tolfun=1e-11 (Wed Dec  9 21:55:15 2020)
final/bestever f-value = 2.000000e+02 2.000000e+02
incumbent solution: [0.010813019890115894, 0.11754680172913808, 0.10295386772446072, 0.08364592535579597, -0.34214063488013957, 0.2255111897433251]
std deviation: [0.4344329132295457, 0.39575993093355144, 0.44891544552094487, 0.4342984893115924, 0.41402375082637516, 0.43728485418855884]
(9_w,18)-aCMA-ES (mu_w=5.4,w_1=30%) in dimension 6 (seed=657424, Wed Dec  9 21:55:15 2020)
Iterat #Fevals   function value  axis ratio  sigma  min&max std  t[m:s]
    1     28 1.130000000000000e+02 1.0e+00 4.87e-01  5e-01  5e-01 0:00.5
    2     46 1.680000000000000e+02 1.4e+00 5.15e-01  5e-01  5e-01 0:00.9
    3     64 1.150000000000000e+02 1.5e+00 6.21e-01  6e-01  7e

### Testing

Run simulation to test your trained agent.

In [None]:
niter = 5000

In [None]:
env = gym.make("MountainCar-v0").env
env.seed(np.random.randint(1, 1000))
env.reset()

try:
    for t in range(1, niter + 1):
        sys.stdout.flush()
        action = agent.act(env.state)
        state, reward, done, info = env.step(action)

        # update the visualization
        env.render()

        # check whether the goal position has been reached
        if state[0] >= 0.5:
            print("\rTop reached at t = {}".format(t))
            break
        elif t == niter:
            print("\rFailed to reach the top")
finally:
    env.close()

Top reached at t = 87


## 2. Continuous Action Spaces

Unlike in MountainCar-v0, the action (the engine force applied) in MountainCarContinuous-v0 is a continuous value. The goal is to find an optimal policy using a direct search algorithm while allowing continuous actions.
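
One common way to produce a bounded continuous action is to squash a linear score with `tanh`, which keeps the force inside the [-1, 1] action range of MountainCarContinuous-v0. The sketch below is an assumption for illustration (the function name and the three-weight layout are hypothetical) and differs from the agent in the next cell, which takes the maximum of three linear outputs.

```python
import math


def continuous_action(weights, position, velocity):
    """Hypothetical policy: weights = (w_position, w_velocity, bias)."""
    w_pos, w_vel, bias = weights
    # tanh maps any real score into (-1, 1), so no clipping is needed
    force = math.tanh(w_pos * position + w_vel * velocity + bias)
    return [force]  # one-element list, the shape the notebook passes to env.step


# A large velocity weight approximates "push in the direction of motion"
print(continuous_action((0.0, 100.0, 0.0), -0.5, 0.03))
print(continuous_action((0.0, 100.0, 0.0), -0.5, -0.03))
```

With this parameterization the search space has three dimensions instead of six.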

In [None]:
## Your import ?
import cma

class AgentContinuous:
    def __init__(self):
        """
        Init a new agent.
        """

        self.scaler = StandardScaler()

        init_samples = np.array([[np.random.uniform(-1.2, 0.6), np.random.uniform(-0.07, 0.07)] for _ in range(10000)])
        self.scaler.fit(init_samples)
        
    def preprocessing(self, state):
        """
        Returns the featurized representation for a state.
        """
        return self.scaler.transform([state])[0]

    def train(self):
        """
        Learn your policy.

        Possible action: real
        Range observation (tuple):
            - position: [-1.2, 0.6]
            - velocity: [-0.07, 0.07]
        """
        
        def objective_function(W):
            total = 0
            env = gym.make("MountainCarContinuous-v0")
            env.seed(np.random.randint(1000))
            state = self.preprocessing(env.reset())
            done = False
            while not done:
                # Three linear outputs; the largest one is applied as the force
                actions = np.dot(state, W.reshape(2, 3))
                action = [max(actions)]
                state, reward, done, info = env.step(action)
                state = self.preprocessing(state)
                total += -1  # count steps; the environment's own reward is not used
            env.close()
            return -total  # loss = episode length (to be minimized)
        
        self.policy_opt, _ = cma.fmin2(objective_function, np.zeros(6), 0.5, restarts = 5)

        
    def act(self, state):
        """
        Acts given an observation of the environment (using learned policy).

        Takes as argument an observation of the current state, and
        returns the chosen action.
        See environment documentation: https://github.com/openai/gym/wiki/MountainCar-v0
        Possible action: real
        Range observation (tuple):
            - position: [-1.2, 0.6]
            - velocity: [-0.07, 0.07]
        """
        state = self.preprocessing(state)
        return [max(np.dot(state,self.policy_opt.reshape(2,3)))]

In [None]:
agent_continuous = AgentContinuous()
agent_continuous.train()

(4_w,9)-aCMA-ES (mu_w=2.8,w_1=49%) in dimension 6 (seed=644014, Wed Dec  9 22:23:30 2020)
Iterat #Fevals   function value  axis ratio  sigma  min&max std  t[m:s]
    1      9 1.800000000000000e+02 1.0e+00 4.47e-01  4e-01  5e-01 0:00.6
    2     18 1.820000000000000e+02 1.2e+00 4.73e-01  5e-01  5e-01 0:00.9
    3     27 1.850000000000000e+02 1.4e+00 4.77e-01  4e-01  5e-01 0:01.2
   24    216 8.400000000000000e+01 3.4e+00 2.48e+00  2e+00  4e+00 0:04.2
   62    558 7.800000000000000e+01 4.8e+00 1.96e+00  1e+00  2e+00 0:08.3
  100    900 7.900000000000000e+01 1.1e+01 7.06e+00  2e+00  8e+00 0:12.5
  169   1521 7.800000000000000e+01 2.7e+01 1.01e+00  6e-02  9e-01 0:18.5
  200   1800 7.900000000000000e+01 5.9e+01 5.81e-01  2e-02  5e-01 0:21.1
  216   1944 7.900000000000000e+01 9.2e+01 5.78e-01  2e-02  5e-01 0:22.5
termination on tolflatfitness=1 (Wed Dec  9 22:23:53 2020)
final/bestever f-value = 8.300000e+01 7.600000e+01
incumbent solution: [-0.7928418162878743, 8.802916765383618, 0.16612006

        geno-pheno transformation introduced based on the
        current covariance matrix with condition 1.0e+12 -> 1.0e+00,
        injected solutions become "invalid" in this iteration (class=CMAEvolutionStrategy method=alleviate_conditioning iteration=340)
  ')')


  342  54741 7.500000000000000e+01 1.4e+00 6.29e+01  5e+01  6e+01 2:39.8
  387  57981 7.600000000000000e+01 3.3e+01 1.77e+02  6e+01  3e+02 2:58.9
  400  58917 7.700000000000000e+01 3.6e+01 1.97e+02  8e+01  2e+02 3:04.5
  450  62517 7.700000000000000e+01 4.0e+02 1.11e+02  3e+01  1e+02 3:25.8
  500  66117 7.500000000000000e+01 1.3e+03 1.08e+02  2e+01  1e+02 3:46.9
  555  70077 7.500000000000000e+01 6.3e+03 2.04e+02  1e+01  2e+02 4:10.3
  600  73317 7.800000000000000e+01 1.3e+04 2.15e+02  1e+01  1e+02 4:29.4
  635  75837 7.600000000000000e+01 4.8e+04 1.90e+02  9e+00  1e+02 4:44.4
termination on tolstagnation=263 after 3 restarts (Wed Dec  9 22:32:16 2020)
final/bestever f-value = 7.900000e+01 7.500000e+01
incumbent solution: [390.3066912901122, -11.54953330436183, 532.7033630229998, -163.3748044181266, 778.5131189537933, 358.7532247072086]
std deviation: [58.58081544363217, 137.96665852665538, 73.45051694019735, 10.0025506279262, 9.307829781445122, 113.89919315582063]
(72_w,144)-aCMA-ES (

### Testing

In [None]:
niter = 5000

In [None]:
env = gym.make("MountainCarContinuous-v0").env
env.seed(np.random.randint(1, 1000))
env.reset()

try:
    for t in range(1, niter + 1):
        sys.stdout.flush()
        action = agent_continuous.act(env.state)
        state, reward, done, info = env.step(action)

        # update the visualization
        env.render()

        # check whether the goal position has been reached
        if state[0] >= 0.5:
            print("\rTop reached at t = {}".format(t))
            break
        elif t == niter:
            print("\rFailed to reach the top")
finally:
    env.close()

Top reached at t = 79


## 3. Grading
Run all cells and send the output PDF to heri(at)lri(dot)fr before December 9th, 2020 at 23:59.