# [M2-AI-Univ. Paris Saclay] Direct Policy Search

In this practical, you are asked to put into practice what you have just learnt
about direct policy search.


In this project, you are asked to solve the classic Mountain Car problem (https://gym.openai.com/envs/MountainCar-v0/). For more details about the action and observation spaces, please refer to the OpenAI
documentation here: https://github.com/openai/gym/wiki/MountainCar-v0

In [None]:
import sys
import gym
import numpy as np

## 1. Discrete Action Spaces

You are expected to implement a direct policy search algorithm using black-box optimization algorithms (evolutionary computation: CMA-ES; differential evolution: scipy.optimize). We are in the model-free setting: the optimizer only observes the outcome of running a policy in the environment and never uses a model of the dynamics.
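To make the black-box idea concrete, here is a minimal sketch of a basic evolution strategy written with NumPy only. It is a simplified stand-in, not CMA-ES: the step size follows a fixed decay instead of being adapted, and no covariance matrix is learnt. The names `simple_es` and `sphere` and all default values are assumptions chosen for illustration.

```python
import numpy as np

def simple_es(objective, x0, sigma=0.5, popsize=20, n_elite=5, n_iter=100, seed=0):
    """Minimize `objective` with a basic (mu, lambda) evolution strategy.

    Only objective values are used: no gradients, no model of the function.
    """
    rng = np.random.default_rng(seed)
    mean = np.asarray(x0, dtype=float)
    best_x, best_f = mean.copy(), objective(mean)
    for _ in range(n_iter):
        # Sample a population of candidates around the current mean.
        candidates = mean + sigma * rng.standard_normal((popsize, mean.size))
        fitness = np.array([objective(c) for c in candidates])
        # Keep the n_elite best candidates and recentre the search on them.
        elite = candidates[np.argsort(fitness)[:n_elite]]
        mean = elite.mean(axis=0)
        sigma *= 0.97  # crude fixed decay (assumption); CMA-ES adapts this instead
        # Remember the best candidate seen so far.
        if fitness.min() < best_f:
            best_f = fitness.min()
            best_x = candidates[fitness.argmin()].copy()
    return best_x, best_f

# Toy black-box function with its minimum at (1, -2).
def sphere(x):
    return float(np.sum((x - np.array([1.0, -2.0])) ** 2))

x_best, f_best = simple_es(sphere, x0=np.zeros(2))
```

In the practical, the toy function would be replaced by an objective that runs a policy in the environment and returns a loss, and the hand-written loop by `cma` or `scipy.optimize.differential_evolution`.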

In order to train your agent efficiently, you must (ref. page 58; Michèle's slides):
* Define your search space (the policy space in which you are willing to search)
* Define your objective function, used to assess a policy (episode-based or step-based)
* Optimize the objective using a black-box optimizer (CMA-ES: use https://pypi.org/project/cma/ ; differential evolution: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html)
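As an illustration of the first two steps, the sketch below uses a linear policy (one score per action, computed from rescaled features) as the search space and an episode-based loss as the objective. Everything here is an assumption made for illustration: the feature scaling, the function names (`features`, `policy_action`, `step`, `episode_loss`), and in particular `step`, which is a simplified stand-in for `env.step` based on my understanding of the MountainCar-v0 update rule. Use the real environment for the assignment.

```python
import numpy as np

N_ACTIONS = 3  # 0: push left, 1: no push, 2: push right

def features(obs):
    """Rescale (position, velocity) to roughly [-1, 1] and append a bias term."""
    position, velocity = obs
    return np.array([(position + 0.3) / 0.9, velocity / 0.07, 1.0])

def policy_action(theta, obs):
    """Linear policy: one score per action, pick the highest-scoring action."""
    weights = np.asarray(theta, dtype=float).reshape(N_ACTIONS, 3)
    return int(np.argmax(weights @ features(obs)))

def step(obs, action):
    """Stand-in for env.step (assumed dynamics, not the gym implementation)."""
    position, velocity = obs
    velocity += (action - 1) * 0.001 - 0.0025 * np.cos(3 * position)
    velocity = np.clip(velocity, -0.07, 0.07)
    position = np.clip(position + velocity, -1.2, 0.6)
    if position <= -1.2 and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return (position, velocity), position >= 0.5

def episode_loss(theta, max_steps=200, start=(-0.5, 0.0)):
    """Episode-based objective: number of steps needed to reach the flag."""
    obs = start
    for t in range(1, max_steps + 1):
        obs, done = step(obs, policy_action(theta, obs))
        if done:
            return t
    return max_steps  # the flag was never reached
```

The search space is then the 9-dimensional vector `theta`, and a call such as `cma.fmin2(episode_loss, np.zeros(9), 0.5)` would optimize it. Note that a loss that only counts steps is flat wherever the policy never reaches the flag, so a shaped loss (for example, one that also rewards the highest position reached) may be easier to optimize.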

Complete the `Agent` class:
1. `train` method: optimizes the objective function to obtain the optimal policy
2. `act` method: uses the optimal policy to output an action for each state


In [None]:
## Your imports?
import cma

class Agent:
    def __init__(self):
        """
        Init a new agent.
        """
        pass

    def train(self):
        """
        Learn your policy.

        Possible action: [0, 1, 2]
        Range observation (tuple):
            - position: [-1.2, 0.6]
            - velocity: [-0.07, 0.07]
        """
        # 1- Define state features
        # 2- Define search space (to define a policy)
        # 3- Define objective function (for policy evaluation)
        # 4- Optimize the objective function
        # 5- Save optimal policy

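        # Placeholder example: the `policy` argument is ignored and actions are
        # sampled at random, so the optimizer has nothing to learn yet.
        # Replace the random action with one computed from `policy` and `state`.
        def objective_function(policy):
            total = 0
            env = gym.make("MountainCar-v0")
            env.seed(np.random.randint(1000))
            state = env.reset()
            done = False
            while not done:
                action = np.random.choice([0, 1, 2]) # random action :/
                state, reward, done, info = env.step(action)
                total += -1  # every step costs -1
            return - total # loss = number of steps in the episode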
        
        policy_opt, _ = cma.fmin2(objective_function, np.zeros(2), 0.5)

        
    def act(self, observation):
        """
        Acts given an observation of the environment (using learned policy).

        Takes as argument an observation of the current state, and
        returns the chosen action.
        See environment documentation: https://github.com/openai/gym/wiki/MountainCar-v0
        Possible action: [0, 1, 2]
        Range observation (tuple):
            - position: [-1.2, 0.6]
            - velocity: [-0.07, 0.07]
        """
        return np.random.choice([0, 1, 2])

In [None]:
agent = Agent()
agent.train()

### Testing

Run a simulation to test your trained agent.

In [None]:
niter = 5000

In [None]:
env = gym.make("MountainCar-v0").env
env.seed(np.random.randint(1, 1000))
env.reset()

try:
    for t in range(1, niter+1):
        sys.stdout.flush()
        action = agent.act(env.state)
        state, reward, done, info = env.step(action)

        # update the visualization
        env.render()

        # stop as soon as the car reaches the flag (position >= 0.5)
        if state[0] >= 0.5:
            print("\rTop reached at t = {}".format(t))
            break
        elif t == niter:
            print("\rFailed to reach the top")
finally:
    env.close()

## 2. Continuous Action Spaces

Unlike in MountainCar-v0, the action in MountainCarContinuous-v0 (the engine force applied) is a continuous value. The goal is to find an optimal policy using a direct policy search algorithm while allowing continuous actions.
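One possible way to parametrize a continuous policy is to squash a linear score through `tanh`, which keeps the action in [-1, 1]. This is a sketch under assumptions: the feature scaling, the `tanh` squashing, and the names `features` and `continuous_action` are illustrative choices, not part of the assignment.

```python
import numpy as np

def features(obs):
    """Rescale (position, velocity) to roughly [-1, 1] and append a bias term."""
    position, velocity = obs
    return np.array([(position + 0.3) / 0.9, velocity / 0.07, 1.0])

def continuous_action(theta, obs):
    """Deterministic continuous policy: squash a linear score into [-1, 1]."""
    score = float(np.dot(theta, features(obs)))
    # Returned as a one-element list, like the random action in the skeleton.
    return [float(np.tanh(score))]
```

Here the search space is the 3-dimensional vector `theta`. For example, `theta = [0, 5, 0]` pushes in the direction of the current velocity.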

In [None]:
## Your imports?
import cma

class AgentContinuous:
    def __init__(self):
        """
        Init a new agent.
        """
        pass

    def train(self):
        """
        Learn your policy.

        Possible action: real value in [-1, 1]
        Range observation (tuple):
            - position: [-1.2, 0.6]
            - velocity: [-0.07, 0.07]
        """
        
        # Placeholder example: `policy` is ignored and actions are random.
        def objective_function(policy):
            total = 0
            env = gym.make("MountainCarContinuous-v0")
            env.seed(np.random.randint(1000))
            state = env.reset()
            done = False
            while not done:
                action = [np.random.uniform(-1, 1)] # random action :/
                state, reward, done, info = env.step(action)
                total += -1  # count one unit of cost per step
            return - total # loss = number of steps in the episode
        
        policy_opt, _ = cma.fmin2(objective_function, np.zeros(2), 0.5)

        
    def act(self, observation):
        """
        Acts given an observation of the environment (using learned policy).

        Takes as argument an observation of the current state, and
        returns the chosen action.
        See environment documentation: https://github.com/openai/gym/wiki/MountainCarContinuous-v0
        Possible action: real value in [-1, 1]
        Range observation (tuple):
            - position: [-1.2, 0.6]
            - velocity: [-0.07, 0.07]
        """
        return [np.random.uniform(-1, 1)]

In [None]:
agent_continuous = AgentContinuous()
agent_continuous.train()

### Testing

In [None]:
niter = 5000

In [None]:
env = gym.make("MountainCarContinuous-v0").env
env.seed(np.random.randint(1, 1000))
env.reset()

try:
    for t in range(1, niter+1):
        sys.stdout.flush()
        action = agent_continuous.act(env.state)
        state, reward, done, info = env.step(action)

        # update the visualization
        env.render()

        # stop as soon as the car reaches the flag (position >= 0.5)
        if state[0] >= 0.5:
            print("\rTop reached at t = {}".format(t))
            break
        elif t == niter:
            print("\rFailed to reach the top")
finally:
    env.close()

## 3. Grading
Run all cells and send the output PDF to heri(at)lri(dot)fr before December 9th, 2020 at 23:59.