# Continuous Control

---

Congratulations on completing the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program! In this notebook, you will learn how to control an agent in a more challenging environment, where the goal is to train a creature with four arms to walk forward. **Note that this exercise is optional!**

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.registry import default_registry
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Crawler.app"`
- **Windows** (x86): `"path/to/Crawler_Windows_x86/Crawler.exe"`
- **Windows** (x86_64): `"path/to/Crawler_Windows_x86_64/Crawler.exe"`
- **Linux** (x86): `"path/to/Crawler_Linux/Crawler.x86"`
- **Linux** (x86_64): `"path/to/Crawler_Linux/Crawler.x86_64"`
- **Linux** (x86, headless): `"path/to/Crawler_Linux_NoVis/Crawler.x86"`
- **Linux** (x86_64, headless): `"path/to/Crawler_Linux_NoVis/Crawler.x86_64"`

For instance, if you are using a Mac, then you downloaded `Crawler.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Crawler.app")
```

In [2]:
# env = UnityEnvironment(file_name='./Crawler_Linux/Crawler.x86_64')
env = UnityEnvironment(file_name='/home/luis-ferro/test/unity-mlagents/Playground/Builds/Crawler.x86_64')
# env = default_registry['CrawlerStaticTarget'].make()
env.reset()

Environments expose **_behaviors_** (called **_brains_** in older ML-Agents releases), which are responsible for deciding the actions of their associated agents. Here we take the first behavior available and use it as the default behavior we will control from Python.

In [3]:
behavior_name = list(env.behavior_specs)[0]
behavior_spec = env.behavior_specs[behavior_name]
behavior_spec

BehaviorSpec(observation_specs=[ObservationSpec(shape=(126,), dimension_property=(<DimensionProperty.NONE: 1>,), observation_type=<ObservationType.DEFAULT: 0>, name='PhysicsBodySensor:Body'), ObservationSpec(shape=(32,), dimension_property=(<DimensionProperty.NONE: 1>,), observation_type=<ObservationType.DEFAULT: 0>, name='VectorSensor_size32')], action_spec=ActionSpec(continuous_size=20, discrete_branches=()))
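
The spec above reports two observation vectors per agent, of sizes 126 and 32, while the cells that follow read only `obs[0]`. As a minimal sketch, assuming you want the policy to see both, the per-agent vectors can be concatenated into a single 158-dimensional state. The arrays here are zero-filled stand-ins for `decision_steps.obs`, and `flatten_observations` is a hypothetical helper, not part of ML-Agents.

```python
import numpy as np

def flatten_observations(obs_list):
    """Concatenate a list of (num_agents, dim_k) arrays along the feature axis."""
    return np.concatenate([np.asarray(o, dtype=np.float32) for o in obs_list], axis=1)

# Stand-in for decision_steps.obs with 9 agents (shapes taken from the spec above)
fake_obs = [np.zeros((9, 126)), np.zeros((9, 32))]
states = flatten_observations(fake_obs)
print(states.shape)  # (9, 158)
```

If you go this route, the network's input size would need to be 158 rather than 126.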

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [4]:
# Reset the Environment
env.reset()
behavior_spec = env.behavior_specs[behavior_name]

# Number of agents
decision_steps, terminal_steps = env.get_steps(behavior_name)
num_agents = len(decision_steps)
print(f"Number of agents: {num_agents}")

# Size of each action
action_size = behavior_spec.action_spec.continuous_size
print(f"Size of each action: {action_size}")

# Examine the state space
obs_specs = behavior_spec.observation_specs
num_obs = len(obs_specs)
state_size = obs_specs[0].shape
print(f"Number of observations: {num_obs}")
print(f"Observation space: {state_size}")
print(f"There are {num_agents} agents. Each observes a state with length: {state_size}")
print("The state of the first agent looks like: \n", decision_steps[0])

Number of agents: 9
Size of each action: 20
Number of observations: 2
Observation space: (126,)
There are 9 agents. Each observes a state with length: (126,)
The state of the first agent looks like: 
 DecisionStep(obs=[array([-1.5258789e-05, -7.2157383e-03,  1.5258789e-05, -6.8298455e-06,
       -3.4842911e-01, -2.0962533e-04, -9.3733513e-01,  6.6253662e-02,
       -1.3015985e-01, -8.9758301e-01,  6.8731236e-01,  2.6850380e-02,
        2.5180817e-02, -7.2542858e-01,  1.4569092e-01, -1.7535329e-01,
       -1.9746704e+00,  7.3637313e-01,  2.5047736e-02,  2.6944000e-02,
       -6.7557472e-01, -8.9758301e-01, -1.3001561e-01, -6.6207886e-02,
        5.0243717e-01, -4.9525428e-01, -4.6685135e-01, -5.3322589e-01,
       -1.9746704e+00, -1.7414832e-01, -1.4564514e-01,  5.3953069e-01,
       -4.6035105e-01, -5.0130951e-01, -4.9565330e-01, -6.6192627e-02,
       -1.3012457e-01,  8.9759827e-01,  2.5262259e-02, -7.2547233e-01,
       -6.8726772e-01, -2.6736440e-02, -1.4559937e-01, -1.7521238e-01,


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance as it selects a random action at each time step. A window should pop up that allows you to observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env.reset()
behavior_name = list(env.behavior_specs)[0]
behavior_spec = env.behavior_specs[behavior_name]
ds, ts = env.get_steps(behavior_name)
states = ds.obs[0]
scores = np.zeros(num_agents)
dones = np.zeros(num_agents)
while True:
    # Select random action for each agent
    action = behavior_spec.action_spec.random_action(num_agents)
    # Set the actions
    env.set_actions(behavior_name, action)
    # Move the simulation one step ahead
    env.step()
    # Get the s,a,r,ns tuple
    ds, ts = env.get_steps(behavior_name)
    if len(ts) > 0:
        for agent_id in ts:
            scores[agent_id] += ts[agent_id].reward
            dones[agent_id] = 1
        break
    next_states = ds.obs[0]
    scores += ds.reward
    states = next_states

print(f"Total score (averaged over agents) this episode: {np.mean(scores)}")

Total score (averaged over agents) this episode: -0.20532982376537331
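
For reference, here is a rough NumPy equivalent of what a random policy produces. It assumes continuous actions are drawn uniformly from [-1, 1); treat that distribution as an assumption about the environment rather than a documented guarantee. The sizes match the values printed in section 2.

```python
import numpy as np

num_agents, action_size = 9, 20  # values printed in section 2

rng = np.random.default_rng(seed=0)
# One row of 20 continuous actions per agent, each component in [-1, 1)
random_actions = rng.uniform(-1.0, 1.0, size=(num_agents, action_size)).astype(np.float32)
print(random_actions.shape)
```

An array of this shape is what the `continuous` argument of `ActionTuple` expects later in the notebook.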


When finished, you can close the environment.

In [6]:
#env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!

In [7]:
import numpy as np
from collections import namedtuple, deque
from typing import List

Experience = namedtuple("Experience", field_names=["state", "prob", "val", "action", "reward", "done"])

class PPOMemory:

    def __init__(self, memory_size: int = int(1e6), batch_size: int = 256) -> None:
        self.trajectories = deque(maxlen=memory_size)
        self.batch_size = batch_size
        self.current_trajectory = []
        self.size = 0

    def store(self, state, action, probs, vals, reward, done) -> None:
        self.current_trajectory.append(
            Experience(state, probs, vals, action, reward, done)
        )

    def end_trajectory(self):
        self.trajectories.append(self.current_trajectory.copy())
        self.current_trajectory = []
        self.size = len(self.trajectories)

    def clear(self):
        # Clear in place so the deque (and its maxlen) is preserved
        self.trajectories.clear()
        self.current_trajectory = []
        self.size = 0

    def __len__(self):
        return self.size

    def generate_batches(self, max_t: int) -> List[List[Experience]]:
        # First, select trajectories at random
        indices = np.arange(self.size, dtype=np.int32)
        indices = np.random.choice(indices, size=self.batch_size, replace=False)
        trajectories = [self.trajectories[i] for i in indices]

        # Second, for each trajectory select a random start step with max_t size
        return [self.__select_sub_trajectory(t, max_t) for t in trajectories]
    
    @staticmethod
    def __select_sub_trajectory(trajectory, max_t: int):
        steps = len(trajectory)
        # Valid start indices are 0..steps - max_t; trajectories shorter than max_t start at 0
        start = np.random.randint(0, max(steps - max_t, 0) + 1)
        return trajectory[start:start + max_t]


In [8]:
memory = PPOMemory(memory_size=10, batch_size=5)

for i in range(50):
    for _ in range(10):
        s, a, p, v, r, d = np.random.randint(0, 100), np.random.randint(0, 100), np.random.randint(0, 100), np.random.randint(0, 100), np.random.randint(0, 100), np.random.randint(0, 100)
        memory.store(s, a, p, v, r, d)
    memory.end_trajectory()

batches = memory.generate_batches(max_t=3)


In [9]:
batches

[[Experience(state=42, prob=4, val=95, action=33, reward=41, done=71),
  Experience(state=98, prob=39, val=93, action=95, reward=67, done=67),
  Experience(state=10, prob=88, val=17, action=32, reward=47, done=63)],
 [Experience(state=51, prob=65, val=21, action=72, reward=86, done=91),
  Experience(state=1, prob=61, val=23, action=48, reward=55, done=91),
  Experience(state=73, prob=85, val=13, action=8, reward=59, done=12)],
 [Experience(state=95, prob=1, val=80, action=14, reward=27, done=12),
  Experience(state=31, prob=93, val=92, action=23, reward=14, done=15),
  Experience(state=60, prob=2, val=2, action=76, reward=64, done=3)],
 [Experience(state=17, prob=38, val=88, action=51, reward=44, done=7),
  Experience(state=4, prob=46, val=14, action=15, reward=97, done=20),
  Experience(state=47, prob=52, val=4, action=75, reward=28, done=98)],
 [Experience(state=88, prob=16, val=89, action=60, reward=55, done=88),
  Experience(state=1, prob=78, val=49, action=68, reward=89, done=46),

In [10]:
import torch
import torch.nn as nn
import torch.optim as optim

class Actor(nn.Module):
    def __init__(
        self, 
        n_actions, 
        input_dims, 
        learning_rate: float, 
        fc_units = [512, 512, 512],
        device: str = 'cpu'):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(*input_dims, fc_units[0]),
            nn.ReLU(),
            nn.Linear(fc_units[0], fc_units[1]),
            nn.ReLU(),
            nn.Linear(fc_units[1], fc_units[2]),
            nn.ReLU()
        )
        self.mu = nn.Linear(fc_units[2], n_actions)
        self.sigma = nn.Linear(fc_units[2], 1)

        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        self.device = device
        self.to(self.device)

    def forward(self, state):
        x = self.network(state)
        mu = self.mu(x)
        sigma = self.sigma(x)
        return mu, sigma

class Critic(nn.Module):

    def __init__(
        self,
        input_dims,
        learning_rate: float,
        fc_units = [512, 512, 512],
        device: str = 'cpu'):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(*input_dims, fc_units[0]),
            nn.ReLU(),
            nn.Linear(fc_units[0], fc_units[1]),
            nn.ReLU(),
            nn.Linear(fc_units[1], fc_units[2]),
            nn.ReLU(),
            nn.Linear(fc_units[2], 1)
        )
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        self.device = device
        self.to(self.device)

    def forward(self, state):
        return self.network(state)
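

The actor returns a mean `mu` per action dimension and a single raw `sigma` output, which the `act` method later exponentiates to obtain a positive standard deviation. As a minimal NumPy sketch of how those two outputs define a log-probability, the function below evaluates a diagonal Gaussian and sums over the action dimensions. Summing is a common convention for a joint log-probability and is an assumption here; the notebook's `act` keeps per-dimension values. `gaussian_log_prob` is a hypothetical helper.

```python
import numpy as np

def gaussian_log_prob(action, mu, raw_sigma):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    sigma = np.exp(raw_sigma)  # exponentiate so the standard deviation is positive
    per_dim = -0.5 * ((action - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
    return per_dim.sum(axis=-1)

mu = np.zeros((9, 20))        # one mean per action dimension
raw_sigma = np.zeros((9, 1))  # one shared scale per agent, broadcast over actions
log_p = gaussian_log_prob(mu, mu, raw_sigma)
print(log_p.shape)  # (9,)
```

Because `raw_sigma` has shape `(9, 1)`, it broadcasts across all 20 action dimensions, mirroring the single-output `sigma` layer in `Actor`.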

In [46]:
class Agent:
    def __init__(
        self,
        n_actions,
        input_dims,
        gamma: float=0.99,
        learning_rate: float=1e-3,
        gae_lambda: float=0.95,
        policy_clip: float=0.2,
        batch_size: int=256,
        memory_size: int=int(1e6),
        n_epochs: int = 10,
        max_t: int = 5,
        device: str = 'cpu'
        ) -> None:
        
        self.gamma = gamma
        self.policy_clip = policy_clip
        self.n_epochs = n_epochs
        self.gae_lambda = gae_lambda
        self.device = device
        self.max_t = max_t
        self.n_actions = n_actions
        self.batch_size = batch_size

        self.actor = Actor(n_actions, input_dims, learning_rate, device=device)
        self.critic = Critic(input_dims, learning_rate, device=device)
        self.memory = PPOMemory(memory_size, batch_size)

    def remember(self, state, action, probs, vals, reward, done):
        self.memory.store(state, action, probs, vals, reward, done)

    def end_trajectory(self):
        self.memory.end_trajectory()

    def act(self, state):
        state = torch.tensor(state, dtype=torch.float).to(self.device)

        mu, sigma = self.actor(state)
        sigma = torch.exp(sigma)
        action_probs = torch.distributions.Normal(mu, sigma)
        probs = action_probs.sample(sample_shape=torch.Size([1])).squeeze()
        # probs = action_probs.sample(sample_shape=torch.Size([self.n_actions]))
        
        log_probs = action_probs.log_prob(probs).to(self.device)
        action = torch.tanh(probs).cpu().data.numpy()
        value = self.critic(state)
        return action, log_probs, value

    def learn(self):
        for _ in range(self.n_epochs):
            # batch of trajectory segments
            batch = self.memory.generate_batches(self.max_t)

            # advantage per each trajectory segment & step in trajectory segment
            # shape: (batch_size, max_t)
            advantage = self.__calc_advantage(batch)
            for i, t_segment in enumerate(batch):
                # "state", "prob", "val", "action", "reward", "done"
                states = [t.state for t in t_segment]
                old_probs = [t.prob for t in t_segment]
                values = [t.val for t in t_segment]
                actions = [t.action for t in t_segment]

                dist = self.actor(states)
                critic_value = self.critic(states)

                new_probs = dist.log_prob(actions)
                prob_ratio = new_probs.exp() / old_probs.exp()
                weighted_probs = advantage[i] * prob_ratio
                weighted_clipped_probs = torch.clamp(prob_ratio, 1 - self.policy_clip, 1 + self.policy_clip) * advantage[i]
                actor_loss = -torch.min(weighted_probs, weighted_clipped_probs).mean()

                returns = advantage[i] + values
                critic_loss = (returns-critic_value) ** 2
                critic_loss = critic_loss.mean()
                
                total_loss = actor_loss + 0.5 * critic_loss
                self.actor.optimizer.zero_grad()
                self.critic.optimizer.zero_grad()
                total_loss.backward()
                self.actor.optimizer.step()
                self.critic.optimizer.step()
            
    def __calc_advantage(self, batch):
        advantage = []
        for i, t_segment in enumerate(batch):
            discount = 1.0
            a_t = 0
            advantage.append([])
            for j in range(len(t_segment) - 1):
                print("Shapes:")
                print(f"reward: {t_segment[j].reward.shape}")
                a_t += discount * (t_segment[j].reward + self.gamma * t_segment[j+1].val * (1 - int(t_segment[j].done)) - t_segment[j].val)
                discount *= self.gamma * self.gae_lambda
                advantage[i].append(a_t)
        return advantage
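

The `learn` method above follows the PPO clipped-surrogate recipe: form the probability ratio between the new and old policy, weight it by the advantage, and take the minimum of the clipped and unclipped terms. As a minimal NumPy sketch of that computation (forward pass only, no gradients), assuming log-probabilities and advantages are already aligned arrays of the same shape:

```python
import numpy as np

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip=0.2):
    """PPO clipped objective, returned as a loss to minimise."""
    ratio = np.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return -np.minimum(unclipped, clipped).mean()

old_lp = np.log(np.array([0.5, 0.5]))
new_lp = np.log(np.array([1.0, 0.25]))  # ratios 2.0 and 0.5
adv = np.array([1.0, 1.0])
loss = clipped_surrogate_loss(new_lp, old_lp, adv)
print(loss)
```

With positive advantages, the ratio of 2.0 is capped at 1.2 while the ratio of 0.5 is left as is, so the policy gains nothing from moving further than the clip range allows. Computing the ratio as `exp(new - old)` avoids dividing two exponentials.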
            


In [47]:
from tqdm import tqdm
from mlagents_envs.base_env import ActionTuple

def ppo(
    agent: Agent,
    env,
    n_episodes=1000,
    horizon=300,
    learn_every=20,
):
    scores_window = deque(maxlen=100)
    scores = []
    avg_scores = []
    solved = False
    n_steps = 0
    with tqdm(total=n_episodes) as progress:
        for i_episode in range(1, n_episodes + 1):
            env.reset()
            ds, _ = env.get_steps(behavior_name)
            states = ds.obs[0]
            num_agents = len(ds)
            score = np.zeros((num_agents, 1))            
            for t in range(horizon):
                actions, probs, value = agent.act(states)
                action_tuple = ActionTuple(continuous=actions)

                env.set_actions(behavior_name, action_tuple)
                env.step()
                ds, ts = env.get_steps(behavior_name)
                dones = np.zeros((num_agents, 1))

                if len(ds) > 0:
                    next_states = ds.obs[0]
                    rewards = ds.reward
                    rewards = np.expand_dims(np.asanyarray(rewards), axis=1)

                    agent.remember(states, actions, probs, value, rewards, dones)
                    states = next_states
                    score += rewards
                
                if len(ts) > 0:
                    agent_ids = [ai for ai in ts]
                    states = states[agent_ids, :]
                    next_states = ts.obs[0]
                    rewards = ts.reward
                    rewards = np.expand_dims(np.asanyarray(rewards), axis=1)
                    dones = np.ones((len(ts), 1))
                    agent.remember(states, actions, probs, value, rewards, dones)
                    agent.end_trajectory()
                    for ai in agent_ids:
                        score[ai] += ts[ai].reward
                    break
                
                # Count steps so that learn_every actually throttles updates
                n_steps += 1
                if len(agent.memory.trajectories) > agent.batch_size and n_steps % learn_every == 0:
                    agent.learn()
                    
            score = np.mean(score)
            scores_window.append(score)
            scores.append(score)
            avg_score = np.mean(scores_window)
            avg_scores.append(avg_score)

            progress.set_postfix({"Avg. Score": f"{avg_score: .2f}"})
            progress.update()

            if avg_score >= 3000.0:
                print(f"Environment solved at {i_episode} episodes with Avg. Score: {avg_score:.2f}")
                solved = True
                break

    return scores, avg_scores, solved            
                    

In [48]:
%%time

device = "cuda" if torch.cuda.is_available() else "cpu"
n_episodes = 100
agent = Agent(
    action_size, 
    state_size,
    batch_size=16
)
scores, avg_scores, solved = ppo(agent, env, n_episodes=n_episodes)

 17%|█▋        | 17/100 [00:07<00:37,  2.19it/s, Avg. Score=-0.23]

Shapes:
reward: (9, 1)





TypeError: only size-1 arrays can be converted to Python scalars

In [39]:
len(agent.memory.trajectories[0])

7

In [50]:
batch = agent.memory.generate_batches(5)

In [53]:
gamma = 0.99
gae_lambda = 0.95
advantage = []
for i, t_segment in enumerate(batch):
    discount = 1.0
    a_t = 0
    advantage.append([])
    for j in range(len(t_segment) - 1):
        print("Shapes:")
        print(f"reward: {t_segment[j].reward.shape}")
        a_t += discount * (t_segment[j].reward + gamma * t_segment[j+1].val * (1 - int(t_segment[j].done)) - t_segment[j].val)
        discount *= gamma * gae_lambda
        advantage[i].append(a_t)

Shapes:
reward: (9, 1)


TypeError: only size-1 arrays can be converted to Python scalars
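
The `TypeError` most likely comes from `int(t_segment[j].done)`: each stored `done` is a `(9, 1)` array with one flag per agent, and `int()` only accepts size-1 arrays. One way around it, sketched below under the assumption that every step stores `(num_agents, 1)` NumPy arrays for reward, value, and done, is to stay vectorised and use `1 - dones` as a mask. The sketch uses the standard backward recursion for generalised advantage estimation, which differs from the forward accumulation in the cell above. `compute_gae` is a hypothetical helper, not the notebook's `__calc_advantage`.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalised advantage estimates for T steps.

    rewards, dones: arrays of shape (T, num_agents, 1)
    values: array of shape (T + 1, num_agents, 1); the extra entry is the bootstrap value
    """
    advantages = np.zeros_like(rewards, dtype=np.float64)
    gae = np.zeros_like(rewards[0], dtype=np.float64)
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # elementwise, so no int() conversion is needed
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * gae_lambda * mask * gae
        advantages[t] = gae
    return advantages

T, num_agents = 4, 9
rewards = np.ones((T, num_agents, 1))
values = np.zeros((T + 1, num_agents, 1))
dones = np.zeros((T, num_agents, 1))
adv = compute_gae(rewards, values, dones)
print(adv.shape)  # (4, 9, 1)
```

Values stored as PyTorch tensors would need to be detached and converted to NumPy (or the whole computation moved to `torch`) before being passed to a function like this.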