## README

#### Overview of DDPG algorithm employed

This notebook implements the Deep Deterministic Policy Gradient (DDPG) algorithm, an off-policy actor-critic reinforcement learning method for continuous action spaces.

The DDPG algorithm uses two neural networks: an actor network that decides what actions to take, and a critic network that determines how good those actions are.

1. The Actor (Low Bias, High Variance): The actor tries to learn the optimal policy directly and thus has low bias. However, the signal it learns from, the critic's estimate of the action-value function, can be noisy and have high variance, especially early in training when the critic itself is still learning.

2. The Critic (High Bias, Low Variance): The critic estimates the action-value (Q-value) function of the actor's current policy. It is trained with Temporal Difference (TD) learning, which uses the Bellman equation to build each target from the critic's own estimate of the next state-action value rather than from a full observed return. Bootstrapping from other estimates is inherently an approximation and introduces bias, but it reduces variance compared with learning from complete sampled returns.

Together, the interaction of the actor and critic can help balance out these considerations, with the actor providing direction for improvement and the critic correcting for errors in these directions based on its value estimations.
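The critic's bootstrapped target can be sketched numerically. The snippet below is a minimal illustration with made-up rewards and Q-value estimates; `td_target` is a hypothetical helper name, and the discount factor 0.995 matches the `GAMMA` constant used later in this notebook.

```python
import numpy as np

def td_target(rewards, q_next, dones, gamma=0.995):
    """One-step TD target: r + gamma * Q_target(s', a') * (1 - done)."""
    rewards = np.asarray(rewards, dtype=float)
    q_next = np.asarray(q_next, dtype=float)
    dones = np.asarray(dones, dtype=float)
    return rewards + gamma * q_next * (1.0 - dones)

# Illustrative values only (not taken from a real training run)
rewards = [0.1, 0.0, 0.1]
q_next = [2.0, 1.5, 3.0]   # target critic's estimate for the next state-action pair
dones = [0, 0, 1]          # the last transition ends the episode

targets = td_target(rewards, q_next, dones)

# The critic is trained to shrink the mean squared gap to these targets
q_expected = np.array([2.0, 1.4, 0.3])   # stand-in for the critic's current predictions
critic_loss = np.mean((q_expected - targets) ** 2)
```

For the terminal transition the bootstrap term is masked out, so its target is just the reward.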

##### Process Summary

1. The environment is initialized along with the DDPG agents.
2. For each episode, the environment is reset and the agent selects an action based on its current policy (the actor network).
3. The environment executes the action and returns the next state, reward, and whether the episode has ended.
4. This information is added to a replay buffer, a memory structure used to store past experiences.
5. Periodically, the agent samples a batch of experiences from the replay buffer and uses it to update the policy (actor network) and the value function (critic network).
6. The actor network is updated to maximize the expected return as evaluated by the critic network. The critic network is updated to minimize the difference between its Q-value prediction and the TD target (the reward plus the discounted Q-value that the target networks assign to the next state).
7. The target actor and critic networks are softly updated toward their regular counterparts, which stabilizes learning by letting the targets change slowly.
8. This loop continues until termination.
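The soft update in step 7 can be illustrated with plain NumPy arrays standing in for network weights. This is a sketch rather than the notebook's implementation: the standalone `soft_update` function below is hypothetical, and `tau=0.01` mirrors the `TAU` constant defined later.

```python
import numpy as np

def soft_update(regular, target, tau):
    """Move the target weights a fraction tau toward the regular (online) weights."""
    return tau * regular + (1.0 - tau) * target

regular_w = np.array([1.0, 1.0])   # stand-in for the online network weights
target_w = np.array([0.0, 0.0])    # stand-in for the target network weights

for _ in range(100):
    target_w = soft_update(regular_w, target_w, tau=0.01)

# With fixed online weights, the remaining gap shrinks by (1 - tau) per update
gap = regular_w - target_w
```

With `tau=1.0` the same function performs a full copy of the regular weights.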

#### Goal

The goal of this notebook is to train the agents until they reach a predefined target: an average score of at least 30.1 (the `GOAL` constant) over the last 100 episodes. When that target is reached, the trained model parameters are saved for future use.
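The stopping criterion can be sketched as a moving average over recent episode scores. In this sketch, `GOAL = 30.1` and `SCORE_AVERAGED = 100` are taken from the constants defined later in this notebook; the function name `first_solved_episode` and the synthetic scores are illustrative assumptions.

```python
from collections import deque
import numpy as np

GOAL = 30.1            # target average score
SCORE_AVERAGED = 100   # number of most recent episodes to average over

def first_solved_episode(episode_scores, goal=GOAL, window=SCORE_AVERAGED):
    """Return the 1-based episode at which the moving average first reaches goal, else None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        # As in the training loop, the mean covers however many scores are available
        if np.mean(recent) >= goal:
            return episode
    return None

# Synthetic scores: a linear rise to 40 followed by a plateau (not real results)
scores = np.concatenate([np.linspace(0.0, 40.0, 100), np.full(150, 40.0)])
solved_at = first_solved_episode(scores)
```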

#### Content

The following sections contain:
- Random model: agents taking random actions, which does not achieve the result (commented out)
- One-agent model: DDPG implemented with a single agent, which is slow to achieve results (commented out)
- Multi-agent model: DDPG with 20 agents, the model selected to achieve the desired result

#### Last notes

- Restart the kernel once after running the install cells so that the newly installed packages (e.g. PyTorch) are picked up correctly
- Weights are saved to `.pth` files for reproducibility

# Continuous Control

---

You are welcome to use this coding environment to train your agent for the project.  Follow the instructions below to get started!

### 1. Start the Environment

Run the next code cell to install a few packages.  This line will take a few minutes to run!

In [1]:
!pip -q install .

ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.

Both versions of the environment are already saved in the Workspace and can be accessed at the file paths provided below.

Please select one of the two options below for loading the environment.

In [2]:
!pip install numpy



In [None]:
!pip install mlagents

In [None]:
from unityagents import UnityEnvironment
import numpy as np

# select this option to load version 1 (with a single agent) of the environment
#env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')


# select this option to load version 2 (with 20 agents) of the environment
env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')

Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [None]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

Run the code cell below to print some information about the environment.

In [12]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Note that **in this coding environment, you will not be able to watch the agents while they are training**, and you should set `train_mode=True` to restart the environment.

In [6]:
# Commented out
'''env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
    actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))'''
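The action-sampling step in the cell above can be tried without the Unity environment. A minimal sketch, assuming the 20-agent version with 4-dimensional actions reported earlier; the seeded generator is an addition for repeatability and is not part of the original cell.

```python
import numpy as np

num_agents, action_size = 20, 4   # values printed for the 20-agent environment

rng = np.random.default_rng(0)    # seeded generator (assumption, for repeatability)
actions = rng.standard_normal((num_agents, action_size))  # one random action per agent
actions = np.clip(actions, -1, 1)                         # keep every entry in [-1, 1]
```

Each row of `actions` is the action vector for one agent, which is the layout `env.step` receives in the cell above.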

When finished, you can close the environment.

In [7]:
#env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  A few **important notes**:
- When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```
- To structure your work, you're welcome to work directly in this Jupyter notebook, or you might like to start over with a new file!  You can see the list of files in the workspace by clicking on **_Jupyter_** in the top left corner of the notebook.
- In this coding environment, you will not be able to watch the agents while they are training.  However, **_after training the agents_**, you can download the saved model weights to watch the agents on your own machine! 

In [3]:
!pip install torch --quiet


#### Single agent

In [9]:
'''import numpy as np
import random
from collections import namedtuple, deque

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt

class Actor(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed):

        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)

        self.bn0 = nn.BatchNorm1d(state_size)
        self.fc1 = nn.Linear(state_size, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 64)
        self.bn2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, 32)
        self.bn3 = nn.BatchNorm1d(32)
        self.fc4 = nn.Linear(32, action_size)


    def forward(self, state):
        """Build an actor (policy) network that maps states -> actions."""
        x = self.bn0(state)
        x = F.selu(self.bn1(self.fc1(x)))
        x = F.selu(self.bn2(self.fc2(x)))
        x = F.selu(self.bn3(self.fc3(x)))
        return torch.tanh(self.fc4(x))


class Critic(nn.Module):
    def __init__(self, state_size, action_size, seed):

        super(Critic, self).__init__()
        self.seed = torch.manual_seed(seed)
        
        self.bn0 = nn.BatchNorm1d(state_size)
        self.fcs1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128 + action_size, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 16)
        self.fc5 = nn.Linear(16, 1)


    def forward(self, state, action):
        state = self.bn0(state)
        x_state = F.selu(self.fcs1(state))
        x = torch.cat((x_state, action), dim=1)
        x = F.selu(self.fc2(x))
        x = F.selu(self.fc3(x))
        x = F.selu(self.fc4(x))
        return  F.selu(self.fc5(x))


class OUNoise:
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.size = size
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        self.state = np.copy(self.mu)

    def sample(self):
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.standard_normal(self.size)
        self.state = x + dx
        return self.state


# Replay Buffer Size
BUFFER_SIZE = int(1e6)
# Minibatch Size
BATCH_SIZE = 256
# Discount Gamma
GAMMA = 0.995 
# Soft Update Value
TAU = 1e-2   
# Learning rates for each NN      
LR_ACTOR = 1e-3 
LR_CRITIC = 1e-3
# Update network every X timesteps
UPDATE_EVERY = 32
# Learn from batch of experiences n_experiences times
N_EXPERIENCES = 16   

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")

class DDPGAgent():
    """Interacts with and learns from the environment using the DDPG algorithm."""

    def __init__(self, state_size, action_size, random_seed):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(random_seed)

        # Actor Neural Network (Regular and target)
        self.actor_regular = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_regular.parameters(), lr=LR_ACTOR)

        # Critic Neural Network (Regular and target)
        self.critic_regular = Critic(state_size, action_size, random_seed).to(device)
        self.critic_target = Critic(state_size, action_size, random_seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_regular.parameters(), lr=LR_CRITIC)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, random_seed)
          
        # Ensure that both networks have the same weights
        self.deep_copy(self.actor_target, self.actor_regular)
        self.deep_copy(self.critic_target, self.critic_regular)

    def step(self, states, actions, rewards, next_states, dones, timestep):
        # Save collected experiences
        for state, action, reward, next_state, done in zip(states, actions, rewards, next_states, dones):
            self.memory.add(state, action, reward, next_state, done)

        # Learn from our buffer if possible
        if len(self.memory) > BATCH_SIZE and timestep % UPDATE_EVERY == 0:
            for _ in range(N_EXPERIENCES):
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, states):
        states = torch.from_numpy(states).float().to(device)
        
        # Switch to evaluation mode so that layers such as BatchNorm or dropout
        # use their inference behaviour instead of their training behaviour.
        self.actor_regular.eval()
        # torch.no_grad() disables gradient tracking inside the block,
        # which reduces memory usage and speeds up the forward pass.
        with torch.no_grad():
            actions = self.actor_regular(states).cpu().data.numpy()
        # Enable Training mode
        self.actor_regular.train()

        return actions


    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences
        
        #--------------------------------
        # Update the critic neural network
        #--------------------------------
        
        # Get predicted next-state actions
        actions_next = self.actor_target(next_states)
        #Get Q values from target model
        Q_targets_next = self.critic_target(next_states, actions_next)

        # Compute Q targets for current states
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Calculate the critic loss
        Q_expected = self.critic_regular(states, actions)
        critic_loss = F.mse_loss(Q_expected, Q_targets)

        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        
        #--------------------------------
        # Update the actor neural network
        #--------------------------------
        
        # Calculate the actor loss
        actions_pred = self.actor_regular(states)
        # Negate the critic value: the optimizer minimizes the loss, while the actor should maximize Q
        actor_loss = -self.critic_regular(states, actions_pred).mean()

        # Minimize the loss function
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update target network using the soft update approach (slowly updating)
        self.soft_update(self.critic_regular, self.critic_target, TAU)
        self.soft_update(self.actor_regular, self.actor_target, TAU)


    def soft_update(self, regular_model, target_model, tau):
        # Update the target network slowly to improve the stability
        for target_param, regular_param in zip(target_model.parameters(), regular_model.parameters()):
            target_param.data.copy_(tau*regular_param.data + (1.0-tau) * target_param.data)

    def deep_copy(self, target, source):
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(param.data)


class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""
    
    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)



# Environment Goal
GOAL = 30.1
# Averaged score
SCORE_AVERAGED = 100
# Print progress every 10 episodes
PRINT_EVERY = 10
# Number of episode for training
N_EPISODES = 500
# Max Timesteps
MAX_TIMESTEPS = 1000

n_agents = 1

# Initialize environment and agent
agent = DDPGAgent(state_size, action_size, random_seed=1)

#  Method for training the agent
def train(n_episodes=N_EPISODES):
    scores_deque = deque(maxlen=SCORE_AVERAGED)
    global_scores = []
    averaged_scores = []
    
    for episode in range(1, n_episodes + 1):
        states = env.reset(train_mode=True)[brain_name].vector_observations # Get the current states for each agent
        scores = np.zeros(n_agents) # Init the score of each agent to zeros                

        for t in range(MAX_TIMESTEPS):
            actions = agent.act(states) # Act according to our policy
            env_info = env.step(actions)[brain_name] # Send the decided actions to all the agents       
            next_states = env_info.vector_observations # Get next state for each agent
            rewards = env_info.rewards # Get rewards obtained from each agent           
            dones = env_info.local_done # Info about if an env is done   
            agent.step(states, actions, rewards, next_states, dones, t)  # Learn from the collected experience
            states = next_states # Update current states
            scores += rewards # Add the rewards received
            
            # Stop the loop if an agent is done               
            if np.any(dones):                          
                break
                
        # Calculate scores and averages
        score = np.mean(scores)
        scores_deque.append(score)
        avg_score = np.mean(scores_deque)
        
        global_scores.append(score)
        averaged_scores.append(avg_score)
                
        if episode % PRINT_EVERY == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(episode, np.mean(scores_deque)))  
            
        if avg_score >= GOAL:  
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(episode, avg_score))
            torch.save(agent.actor_regular.state_dict(), 'actor_theta.pth')
            torch.save(agent.critic_regular.state_dict(), 'critic_theta.pth')
            break
            
    return global_scores, averaged_scores

# Train the agent and get the results
scores, averages = train()

# Plot statistics (per-episode scores and averaged scores)
episodes = np.arange(1, len(scores) + 1)
plt.plot(episodes, scores, label='Episode score')
plt.plot(episodes, averages, label='Average score')
plt.ylabel('Reacher Environment Score')
plt.xlabel('Episode #')
plt.legend()
plt.show()'''
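
The `OUNoise` class in the commented-out cell above implements an Ornstein-Uhlenbeck process, whose samples are temporally correlated and drift back toward the mean `mu`. The standalone sketch below reproduces the same update rule with NumPy; the class name `OUNoiseSketch` and the explicit `rng` argument are additions for illustration and are not part of the original class.

```python
import numpy as np

class OUNoiseSketch:
    """Ornstein-Uhlenbeck process: x <- x + theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, rng=None):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = rng if rng is not None else np.random.default_rng()
        self.reset()

    def reset(self):
        """Reset the internal state to the mean."""
        self.state = self.mu.copy()

    def sample(self):
        """Advance the process one step and return the new state."""
        pull = self.theta * (self.mu - self.state)                     # drift toward mu
        kick = self.sigma * self.rng.standard_normal(self.state.shape)  # random perturbation
        self.state = self.state + pull + kick
        return self.state

noise = OUNoiseSketch(size=4, rng=np.random.default_rng(0))
samples = np.array([noise.sample() for _ in range(1000)])  # shape (1000, 4)
```

In an agent, one such sample would typically be added to the actor's output before clipping the action to [-1, 1].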



#### Multi agent

In [None]:
import numpy as np
import random
from collections import namedtuple, deque

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt


def reset_parameters(layers):
    for layer in layers:
        layer.weight.data.uniform_(-3e-3,3e-3)

class Actor(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc_layers=[400,300]):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc_layers (list): Number of nodes in hidden layers
        """
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)
        # Define input and output values for the hidden layers
        dims = [state_size] + fc_layers + [action_size]
        # Create the hidden layers
        self.fc_layers = nn.ModuleList(
            [nn.Linear(dim_in, dim_out) for dim_in, dim_out in zip(dims[:-1], dims[1:])])
        # Initialize the hidden layer weights
        reset_parameters(self.fc_layers)

        #print('Actor network built:', self.fc_layers)

    def forward(self, x):
        """Build an actor (policy) network that maps states -> actions."""
        # Pass the input through every layer except the last, applying ReLU activation
        for layer in self.fc_layers[:-1]:
            x = F.relu(layer(x))
        # Pass the result through the output layer, applying the hyperbolic tangent so actions lie in [-1, 1]
        x = torch.tanh(self.fc_layers[-1](x))
        # Return the action chosen for the input state
        return x


class Critic(nn.Module):
    """Critic (Value) Model."""

    def __init__(self, state_size, action_size, seed, fc_layers=[400,300]):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc_layers (list): Number of nodes in hidden layers
        """
        super(Critic, self).__init__()
        self.seed = torch.manual_seed(seed)
        # Append the output size to the layers' dimensions
        dims = fc_layers + [1]
        # Create a list of layers
        layers_list = []
        layers_list.append(nn.Linear(state_size, dims[0]))
        # The second layer receives the first layer's output concatenated with the action
        layers_list.append(nn.Linear(dims[0] + action_size, dims[1]))
        # Build any remaining layers
        for dim_in, dim_out in zip(dims[1:-1], dims[2:]):
            layers_list.append(nn.Linear(dim_in, dim_out))
        # Store the layers as a ModuleList
        self.fc_layers = nn.ModuleList(layers_list)
        # Initialize the hidden layer weights
        reset_parameters(self.fc_layers)
        # Add batch normalization to the first hidden layer
        self.bn = nn.BatchNorm1d(dims[0])

        #print('Critic network built:', self.fc_layers)

    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        # Pass the states into the first layer
        x = self.fc_layers[0](state)
        x = self.bn(x)
        x = F.relu(x)
        # Concatenate the first layer output with the action
        x = torch.cat((x, action), dim=1)
        # Pass the input through every remaining layer except the last, applying ReLU activation
        for layer in self.fc_layers[1:-1]:
            x = F.relu(layer(x))
        # Pass the result through the output layer, applying sigmoid activation
        # (note: this bounds the Q-value estimate to (0, 1); DDPG critics more commonly use a linear output)
        x = torch.sigmoid(self.fc_layers[-1](x))
        # Return the Q-value for the input state-action pair
        return x

# Replay Buffer Size
BUFFER_SIZE = int(1e6)
# Minibatch Size
BATCH_SIZE = 256
# Discount Gamma
GAMMA = 0.995 
# Soft Update Value
TAU = 1e-2   
# Learning rates for each NN      
LR_ACTOR = 1e-3 
LR_CRITIC = 1e-3
# Update network every X timesteps
UPDATE_EVERY = 32
# Learn from batch of experiences n_experiences times
N_EXPERIENCES = 16   

#NOISE_DECAY = 0.999

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")

class DDPGAgent():
    """Interacts with and learns from the environment using the DDPG algorithm."""

    def __init__(self, state_size, action_size, random_seed):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(random_seed)

        # Actor Neural Network (Regular and target)
        self.actor_regular = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_regular.parameters(), lr=LR_ACTOR)

        # Critic Neural Network (Regular and target)
        self.critic_regular = Critic(state_size, action_size, random_seed).to(device)
        self.critic_target = Critic(state_size, action_size, random_seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_regular.parameters(), lr=LR_CRITIC)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, random_seed)
        
        # Noise process
        #self.noise = OUNoise(action_size, random_seed)
        #self.noise_decay = NOISE_DECAY
          
        # Ensure that the target networks start with the same weights as the regular ones
        # (tau=1.0 turns the soft update into a full copy)
        self.soft_update(self.actor_regular, self.actor_target, 1.0)
        self.soft_update(self.critic_regular, self.critic_target, 1.0)

    def step(self, state, action, reward, next_state, done, timestep):
        """Save experience in replay memory, and use random samples from the buffer to learn."""
        # Save experience / reward
        self.memory.add(state, action, reward, next_state, done)

        # Learn from our buffer if possible
        if len(self.memory) > BATCH_SIZE and timestep % UPDATE_EVERY == 0:
            for _ in range(N_EXPERIENCES):
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, states):
        states = torch.from_numpy(states).float().to(device)
        
        # Switch to evaluation mode so that layers such as BatchNorm or dropout
        # use their inference behaviour instead of their training behaviour.
        self.actor_regular.eval()
        # torch.no_grad() disables gradient tracking inside the block,
        # which reduces memory usage and speeds up the forward pass.
        with torch.no_grad():
            actions = self.actor_regular(states).cpu().data.numpy() #+ self.noise.get_noise()
        # Enable Training mode
        self.actor_regular.train()

        return actions
    
    #def reset(self):
    #    self.noise.reset()

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences
        
        #--------------------------------
        # Update the critic neural network
        #--------------------------------
        
        # Get predicted next-state actions
        actions_next = self.actor_target(next_states)
        # Get Q-values from the target critic, detached from the computation graph
        # so that no gradients flow into the target networks
        Q_targets_next = self.critic_target(next_states, actions_next).detach()

        # Compute Q targets for current states
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Calculate the critic loss
        Q_expected = self.critic_regular(states, actions)
        critic_loss = F.mse_loss(Q_expected, Q_targets)

        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        
        #--------------------------------
        # Update the actor neural network
        #--------------------------------
        
        # Calculate the actor loss
        actions_pred = self.actor_regular(states)
        # Negate the critic value: the optimizer minimizes the loss, while the actor should maximize Q
        actor_loss = -self.critic_regular(states, actions_pred).mean()

        # Minimize the loss function
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update target network using the soft update approach (slowly updating)
        self.soft_update(self.critic_regular, self.critic_target, TAU)
        self.soft_update(self.actor_regular, self.actor_target, TAU)


    def soft_update(self, regular_model, target_model, tau):
        # Update the target network slowly to improve the stability
        for target_param, regular_param in zip(target_model.parameters(), regular_model.parameters()):
            target_param.data.copy_(tau*regular_param.data + (1.0-tau) * target_param.data)
            
    def reset(self):
        # Reset any necessary variables here, such as noise parameters, if any.
        pass

# Adding exploration noise to the actions can speed up training in reinforcement learning, especially in the
# early stages when the agent still needs to discover good strategies: noise encourages exploration of the
# environment, which helps the agent avoid getting stuck in sub-optimal policies.
# In DDPG, an Ornstein-Uhlenbeck process is often used to generate temporally correlated noise, which makes
# exploration more efficient in physical control problems with inertia.
'''
class OUNoise:
    def __init__(self, action_space_size, mu=0.0, theta=0.15, sigma=0.2):
        self.action_space_size = action_space_size
        self.mu = mu
        #self.mu = mu * np.ones(action_space_size)
        self.theta = theta
        self.sigma = sigma
        #self.seed = random.seed(seed)
        #self.reset()
        self.state = np.ones(self.action_space_size) * self.mu

    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        #self.state = copy.copy(self.mu)
        self.state = np.ones(self.action_space_size) * self.mu

    def get_noise(self):
        """Update internal state and return it as a noise sample."""
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(len(x))
        self.state = x + dx
        return self.state

'''    
class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""
    
    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

GOAL = 30.1
SCORE_AVERAGED = 100
PRINT_EVERY = 20
N_EPISODES = 500
MAX_TIMESTEPS = 1000

# reset the environment and extract state and action spaces
env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)
action_size = brain.vector_action_space_size
states = env_info.vector_observations
state_size = states.shape[1]

# Initialize the agents
agent = DDPGAgent(state_size=state_size, action_size=action_size, random_seed=0)

#  Method for training the agent
def train(n_episodes=N_EPISODES):
    scores_deque = deque(maxlen=SCORE_AVERAGED)
    scores = []
    for i_episode in range(1, n_episodes+1):
        #agent.noise.reset()
        states = env.reset(train_mode=True)[brain_name].vector_observations
        score_all_agents = np.zeros(num_agents) 
        for t in range(1, MAX_TIMESTEPS+1):
            
            actions = agent.act(states)
            env_info = env.step(actions)[brain_name] 
            next_states = env_info.vector_observations 
            rewards = env_info.rewards                        
            dones = env_info.local_done 
            ## Store experience of all the agents
            for (state, action, reward, next_state, done) \
                    in zip(states, actions, rewards, next_states, dones):
                agent.step(state, action, reward, next_state, done, t)
            states = next_states
            score_all_agents += rewards
        
        scores_deque.append(np.mean(score_all_agents))
        scores.append(np.mean(score_all_agents))
        
        print('\rEpisode {:3d} \tScore: {:5.2f} \t' \
              'Moving average: {:5.2f}' \
              .format(i_episode,  np.mean(score_all_agents), np.mean(scores_deque)), end="")
        
        if i_episode % PRINT_EVERY == 0:
            torch.save(agent.actor_regular.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_regular.state_dict(), 'checkpoint_critic.pth')
            print('\r\nEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            
        if np.mean(scores_deque) >= GOAL:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode,
                                                                                         np.mean(scores_deque)))
            torch.save(agent.actor_regular.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_regular.state_dict(), 'checkpoint_critic.pth')
            break
    return scores

# Train the agent
scores = train()

Episode 20	Average Score: 2.43
Episode 40	Average Score: 9.00
Episode  42 	Score: 29.30 	Moving average:  9.87
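
The comments in the cell above describe Ornstein-Uhlenbeck noise as temporally correlated. The following is a minimal sketch of what that means, assuming the same discrete update as `get_noise` (`dx = theta * (mu - x) + sigma * N(0, 1)`). The helper names `ou_path` and `lag1_autocorr` are hypothetical and not part of the notebook. The sketch compares the lag-1 autocorrelation of an OU path with that of plain Gaussian noise.

```python
import numpy as np

def ou_path(n_steps, theta=0.15, sigma=0.2, mu=0.0, seed=0):
    """Simulate a 1-D discretised Ornstein-Uhlenbeck path starting at mu."""
    rng = np.random.default_rng(seed)
    path = np.empty(n_steps)
    state = mu
    for t in range(n_steps):
        # Same update rule as OUNoise.get_noise: pull towards mu, then add Gaussian noise
        state = state + theta * (mu - state) + sigma * rng.standard_normal()
        path[t] = state
    return path

def lag1_autocorr(x):
    """Sample correlation between consecutive values of x."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

ou = ou_path(5000)
# Uncorrelated Gaussian noise with the same sigma, for comparison
white = 0.2 * np.random.default_rng(1).standard_normal(5000)

print("OU lag-1 autocorrelation:    {:.2f}".format(lag1_autocorr(ou)))
print("White lag-1 autocorrelation: {:.2f}".format(lag1_autocorr(white)))
```

With this update rule, consecutive OU samples are correlated with a coefficient close to `1 - theta`, while the white noise samples are close to uncorrelated. That persistence is what lets the noise push an action in one direction for several steps in a row.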

In [None]:
# with noise

In [None]:
'''import numpy as np
import random
from collections import namedtuple, deque

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt


def reset_parameters(layers):
    for layer in layers:
        layer.weight.data.uniform_(-3e-3,3e-3)

class Actor(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc_layers=[400,300]):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc_layers (list): Number of nodes in hidden layers
        """
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)
        # Define input and output values for the hidden layers
        dims = [state_size] + fc_layers + [action_size]
        # Create the hidden layers
        self.fc_layers = nn.ModuleList(
            [nn.Linear(dim_in, dim_out) for dim_in, dim_out in zip(dims[:-1], dims[1:])])
        # Initialize the hidden layer weights
        reset_parameters(self.fc_layers)

        #print('Actor network built:', self.fc_layers)

    def forward(self, x):
        """Build an actor (policy) network that maps states -> actions."""
        # Pass the input through all the layers but the last, applying ReLU activation
        for layer in self.fc_layers[:-1]:
            x = F.relu(layer(x))
        # Pass the result through the output layer, applying the hyperbolic tangent function
        x = torch.tanh(self.fc_layers[-1](x))
        # Return the action selected for the input state (bounded to (-1, 1) by tanh)
        return x


class Critic(nn.Module):
    """Critic (Value) Model."""

    def __init__(self, state_size, action_size, seed, fc_layers=[400,300]):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc_layers (list): Number of nodes in hidden layers
        """
        super(Critic, self).__init__()
        self.seed = torch.manual_seed(seed)
        # Append the output size to the layers' dimensions
        dims = fc_layers + [1]
        # Create a list of layers
        layers_list = []
        layers_list.append(nn.Linear(state_size, dims[0]))
        # The second layer receives the first layer output + action
        layers_list.append(nn.Linear(dims[0] + action_size, dims[1]))
        # Build the next layers, if that is the case
        for dim_in, dim_out in zip(dims[1:-1], dims[2:]):
            layers_list.append(nn.Linear(dim_in, dim_out))
        # Store the layers as a ModuleList
        self.fc_layers = nn.ModuleList(layers_list)
        # Initialize the hidden layer weights
        reset_parameters(self.fc_layers)
        # Add batch normalization to the first hidden layer
        self.bn = nn.BatchNorm1d(dims[0])

        #print('Critic network built:', self.fc_layers)

    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        # Pass the states into the first layer
        x = self.fc_layers[0](state)
        x = self.bn(x)
        x = F.relu(x)
        # Concatenate the first layer output with the action
        x = torch.cat((x, action), dim=1)
        # Pass the input through all the layers but the last, applying ReLU activation
        for layer in self.fc_layers[1:-1]:
            x = F.relu(layer(x))
        # Pass the result through the output layer, applying sigmoid activation.
        # Note: the sigmoid bounds the Q-value estimate to (0, 1); DDPG critics
        # more commonly use a linear output layer.
        x = torch.sigmoid(self.fc_layers[-1](x))
        # Return the Q-value for the input state-action pair
        return x

# Replay Buffer Size
BUFFER_SIZE = int(1e6)
# Minibatch Size
BATCH_SIZE = 256
# Discount Gamma
GAMMA = 0.995 
# Soft Update Value
TAU = 1e-2   
# Learning rates for each NN      
LR_ACTOR = 1e-3 
LR_CRITIC = 1e-3
# Update network every X timesteps
UPDATE_EVERY = 32
# Learn from batch of experiences n_experiences times
N_EXPERIENCES = 16   

NOISE_DECAY = 0.999

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")

class DDPGAgent():
    """Interacts with and learns from the environment using the DDPG algorithm."""

    def __init__(self, state_size, action_size, random_seed):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(random_seed)

        # Actor Neural Network (Regular and target)
        self.actor_regular = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_regular.parameters(), lr=LR_ACTOR)

        # Critic Neural Network (Regular and target)
        self.critic_regular = Critic(state_size, action_size, random_seed).to(device)
        self.critic_target = Critic(state_size, action_size, random_seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_regular.parameters(), lr=LR_CRITIC)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, random_seed)
        
        # Noise process
        self.noise = OUNoise(action_size, random_seed)
        self.noise_decay = NOISE_DECAY
          
        # Ensure that each target network starts with the same weights as its
        # regular network (tau=1.0 turns the soft update into a hard copy)
        self.soft_update(self.actor_regular, self.actor_target, 1.0)
        self.soft_update(self.critic_regular, self.critic_target, 1.0)

    def step(self, state, action, reward, next_state, done, timestep):
        """Save experience in replay memory, and use random samples from the buffer to learn."""
        # Save experience / reward
        self.memory.add(state, action, reward, next_state, done)

        # Learn from our buffer if possible
        if len(self.memory) > BATCH_SIZE and timestep % UPDATE_EVERY == 0:
            for _ in range(N_EXPERIENCES):
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, states):
        states = torch.from_numpy(states).float().to(device)
        
        # Evaluation mode: notify all layers that they are in eval mode, so that
        # BatchNorm or dropout layers work in eval mode instead of training mode.
        self.actor_regular.eval()
        # torch.no_grad() disables gradient tracking in the autograd engine,
        # which reduces memory usage and speeds up the computation.
        with torch.no_grad():
            actions = self.actor_regular(states).cpu().data.numpy() + self.noise.get_noise()
        # Enable Training mode
        self.actor_regular.train()

        return actions
    
    def reset(self):
        self.noise.reset()

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences
        
        #--------------------------------
        # Update the critic neural network
        #--------------------------------
        
        # Get predicted next-state actions
        actions_next = self.actor_target(next_states)
        #Get Q values from target model
        #Q_targets_next = self.critic_target(next_states, actions_next)
        
        #Detach tensor from computation graph before performing operations on it:
        Q_targets_next = self.critic_target(next_states, actions_next).detach()

        # Compute Q targets for current states
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Calculate the critic loss
        Q_expected = self.critic_regular(states, actions)
        critic_loss = F.mse_loss(Q_expected, Q_targets)

        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        
        #--------------------------------
        # Update the actor neural network
        #--------------------------------
        
        # Calculate the actor loss
        actions_pred = self.actor_regular(states)
        # Negate the Q-value: the optimizer minimizes the loss, while the actor should maximize Q
        actor_loss = -self.critic_regular(states, actions_pred).mean()

        # Minimize the loss function
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update target network using the soft update approach (slowly updating)
        self.soft_update(self.critic_regular, self.critic_target, TAU)
        self.soft_update(self.actor_regular, self.actor_target, TAU)


    def soft_update(self, regular_model, target_model, tau):
        # Update the target network slowly to improve the stability:
        # target = tau * regular + (1 - tau) * target
        for target_param, regular_param in zip(target_model.parameters(), regular_model.parameters()):
            target_param.data.copy_(tau*regular_param.data + (1.0-tau) * target_param.data)

# Adding exploration noise to the actions can sometimes speed up training in reinforcement
# learning environments, especially in the early stages, when the agent still has to discover
# good strategies. Noise encourages exploration of the environment, which can help the agent
# avoid getting stuck in sub-optimal policies.
# In DDPG, an Ornstein-Uhlenbeck process is often used to generate temporally correlated noise,
# which explores more efficiently in physical control problems with inertia.
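class OUNoise:
    def __init__(self, size, seed, mu=0., theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.seed = random.seed(seed)
        # Initialize the internal state so the process can be sampled right away
        self.reset()

    def reset(self):
        # Reset the internal state (= noise) to the mean (mu)
        self.state = self.mu.copy()

    def get_noise(self):
        # Update the internal state and return it as a noise sample
        # (named get_noise to match the call in DDPGAgent.act)
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.standard_normal(size=x.shape)
        self.state = x + dx
        return self.state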
    
class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""
    
    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

GOAL = 30.1
SCORE_AVERAGED = 100
PRINT_EVERY = 10
N_EPISODES = 500
MAX_TIMESTEPS = 1000

# reset the environment and extract state and action spaces
env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)
action_size = brain.vector_action_space_size
states = env_info.vector_observations
state_size = states.shape[1]

# Initialize the agents
agent = DDPGAgent(state_size=state_size, action_size=action_size, random_seed=0)

#  Method for training the agent
def train(n_episodes=N_EPISODES):
    scores_deque = deque(maxlen=SCORE_AVERAGED)
    scores = []
    for i_episode in range(1, n_episodes+1):
        agent.noise.reset()
        states = env.reset(train_mode=True)[brain_name].vector_observations
        score_all_agents = np.zeros(num_agents) 
        for t in range(1, MAX_TIMESTEPS+1):
            
            actions = agent.act(states)
            env_info = env.step(actions)[brain_name] 
            next_states = env_info.vector_observations 
            rewards = env_info.rewards                        
            dones = env_info.local_done 
            ## Store experience of all the agents
            for (state, action, reward, next_state, done) \
                    in zip(states, actions, rewards, next_states, dones):
                agent.step(state, action, reward, next_state, done, t)
            states = next_states
            score_all_agents += rewards
        
        scores_deque.append(np.mean(score_all_agents))
        scores.append(np.mean(score_all_agents))
        
        print('\rEpisode {:3d} \tScore: {:5.2f} \t' \
              'Moving average: {:5.2f}' \
              .format(i_episode,  np.mean(score_all_agents), np.mean(scores_deque)), end="")
        
        if i_episode % PRINT_EVERY == 0:
            torch.save(agent.actor_regular.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_regular.state_dict(), 'checkpoint_critic.pth')
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))   
            
        if np.mean(scores_deque) >= GOAL:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode,
                                                                                         np.mean(scores_deque)))
            torch.save(agent.actor_regular.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_regular.state_dict(), 'checkpoint_critic.pth')
            
            break
    return scores

# Train the agent
scores = train()'''
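
The `soft_update` method blends each target parameter towards its regular counterpart with `target = tau * regular + (1 - tau) * target`. The sketch below illustrates that arithmetic only. Plain NumPy arrays stand in for network parameters (an assumption made to keep the example small), and the argument order `(regular, target, tau)` mirrors the method in the notebook.

```python
import numpy as np

def soft_update(regular, target, tau):
    """Blend target towards regular in place: target <- tau*regular + (1-tau)*target."""
    target[:] = tau * regular + (1.0 - tau) * target

regular = np.array([1.0, 2.0, 3.0])  # stand-in for the regular network's weights
TAU = 1e-2                           # same value as the TAU constant in the notebook

# With tau = 1.0 the update is a hard copy of the regular weights
target = np.zeros(3)
soft_update(regular, target, tau=1.0)
hard_copy = target.copy()

# With a small tau the gap to the regular weights shrinks by (1 - tau) per update
target = np.zeros(3)
for _ in range(100):
    soft_update(regular, target, TAU)
remaining = 1.0 - target / regular   # fraction of the initial gap still left

print("hard copy:", hard_copy)
print("gap left after 100 updates:", remaining)
```

Because the gap shrinks geometrically, the target networks trail the regular networks by many updates when `tau` is small, which is the slowly moving target that the comment in `soft_update` refers to.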

In [None]:
#Plot scores of trained agent
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
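
The training loop treats the environment as solved once the mean of the last `SCORE_AVERAGED` episode scores reaches `GOAL`. As a rough sketch of that criterion applied to a finished `scores` list, the hypothetical helpers below (`moving_average` and `first_solved_episode` are not part of the notebook) compute the same trailing mean. The demo scores are made-up numbers used only to show the mechanics.

```python
import numpy as np

def moving_average(scores, window=100):
    """Mean of the last `window` scores at each episode (fewer while warming up)."""
    scores = np.asarray(scores, dtype=float)
    averaged = np.empty_like(scores)
    for i in range(len(scores)):
        # Same quantity as np.mean(scores_deque) with deque(maxlen=window)
        averaged[i] = scores[max(0, i - window + 1): i + 1].mean()
    return averaged

def first_solved_episode(scores, goal=30.1, window=100):
    """1-based episode at which the moving average first reaches goal, or None."""
    hits = np.nonzero(moving_average(scores, window) >= goal)[0]
    return int(hits[0]) + 1 if hits.size else None

demo_scores = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up values for illustration only
print(moving_average(demo_scores, window=2))
print(first_solved_episode(demo_scores, goal=30.1, window=2))
```

The array returned by `moving_average(scores)` can be passed to `plt.plot` alongside the raw scores to make the trend easier to read than the per-episode values alone.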

## Test the trained weights

In [14]:
import numpy as np
import random
from collections import namedtuple, deque

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt


def reset_parameters(layers):
    for layer in layers:
        layer.weight.data.uniform_(-3e-3,3e-3)

class Actor(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc_layers=[400,300]):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc_layers (list): Number of nodes in hidden layers
        """
        super(Actor, self).__init__()
        self.seed = torch.manual_seed(seed)
        # Define input and output values for the hidden layers
        dims = [state_size] + fc_layers + [action_size]
        # Create the hidden layers
        self.fc_layers = nn.ModuleList(
            [nn.Linear(dim_in, dim_out) for dim_in, dim_out in zip(dims[:-1], dims[1:])])
        # Initialize the hidden layer weights
        reset_parameters(self.fc_layers)

        #print('Actor network built:', self.fc_layers)

    def forward(self, x):
        """Build an actor (policy) network that maps states -> actions."""
        # Pass the input through all the layers but the last, applying ReLU activation
        for layer in self.fc_layers[:-1]:
            x = F.relu(layer(x))
        # Pass the result through the output layer, applying the hyperbolic tangent function
        x = torch.tanh(self.fc_layers[-1](x))
        # Return the action selected for the input state (bounded to (-1, 1) by tanh)
        return x


class Critic(nn.Module):
    """Critic (Value) Model."""

    def __init__(self, state_size, action_size, seed, fc_layers=[400,300]):
        """Initialize parameters and build model.
        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc_layers (list): Number of nodes in hidden layers
        """
        super(Critic, self).__init__()
        self.seed = torch.manual_seed(seed)
        # Append the output size to the layers' dimensions
        dims = fc_layers + [1]
        # Create a list of layers
        layers_list = []
        layers_list.append(nn.Linear(state_size, dims[0]))
        # The second layer receives the first layer output + action
        layers_list.append(nn.Linear(dims[0] + action_size, dims[1]))
        # Build the next layers, if that is the case
        for dim_in, dim_out in zip(dims[1:-1], dims[2:]):
            layers_list.append(nn.Linear(dim_in, dim_out))
        # Store the layers as a ModuleList
        self.fc_layers = nn.ModuleList(layers_list)
        # Initialize the hidden layer weights
        reset_parameters(self.fc_layers)
        # Add batch normalization to the first hidden layer
        self.bn = nn.BatchNorm1d(dims[0])

        #print('Critic network built:', self.fc_layers)

    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        # Pass the states into the first layer
        x = self.fc_layers[0](state)
        x = self.bn(x)
        x = F.relu(x)
        # Concatenate the first layer output with the action
        x = torch.cat((x, action), dim=1)
        # Pass the input through all the layers but the last, applying ReLU activation
        for layer in self.fc_layers[1:-1]:
            x = F.relu(layer(x))
        # Pass the result through the output layer, applying sigmoid activation.
        # Note: the sigmoid bounds the Q-value estimate to (0, 1); DDPG critics
        # more commonly use a linear output layer.
        x = torch.sigmoid(self.fc_layers[-1](x))
        # Return the Q-value for the input state-action pair
        return x

# Replay Buffer Size
BUFFER_SIZE = int(1e6)
# Minibatch Size
BATCH_SIZE = 256
# Discount Gamma
GAMMA = 0.995 
# Soft Update Value
TAU = 1e-2   
# Learning rates for each NN      
LR_ACTOR = 1e-3 
LR_CRITIC = 1e-3
# Update network every X timesteps
UPDATE_EVERY = 32
# Learn from batch of experiences n_experiences times
N_EXPERIENCES = 16   

#NOISE_DECAY = 0.999

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")

class DDPGAgent():
    """Interacts with and learns from the environment using the DDPG algorithm."""

    def __init__(self, state_size, action_size, random_seed):
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(random_seed)

        # Actor Neural Network (Regular and target)
        self.actor_regular = Actor(state_size, action_size, random_seed).to(device)
        self.actor_target = Actor(state_size, action_size, random_seed).to(device)
        self.actor_optimizer = optim.Adam(self.actor_regular.parameters(), lr=LR_ACTOR)

        # Critic Neural Network (Regular and target)
        self.critic_regular = Critic(state_size, action_size, random_seed).to(device)
        self.critic_target = Critic(state_size, action_size, random_seed).to(device)
        self.critic_optimizer = optim.Adam(self.critic_regular.parameters(), lr=LR_CRITIC)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, random_seed)
        
        # Noise process
        #self.noise = OUNoise(action_size, random_seed)
        #self.noise_decay = NOISE_DECAY
          
        # Ensure that each target network starts with the same weights as its
        # regular network (tau=1.0 turns the soft update into a hard copy)
        self.soft_update(self.actor_regular, self.actor_target, 1.0)
        self.soft_update(self.critic_regular, self.critic_target, 1.0)

    def step(self, state, action, reward, next_state, done, timestep):
        """Save experience in replay memory, and use random samples from the buffer to learn."""
        # Save experience / reward
        self.memory.add(state, action, reward, next_state, done)

        # Learn from our buffer if possible
        if len(self.memory) > BATCH_SIZE and timestep % UPDATE_EVERY == 0:
            for _ in range(N_EXPERIENCES):
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, states):
        states = torch.from_numpy(states).float().to(device)
        
        # Evaluation mode: notify all layers that they are in eval mode, so that
        # BatchNorm or dropout layers work in eval mode instead of training mode.
        self.actor_regular.eval()
        # torch.no_grad() disables gradient tracking in the autograd engine,
        # which reduces memory usage and speeds up the computation.
        with torch.no_grad():
            actions = self.actor_regular(states).cpu().data.numpy() #+ self.noise.get_noise()
        # Enable Training mode
        self.actor_regular.train()

        return actions
    
    #def reset(self):
    #    self.noise.reset()

    def learn(self, experiences, gamma):
        states, actions, rewards, next_states, dones = experiences
        
        #--------------------------------
        # Update the critic neural network
        #--------------------------------
        
        # Get predicted next-state actions
        actions_next = self.actor_target(next_states)
        #Get Q values from target model
        #Q_targets_next = self.critic_target(next_states, actions_next)
        
        #Detach tensor from computation graph before performing operations on it:
        Q_targets_next = self.critic_target(next_states, actions_next).detach()

        # Compute Q targets for current states
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Calculate the critic loss
        Q_expected = self.critic_regular(states, actions)
        critic_loss = F.mse_loss(Q_expected, Q_targets)

        # Minimize the loss
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        
        #--------------------------------
        # Update the actor neural network
        #--------------------------------
        
        # Calculate the actor loss
        actions_pred = self.actor_regular(states)
        # Negate the Q-value: the optimizer minimizes the loss, while the actor should maximize Q
        actor_loss = -self.critic_regular(states, actions_pred).mean()

        # Minimize the loss function
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update target network using the soft update approach (slowly updating)
        self.soft_update(self.critic_regular, self.critic_target, TAU)
        self.soft_update(self.actor_regular, self.actor_target, TAU)


    def soft_update(self, regular_model, target_model, tau):
        # Update the target network slowly to improve the stability
        for target_param, regular_param in zip(target_model.parameters(), regular_model.parameters()):
            target_param.data.copy_(tau*regular_param.data + (1.0-tau) * target_param.data)
            
    def reset(self):
        # Reset any necessary variables here, like noise parameters if you have.
        pass

# Adding exploration noise to the actions can sometimes speed up training in reinforcement
# learning environments, especially in the early stages, when the agent still has to discover
# good strategies. Noise encourages exploration of the environment, which can help the agent
# avoid getting stuck in sub-optimal policies.
# In DDPG, an Ornstein-Uhlenbeck process is often used to generate temporally correlated noise,
# which explores more efficiently in physical control problems with inertia.
'''
class OUNoise:
    def __init__(self, action_space_size, mu=0.0, theta=0.15, sigma=0.2):
        self.action_space_size = action_space_size
        self.mu = mu
        #self.mu = mu * np.ones(action_space_size)
        self.theta = theta
        self.sigma = sigma
        #self.seed = random.seed(seed)
        #self.reset()
        self.state = np.ones(self.action_space_size) * self.mu

    def reset(self):
        """Reset the internal state (= noise) to mean (mu)."""
        #self.state = copy.copy(self.mu)
        self.state = np.ones(self.action_space_size) * self.mu

    def get_noise(self):
        """Update internal state and return it as a noise sample."""
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(len(x))
        self.state = x + dx
        return self.state

'''    
class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""
    
    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)  # internal memory (deque)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).float().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)

GOAL = 30.1
SCORE_AVERAGED = 100
PRINT_EVERY = 20
N_EPISODES = 500
MAX_TIMESTEPS = 1000

# reset the environment and extract state and action spaces
env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)
action_size = brain.vector_action_space_size
states = env_info.vector_observations
state_size = states.shape[1]

# Initialize the agents
agent = DDPGAgent(state_size=state_size, action_size=action_size, random_seed=0)

# Method for training the agent
def train(n_episodes=N_EPISODES):
    scores_deque = deque(maxlen=SCORE_AVERAGED)
    scores = []
    for i_episode in range(1, n_episodes+1):  # honor the n_episodes argument
        #agent.noise.reset()
        states = env.reset(train_mode=True)[brain_name].vector_observations
        score_all_agents = np.zeros(num_agents) 
        for t in range(1, MAX_TIMESTEPS+1):
            
            actions = agent.act(states)
            env_info = env.step(actions)[brain_name] 
            next_states = env_info.vector_observations 
            rewards = env_info.rewards                        
            dones = env_info.local_done 
            ## Store experience of all the agents
            for (state, action, reward, next_state, done) \
                    in zip(states, actions, rewards, next_states, dones):
                agent.step(state, action, reward, next_state, done, t)
            states = next_states
            score_all_agents += rewards
        
        scores_deque.append(np.mean(score_all_agents))
        scores.append(np.mean(score_all_agents))
        
        print('\rEpisode {:3d} \tScore: {:5.2f} \t' \
              'Moving average: {:5.2f}' \
              .format(i_episode,  np.mean(score_all_agents), np.mean(scores_deque)), end="")
        
        if i_episode % PRINT_EVERY == 0:
            torch.save(agent.actor_regular.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_regular.state_dict(), 'checkpoint_critic.pth')
            print('\r\nEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            
        if np.mean(scores_deque) >= GOAL:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode,
                                                                                         np.mean(scores_deque)))
            torch.save(agent.actor_regular.state_dict(), 'checkpoint_actor.pth')
            torch.save(agent.critic_regular.state_dict(), 'checkpoint_critic.pth')
            break
    return scores

In [16]:
agent = DDPGAgent(state_size, action_size, random_seed=0)
agent.actor_regular.load_state_dict(torch.load('checkpoint_actor.pth'))
agent.critic_regular.load_state_dict(torch.load('checkpoint_critic.pth'))

<All keys matched successfully>

In [18]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)

while True:
    actions = agent.act(states)           ## select actions from DDPG policy
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                              # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

Total score (averaged over agents) this episode: 25.855999422073364


## REPORT

### Key Points to Highlight:
- Model Architecture
- Training Process
- Hyperparameters

#### Model Architecture:

Actor Network:

- Receives states as input and produces continuous actions.
- Employs fully connected layers with ReLU activation, except for the output layer which utilizes hyperbolic tangent activation to constrain action values within a defined range.
- Batch normalization is not included in the actor network architecture.
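
The actor described above can be sketched roughly as follows. This is a minimal illustration built only from the report's description, not the notebook's actual class: the name `ActorSketch` and the state and action sizes used in the usage lines are assumptions, and the 400/300 hidden sizes, the `ModuleList`, and the uniform initialization range are taken from the "Shared Characteristics" list.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActorSketch(nn.Module):
    """Illustrative actor (hypothetical name): maps a state to actions in [-1, 1]."""

    def __init__(self, state_size, action_size, hidden_sizes=(400, 300)):
        super().__init__()
        sizes = [state_size, *hidden_sizes]
        # Hidden layers are kept in a ModuleList so the depth is configurable.
        self.hidden = nn.ModuleList(
            [nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        )
        self.out = nn.Linear(sizes[-1], action_size)
        # Uniform weight initialization in [-3e-3, 3e-3] for every layer.
        for layer in [*self.hidden, self.out]:
            nn.init.uniform_(layer.weight, -3e-3, 3e-3)

    def forward(self, state):
        x = state
        for layer in self.hidden:
            x = F.relu(layer(x))  # ReLU on hidden layers, no batch normalization
        return torch.tanh(self.out(x))  # tanh bounds each action component


# Sizes below are chosen for illustration only.
actor = ActorSketch(state_size=33, action_size=4)
actions = actor(torch.randn(8, 33))  # a batch of 8 states
```

Because of the final `tanh`, every action component lies in [-1, 1] regardless of the input.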

Critic Network:

- Evaluates action quality based on both states and actions.
- Utilizes fully connected layers with ReLU activation, culminating in a sigmoid activation for Q-value output.
- Incorporates batch normalization in the first hidden layer to enhance stability during training.
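
A rough sketch of a critic matching this description is given below. It is an illustration under assumptions rather than the notebook's code: the class name `CriticSketch`, the point where the action is concatenated (after the first hidden layer), and the placement of batch normalization before the ReLU are all guesses. The sigmoid output follows the report; note that it bounds the Q-value estimate to (0, 1), whereas many DDPG implementations use a linear output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CriticSketch(nn.Module):
    """Illustrative critic (hypothetical name): maps (state, action) to a Q-value."""

    def __init__(self, state_size, action_size, hidden_sizes=(400, 300)):
        super().__init__()
        h1, h2 = hidden_sizes
        self.fc1 = nn.Linear(state_size, h1)
        self.bn1 = nn.BatchNorm1d(h1)  # batch norm on the first hidden layer
        # Assumption: the action joins the state features at the second layer.
        self.fc2 = nn.Linear(h1 + action_size, h2)
        self.out = nn.Linear(h2, 1)

    def forward(self, state, action):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return torch.sigmoid(self.out(x))  # sigmoid output, as stated in the report


# Sizes below are chosen for illustration only.
critic = CriticSketch(state_size=33, action_size=4)
q_values = critic(torch.randn(8, 33), torch.rand(8, 4) * 2 - 1)
```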

Shared Characteristics:

- Both networks consist of two hidden layers with 400 and 300 units, respectively.
- Weight initialization for all layers follows a uniform distribution within the range [-3e-3, 3e-3].
- Both networks feature regular and target versions, with target networks updated gradually using a soft update approach.
- PyTorch's ModuleList is utilized for dynamic layer management in both networks.
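
The soft update mentioned above blends a small fraction `tau` of the regular network's weights into the target network on each update. A minimal sketch, assuming a helper named `soft_update` (the notebook's own function may differ in name and signature):

```python
import torch
import torch.nn as nn


def soft_update(regular, target, tau):
    """In place: target = tau * regular + (1 - tau) * target, per parameter."""
    with torch.no_grad():
        for t_param, r_param in zip(target.parameters(), regular.parameters()):
            t_param.mul_(1.0 - tau).add_(r_param, alpha=tau)


# Two small stand-in networks with the same shape.
regular = nn.Linear(3, 2)
target = nn.Linear(3, 2)
before = target.weight.detach().clone()

soft_update(regular, target, tau=1e-2)  # TAU value from the hyperparameter list
```

With `tau = 1e-2`, each call moves the target weights 1% of the way toward the regular weights, so the target network trails the regular one smoothly.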

#### Training Process

- Replay Buffer: Experiences (state, action, reward, next_state, done) are stored and sampled randomly for training stability.

- Learning: The agent learns from experiences sampled from the replay buffer. The critic minimizes the mean squared error between predicted and target Q-values, while the actor maximizes Q-values predicted by the critic for selected actions.
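
One learning step of this kind can be sketched as below. This follows the standard DDPG form (a TD target computed from the target networks, then a gradient step for the critic and one for the actor) and is an assumption about how the notebook's `learn()` is structured. The tiny networks, the helper `q`, and the random batch are stand-ins for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
state_size, action_size, batch, gamma = 6, 2, 16, 0.995

# Tiny stand-in networks; the report's networks are larger.
actor = nn.Sequential(nn.Linear(state_size, action_size), nn.Tanh())
actor_target = nn.Sequential(nn.Linear(state_size, action_size), nn.Tanh())
critic = nn.Linear(state_size + action_size, 1)
critic_target = nn.Linear(state_size + action_size, 1)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)


def q(net, s, a):
    """Evaluate a critic on concatenated (state, action) pairs."""
    return net(torch.cat([s, a], dim=1))


# Random batch standing in for ReplayBuffer.sample().
states = torch.randn(batch, state_size)
actions = torch.rand(batch, action_size) * 2 - 1
rewards = torch.rand(batch, 1)
next_states = torch.randn(batch, state_size)
dones = torch.zeros(batch, 1)

# Critic update: minimize MSE between predicted and target Q-values.
with torch.no_grad():
    next_actions = actor_target(next_states)
    q_targets = rewards + gamma * q(critic_target, next_states, next_actions) * (1 - dones)
critic_loss = F.mse_loss(q(critic, states, actions), q_targets)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Actor update: maximize the critic's Q-value for the actor's own actions,
# implemented as minimizing its negative mean.
actor_loss = -q(critic, states, actor(states)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

The `(1 - dones)` factor removes the bootstrapped term for transitions that ended an episode.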

#### Hyperparameters:
- Buffer Size: 1e6
- Batch Size: 256
- Discount Factor (Gamma): 0.995
- Soft Update (TAU): 1e-2
- Learning Rates: 1e-3 for both actor and critic
- Update Frequency: Every 32 timesteps
- Number of Experiences: 16
- Goal Score: 30.1
- Print Frequency: Every 20 episodes
- Max Timesteps per Episode: 1000
- Number of Episodes: 500

#### Future developments
Noise: The exploration noise implementation (an Ornstein-Uhlenbeck process) is currently commented out. Enabling it, with a decay parameter that shrinks the noise as training progresses, could improve exploration and speed up learning, especially in the earliest stages.
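
For illustration, a decaying Ornstein-Uhlenbeck noise process might look like the sketch below. The class name `DecayingOUNoise`, the parameter values (`theta=0.15`, `sigma=0.2`, `decay=0.99`), and the choice to decay once per `reset()` are assumptions, not the notebook's commented-out implementation.

```python
import random


class DecayingOUNoise:
    """Hypothetical Ornstein-Uhlenbeck noise whose scale shrinks on every reset."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, decay=0.99, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.decay = decay
        self.scale = 1.0                  # multiplier applied to every sample
        self.rng = random.Random(seed)    # private RNG for reproducibility
        self.state = [mu] * size

    def reset(self):
        """Return to the mean and reduce the noise scale (e.g. once per episode)."""
        self.state = [self.mu] * len(self.state)
        self.scale *= self.decay

    def sample(self):
        """Advance the process one step and return the scaled noise vector."""
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return [self.scale * x for x in self.state]


noise = DecayingOUNoise(size=4)   # one noise value per action component
sample = noise.sample()           # would be added to the actor's output
noise.reset()                     # start of the next episode
```

The sampled vector would be added to the actor's output and the result clipped to the valid action range before being sent to the environment.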

Target Network Updates: In the learn() method of the DDPGAgent class, the target networks (actor_target and critic_target) are soft-updated every time learn() is called. Less frequent updates can help stabilize training, so we could consider updating the target networks only periodically, for example once every few learn() calls.
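
One simple way to make the target updates periodic is a call counter, sketched below. `PeriodicTrigger` and the interval of 4 are hypothetical and would need tuning. They are not part of the notebook.

```python
class PeriodicTrigger:
    """Hypothetical helper: returns True once every `every` calls."""

    def __init__(self, every):
        self.every = every
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return self.calls % self.every == 0


update_targets = PeriodicTrigger(every=4)

# Simulate eight learn() calls and record when the targets would be updated.
fired = [update_targets() for _ in range(8)]

# Inside learn(), the target update would then be guarded like this:
#     if update_targets():
#         ...soft-update actor_target and critic_target here...
```

In this simulation the targets would be updated on the fourth and eighth calls only.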

In [None]:
# Close the environment
env.close()