# Collaboration and Competition

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the third project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Tennis.app"`
- **Windows** (x86): `"path/to/Tennis_Windows_x86/Tennis.exe"`
- **Windows** (x86_64): `"path/to/Tennis_Windows_x86_64/Tennis.exe"`
- **Linux** (x86): `"path/to/Tennis_Linux/Tennis.x86"`
- **Linux** (x86_64): `"path/to/Tennis_Linux/Tennis.x86_64"`
- **Linux** (x86, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86"`
- **Linux** (x86_64, headless): `"path/to/Tennis_Linux_NoVis/Tennis.x86_64"`

For instance, if you are using a Mac, then you downloaded `Tennis.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Tennis.app")
```
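
If the notebook is shared across machines, the table above can be turned into a small helper. The following is only a sketch: the function name `tennis_file_name` is hypothetical, and `path/to` remains a placeholder for wherever the environment was unpacked.

```python
def tennis_file_name(system, is_64bit, headless=False):
    """Return the placeholder path from the list above for a given platform.

    `system` follows the values returned by platform.system():
    "Darwin" (Mac), "Windows" or "Linux".
    """
    if system == "Darwin":
        return "path/to/Tennis.app"
    if system == "Windows":
        folder = "Tennis_Windows_x86_64" if is_64bit else "Tennis_Windows_x86"
        return f"path/to/{folder}/Tennis.exe"
    if system == "Linux":
        folder = "Tennis_Linux_NoVis" if headless else "Tennis_Linux"
        binary = "Tennis.x86_64" if is_64bit else "Tennis.x86"
        return f"path/to/{folder}/{binary}"
    raise ValueError(f"unsupported platform: {system}")


print(tennis_file_name("Linux", True, headless=True))
```

In practice the first argument could come from `platform.system()`; the result would then be passed as `file_name` to `UnityEnvironment`.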

In [2]:
env = UnityEnvironment(file_name="../unity_ml_envs/Tennis_Windows_x86_64/Tennis.exe")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: TennisBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 8
        Number of stacked Vector Observation: 3
        Vector Action space type: continuous
        Vector Action space size (per agent): 2
        Vector Action descriptions: , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1.  If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01.  Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping. 

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents 
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 2
Size of each action: 2
There are 2 agents. Each observes a state with length: 24
The state for the first agent looks like: [ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.         -6.65278625 -1.5
 -0.          0.          6.83172083  6.         -0.          0.        ]
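
The state length of 24 follows from the environment summary printed earlier: 8 values per observation, stacked 3 times. The sketch below splits the printed state into frames. It assumes the frames are stored contiguously with the most recent one last, which the leading zeros suggest but the output does not confirm.

```python
import numpy as np

# First agent's initial state, copied from the output above.
state = np.array([
    0., 0., 0., 0., 0., 0., 0., 0.,
    0., 0., 0., 0., 0., 0., 0., 0.,
    -6.65278625, -1.5, -0., 0., 6.83172083, 6., -0., 0.,
])

FRAME_SIZE = 8    # "Vector Observation space size (per agent)"
NUM_STACKED = 3   # "Number of stacked Vector Observation"

# ASSUMPTION: frames are laid out one after another inside the 24-value vector.
frames = state.reshape(NUM_STACKED, FRAME_SIZE)
for k, frame in enumerate(frames):
    print(f"frame {k}: {frame}")
```

Under that assumption, the first two frames are empty padding at reset and only the last one carries real measurements.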


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agents and receive feedback from the environment.

Once this cell is executed, a window should pop up in which you can watch the agents as they select random actions at each time step.

Of course, as part of the project, you'll have to change the code so that the agents are able to use their experiences to gradually choose better actions when interacting with the environment!

In [5]:
n_games = 10
for i in range(1, n_games):                                # play n_games - 1 episodes (numbered 1 to 9)
    env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
    states = env_info.vector_observations                  # get the current state (for each agent)
    scores = np.zeros(num_agents)                          # initialize the score (for each agent)
    while True:
        actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
        actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
        env_info = env.step(actions)[brain_name]           # send all actions to the environment
        next_states = env_info.vector_observations         # get next state (for each agent)
        rewards = env_info.rewards                         # get reward (for each agent)
        if np.any(np.asarray(rewards) != 0):               # print only when some agent got a non-zero reward
            print(f"Rewards: {rewards}")
        dones = env_info.local_done                        # see if episode finished
        scores += env_info.rewards                         # update the score (for each agent)
        states = next_states                               # roll over states to next time step
        if np.any(dones):                                  # exit loop if episode finished
            break
    # print(f"Rewards: {rewards}")
    print('Rewards: {}. Score (max over agents) from episode {}: {}'.format(rewards, i, np.max(scores)))

Rewards: [0.0, -0.009999999776482582]
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 1: 0.0
Rewards: [0.0, -0.009999999776482582]
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 2: 0.0
Rewards: [0.0, 0.10000000149011612]
Rewards: [0.0, -0.009999999776482582]
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 3: 0.09000000171363354
Rewards: [0.0, -0.009999999776482582]
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 4: 0.0
Rewards: [0.0, -0.009999999776482582]
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 5: 0.0
Rewards: [-0.009999999776482582, 0.0]
Rewards: [-0.009999999776482582, 0.0]. Score (max over agents) from episode 6: 0.0
Rewards: [0.0, -0.009999999776482582]
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 7: 0.0
Rewards: [-0.009999999776482582, 0.0]
Rewards: [-0.009999999776482582, 0.0]. Score (max over agents) 
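
The episode score in the output above is the maximum over agents of each agent's summed rewards. As a quick check, this sketch recomputes the score of episode 3 from the two non-zero reward events printed for it (all other steps contributed zero).

```python
import numpy as np

# Non-zero reward events printed for episode 3, as [agent 0, agent 1].
episode_3_rewards = [
    [0.0, 0.10000000149011612],     # agent 1 hits the ball over the net
    [0.0, -0.009999999776482582],   # agent 1 then loses the point
]

scores = np.sum(episode_3_rewards, axis=0)   # per-agent total for the episode
episode_score = np.max(scores)               # "Score (max over agents)"
print(scores, episode_score)
```

The result is roughly 0.09, which matches the value logged for episode 3. Episodes in which one agent only receives -0.01 score 0.0, because the other agent's total of zero is the maximum.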

When finished, you can close the environment.

In [6]:
env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

# Solving the environment with Proximal Policy Optimization (PPO)


This PPO implementation was originally written for the multi-agent Reacher environment. That environment was neither a collaboration nor a competition setting: each agent performed an independent task.


## Multi-Agent PPO
From the perspective of each player, it does not matter whether the opponent is a PPO agent or another kind of player. We can therefore let the same policy play against itself, as long as the two agents' observations are not mirrored relative to each other (that would confuse the learning agent).

- **Rewards** are given relative to each player, so they need no modification.
- **Environment observations** need to have the same layout for both agents, expressed relative to each agent rather than in absolute court coordinates, so this needs to be investigated.

# TODO: investigate whether the observation vector is symmetrical between the players


Here are the assumptions:
* Each agent's observation has a length of 24 (3 stacked frames of 8 values, according to the environment summary above)
* The observation state contains values about:
    * The position and velocity of the ball
    * The position and velocity of the agent
    * The position and velocity of the opponent
* The horizontal components of the state could be multiplied by -1 to mirror the court about its vertical axis:
    * horizontal position and velocity of the ball
    * horizontal position and velocity of the agent
    * horizontal position and velocity of the opponent
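
If the investigation shows that mirroring is needed, the flip described in the last assumption could look like the sketch below. The index list `HORIZONTAL_IDX` is purely hypothetical: which entries of each 8-value frame hold horizontal quantities has not been verified against the environment.

```python
import numpy as np

FRAME_SIZE = 8    # values per observation frame
NUM_STACKED = 3   # stacked frames per state

# HYPOTHETICAL layout: these positions are a guess, not a documented fact.
HORIZONTAL_IDX = np.array([0, 2, 4, 6])


def flip_horizontal(states, horizontal_idx=HORIZONTAL_IDX):
    """Return a copy of `states` with the assumed horizontal entries negated."""
    states = np.asarray(states, dtype=np.float64)
    frames = states.reshape(-1, NUM_STACKED, FRAME_SIZE).copy()
    frames[:, :, horizontal_idx] *= -1.0
    return frames.reshape(states.shape)


states = np.arange(48, dtype=np.float64).reshape(2, 24)
flipped = flip_horizontal(states)
print(flipped[0, :8])
```

Whatever the real indices turn out to be, flipping twice should return the original state, which makes a cheap sanity check for the final implementation.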



## Hyperparameters

In [11]:
# Training Hyperparameters
EPISODES = 1000         # Number of episodes to train for
# MAX_T = 2048          # Max length of trajectory
MAX_T = 1000            # Max length of trajectory
SGD_EPOCHS = 4          # Number of gradient descent steps per batch of experiences
BATCH_SIZE = 32         # minibatch size
BETA = 0.01             # entropy regularization parameter
GRADIENT_CLIP = 5       # gradient clipping parameter

# optimizer parameters
# LR = 5e-4               # learning rate
LR = 1e-4               # learning rate
# EPSILON = 1e-5          # optimizer epsilon
WEIGHT_DECAY=1.E-4      # L2 weight decay

# PPO parameters
GAMMA = 0.99            # Discount factor
TAU = 0.95              # GAE parameter
PPO_CLIP_EPSILON = 0.1  # ppo clip parameter

## Actor and Critic networks

In [8]:


# Agent and models
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# # Agent hyperparameters
# BATCH_SIZE = 32         # minibatch size
# GAMMA = 0.99            # Discount factor
# TAU = 0.95              # GAE parameter
# BETA = 0.01             # entropy regularization parameter
# PPO_CLIP_EPSILON = 0.2  # ppo clip parameter
# GRADIENT_CLIP = 5       # gradient clipping parameter


class Actor(nn.Module):
    def __init__(self, state_size, action_size, hidden_size=64):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, action_size)
        
    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x)) # squash the mean into the action range [-1.0, +1.0]; F.tanh is deprecated
    
    
class Critic(nn.Module):
    def __init__(self, state_size, value_size=1, hidden_size=64):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, value_size)
        
    def forward(self, states):
        x = F.relu(self.fc1(states))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
    
    
class ActorCritic(nn.Module):
    def __init__(self, state_size, action_size, value_size=1, hidden_size=64, std=0.1):
        super(ActorCritic, self).__init__()
        self.actor = Actor(state_size, action_size, hidden_size)
        self.critic = Critic(state_size, value_size, hidden_size)
        
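        # NOTE: despite its name, this parameter is initialized to `std` itself (not log(std)),
        # so the initial standard deviation used in forward() is exp(std)
        self.log_std = nn.Parameter(torch.ones(1, action_size)*std)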
        
    def forward(self, states):
        obs = torch.FloatTensor(states)
        
        # Critic
        values = self.critic(obs)
        
        # Actor
        mu = self.actor(obs)
        std = self.log_std.exp().expand_as(mu)
        dist = torch.distributions.Normal(mu, std)
        
        return dist, values
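
One detail of `ActorCritic` is worth a quick calculation: `log_std` is filled with the value of `std`, and `forward()` then exponentiates it. The sketch below only reproduces that arithmetic for the default `std=0.1`; the variable names are illustrative.

```python
import math

std_arg = 0.1                            # default `std` passed to ActorCritic
log_std_init = std_arg                   # value stored in self.log_std
effective_std = math.exp(log_std_init)   # what forward() feeds to Normal(mu, std)

print(f"effective initial std: {effective_std:.4f}")
print(f"log_std that would give std=0.1: {math.log(std_arg):.4f}")
```

So the policy starts with a standard deviation of about 1.1 rather than 0.1. Whether that is intended is not stated in the notebook; initializing with `math.log(std)` would be the alternative if 0.1 was the goal.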
    


## Agent

In [9]:
    
class Agent():
    def __init__(self, num_agents, state_size, action_size,
                 batch_size=32,
                 weight_decay=1.e-4,
                 gradient_clip=5,
                 std=0.1,
                 value_size=1,
                 hidden_size=64):
        
        self.num_agents = num_agents
        self.state_size = state_size
        self.action_size = action_size
        self.batch_size = batch_size
        self.gradient_clip = gradient_clip
        self.model = ActorCritic(state_size, action_size, value_size=value_size, hidden_size=hidden_size, std=std)
        self.optimizer = optim.Adam(self.model.parameters(), lr=LR, weight_decay=weight_decay)
        
        
        self.model.train()
        
    def act(self, states):
        """Sample actions for a batch of states (one state vector per agent).

        Used when collecting trajectories.
        """
        dist, values = self.model(states) # pass the states through the network to get a distribution over actions and the state values
        actions = dist.sample() # sample an action from the distribution
        log_probs = dist.log_prob(actions) # calculate the log probability of each action dimension
        log_probs = log_probs.sum(-1).unsqueeze(-1) # sum the log probabilities over the action dimensions and reshape to (batch_size, 1)
        
        return actions, log_probs, values
    
    def learn(self, states, actions, log_probs_old, returns, advantages, sgd_epochs=4):
        """Performs a learning step given a batch of experiences.

        Remember: PPO performs `sgd_epochs` (usually 4) passes over each batch, updating the
        weights on every minibatch with the clipped proximal policy ratio objective.
        """

        num_batches = states.size(0) // self.batch_size
        for _ in range(sgd_epochs):
            batch_ind = 0
            for _ in range(num_batches):
                # Take the next consecutive slice of the rollout as a minibatch
                sampled_states = states[batch_ind:batch_ind+self.batch_size, :]
                sampled_actions = actions[batch_ind:batch_ind+self.batch_size, :]
                sampled_log_probs_old = log_probs_old[batch_ind:batch_ind+self.batch_size, :]
                sampled_returns = returns[batch_ind:batch_ind+self.batch_size, :]
                sampled_advantages = advantages[batch_ind:batch_ind+self.batch_size, :]

                # ppo_loss is called with its default clip_epsilon, c1 and beta
                L = ppo_loss(self.model, sampled_states, sampled_actions, sampled_log_probs_old, sampled_returns, sampled_advantages)

                self.optimizer.zero_grad()
                L.backward()
                nn.utils.clip_grad_norm_(self.model.parameters(), self.gradient_clip)
                self.optimizer.step()

                batch_ind += self.batch_size
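
`learn()` walks through the rollout in order, so every minibatch contains consecutive samples. A common variation, which this notebook does not use, is to draw shuffled minibatches in each epoch. The helper below is a sketch of that alternative; `minibatch_indices` is a hypothetical name.

```python
import numpy as np


def minibatch_indices(num_samples, batch_size, rng):
    """Yield shuffled index arrays for num_samples // batch_size full minibatches."""
    permutation = rng.permutation(num_samples)
    num_batches = num_samples // batch_size
    for b in range(num_batches):
        yield permutation[b * batch_size:(b + 1) * batch_size]


rng = np.random.default_rng(0)
# 2000 samples correspond to MAX_T = 1000 steps collected for 2 agents.
batches = list(minibatch_indices(num_samples=2000, batch_size=32, rng=rng))
print(len(batches), batches[0].shape)
```

As in `learn()`, only full minibatches are used: with 2000 samples and a batch size of 32 there are 62 batches, and the remaining 16 samples are skipped in that epoch.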



## Loss

In [57]:
# PPO loss function, called from Agent.learn()
def ppo_loss(model, states, actions, log_probs_old, returns, advantages, clip_epsilon=0.1, c1=0.5, beta=0.01):
    dist, values = model(states)
    
    # Joint log-probability of each action vector under the current policy
    log_probs = dist.log_prob(actions)
    log_probs = torch.sum(log_probs, dim=1, keepdim=True)
    entropy = dist.entropy().mean()
    
    # r(θ) = π(a|s) / π_old(a|s), computed as exp(log π - log π_old) because we store log-probabilities
    ratio = (log_probs - log_probs_old).exp()
    
    # Surrogate objective: L_CPI(θ) = r(θ) * A
    obj = ratio * advantages 
    
    # clip ( r(θ), 1-Ɛ, 1+Ɛ )*A
    obj_clipped = ratio.clamp(1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    
    # L_CLIP(θ) = E { min[ r(θ)A, clip ( r(θ), 1-Ɛ, 1+Ɛ )*A ] } + β * entropy, negated to obtain a loss.
    # `entropy` is already a scalar, so it needs no further mean.
    policy_loss = -torch.min(obj, obj_clipped).mean() - beta * entropy
    
    # L_VF(θ) = c1 * E[ ( R_t - V(s_t) )^2 ]
    value_loss = c1 * (returns - values).pow(2).mean()
    
    return policy_loss + value_loss
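
To see what the clipping in `ppo_loss` does, it helps to evaluate the surrogate on a few hand-picked numbers. The sketch below uses NumPy instead of PyTorch and made-up probabilities; `clipped_surrogate` is an illustrative helper, not part of the notebook.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, clip_epsilon=0.1):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantage
    return np.minimum(unclipped, clipped)


# Made-up action probabilities under the old and the new policy.
log_probs_old = np.log(np.array([0.20, 0.20, 0.20, 0.20]))
log_probs_new = np.log(np.array([0.30, 0.30, 0.10, 0.10]))
advantages = np.array([1.0, -1.0, 1.0, -1.0])

# exp(log pi_new - log pi_old) is the same quantity as pi_new / pi_old.
ratio = np.exp(log_probs_new - log_probs_old)
objective = clipped_surrogate(ratio, advantages)
print(ratio)
print(objective)
```

The ratios are 1.5 and 0.5. With a positive advantage the gain from raising the probability is capped at 1.1, while with a negative advantage the penalty for raising it (-1.5) is kept in full, so the objective is a pessimistic bound on the unclipped one.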

## Advantage Calculation

In [58]:
def calculate_advantages(rollout, returns, num_agents, gamma=0.99, tau=0.95):
    """Given a rollout, computes the discounted returns and GAE advantages for each step.

    Each rollout entry stores `1 - dones` in its last slot, so `masks` below is 0
    where an episode ended and 1 otherwise.
    """
    num_steps = len(rollout) - 1
    processed_rollout = [None] * num_steps
    advantages = torch.zeros((num_agents, 1))

    for i in reversed(range(num_steps)):
        states, value, actions, log_probs, rewards, masks = map(lambda x: torch.Tensor(x), rollout[i])
        next_value = rollout[i + 1][1]

        masks = masks.unsqueeze(1)
        rewards = rewards.unsqueeze(1)

        # Discounted return: R_t = r_t + gamma * mask_t * R_{t+1}
        returns = rewards + gamma * masks * returns

        # Temporal-difference error: delta_t = r_t + gamma * mask_t * V(s_{t+1}) - V(s_t)
        td_error = rewards + gamma * masks * next_value.detach() - value.detach()
        
        # GAE: A_t = delta_t + gamma * tau * mask_t * A_{t+1}
        advantages = advantages * tau * gamma * masks + td_error
        processed_rollout[i] = [states, actions, log_probs, returns, advantages]

    # Concatenate along the appropriate dimension
    states, actions, log_probs_old, returns, advantages = map(lambda x: torch.cat(x, dim=0), zip(*processed_rollout))
    
    # Normalize advantages
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    
    return states, actions, log_probs_old, returns, advantages
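
The backward recursion in `calculate_advantages` is easier to follow on a single short trajectory. The following NumPy sketch applies the same three update rules to one agent; the function name `gae` and the toy numbers are illustrative only.

```python
import numpy as np


def gae(rewards, values, last_value, masks, gamma=0.99, tau=0.95):
    """Generalized advantage estimates and discounted returns for one trajectory.

    masks[t] is 0.0 if the episode ended at step t, otherwise 1.0.
    """
    num_steps = len(rewards)
    advantages = np.zeros(num_steps)
    returns = np.zeros(num_steps)
    next_value, next_advantage, next_return = last_value, 0.0, last_value
    for t in reversed(range(num_steps)):
        td_error = rewards[t] + gamma * masks[t] * next_value - values[t]
        next_advantage = td_error + gamma * tau * masks[t] * next_advantage
        next_return = rewards[t] + gamma * masks[t] * next_return
        advantages[t], returns[t] = next_advantage, next_return
        next_value = values[t]
    return advantages, returns


# Three steps, a reward of +0.1 on the last one, and a critic that predicts 0 everywhere.
advantages, returns = gae(
    rewards=[0.0, 0.0, 0.1],
    values=[0.0, 0.0, 0.0],
    last_value=0.0,
    masks=[1.0, 1.0, 0.0],
)
print(advantages)
print(returns)
```

With a zero critic, the returns decay by `gamma` per step going backwards (0.1, 0.099, 0.09801), while the advantages decay by `gamma * tau` (0.1, 0.09405, about 0.0885), which shows how `tau` shortens the credit horizon.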

## Training

In [59]:
import numpy as np
from collections import deque
import torch
from src.utils import test_agent
import os

def collect_trajectories(env, brain_name, agent, max_t):
    env_info = env.reset(train_mode=True)[brain_name]
    num_agents = len(env_info.agents)
    states = env_info.vector_observations
        
    rollout = []
    agents_rewards = np.zeros(num_agents)
    episode_rewards = []

    for _ in range(max_t):
        actions, log_probs, values = agent.act(states)
        env_info = env.step(actions.cpu().detach().numpy())[brain_name]
        next_states = env_info.vector_observations
        rewards = env_info.rewards 
        dones = np.array([1 if t else 0 for t in env_info.local_done])
        agents_rewards += rewards

        for j, done in enumerate(dones):
            if done:
                episode_rewards.append(agents_rewards[j])
                agents_rewards[j] = 0

        rollout.append([states, values.detach(), actions.detach(), log_probs.detach(), rewards, 1 - dones])

        states = next_states

    pending_value = agent.model(states)[-1]
    returns = pending_value.detach() 
    rollout.append([states, pending_value, None, None, None, None])
    
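    # Guard against NumPy's "Mean of empty slice" warning when no episode finished during the rollout
    mean_reward = np.mean(episode_rewards) if episode_rewards else 0.0
    return rollout, returns, episode_rewards, mean_reward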


def train(env, brain_name, agent, num_agents, n_episodes, max_t, run_name="testing_01"):
    print("Starting training...")
    env_info = env.reset(train_mode = True)[brain_name]
    all_scores = []
    all_scores_window = deque(maxlen=100)
    best_so_far = 0.0
        
    for i_episode in range(n_episodes):
        # Each iteration, N parallel actors collect T time steps of data
        rollout, returns, _, _ = collect_trajectories(env, brain_name, agent, max_t)
        
        states, actions, log_probs_old, returns, advantages = calculate_advantages(rollout, returns, num_agents)
        # print(f"States: {states.shape}. Actions: {actions.shape}. Log_probs_old: {log_probs_old.shape}. Returns: {returns.shape}. Advantages: {advantages.shape}")
        agent.learn(states, actions, log_probs_old, returns, advantages)
        
        test_mean_reward = test_agent(env, agent, brain_name)

        all_scores.append(test_mean_reward)
        all_scores_window.append(test_mean_reward)

        if np.mean(all_scores_window) > best_so_far:
            os.makedirs(f"./ckpt/{run_name}/", exist_ok=True)  # also creates ./ckpt if it is missing
            torch.save(agent.model.state_dict(), f"./ckpt/{run_name}/ppo_checkpoint_{np.mean(all_scores_window)}.ckpt")
            best_so_far = np.mean(all_scores_window)
            # NOTE: 30 is the solve threshold carried over from the Reacher project
            if np.mean(all_scores_window) > 30:
                
                print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(all_scores_window)))
                # break       
        
        print('Episode {}, Total score this episode: {}, Last {} average: {}'.format(i_episode + 1, test_mean_reward, min(i_episode + 1, 100), np.mean(all_scores_window)) )
        
    return all_scores

In [20]:

import time
env_info = env.reset(train_mode=True)[brain_name]
time.sleep(2)

# Environment variables
num_agents = len(env_info.agents)
state_size = env_info.vector_observations.shape[1]
action_size = brain.vector_action_space_size
print(f"Number of agents: {num_agents}. State size: {state_size}. Action size: {action_size}")
# Instantiate the agent
agent = Agent(num_agents, state_size, action_size)

# Train the agent
train(env, brain_name, agent, num_agents, EPISODES, MAX_T)
env.close()

Starting training...




Episode 1, Total score this episode: -0.004999999888241291, Last 1 average: -0.004999999888241291
Episode 2, Total score this episode: -0.004999999888241291, Last 2 average: -0.004999999888241291
Episode 3, Total score this episode: -0.004999999888241291, Last 3 average: -0.004999999888241291
Episode 4, Total score this episode: -0.004999999888241291, Last 4 average: -0.004999999888241291
Episode 5, Total score this episode: -0.004999999888241291, Last 5 average: -0.004999999888241291
Episode 6, Total score this episode: 0.09500000160187483, Last 6 average: 0.011666667026778063
Episode 7, Total score this episode: -0.004999999888241291, Last 7 average: 0.009285714610346727
Episode 8, Total score this episode: -0.004999999888241291, Last 8 average: 0.007500000298023224
Episode 9, Total score this episode: -0.004999999888241291, Last 9 average: 0.006111111388438278
Episode 10, Total score this episode: -0.004999999888241291, Last 10 average: 0.005000000260770321
Episode 11, Total score t

  out=out, **kwargs)
  ret = ret.dtype.type(ret / rcount)


Episode 169, Total score this episode: -0.004999999888241291, Last 100 average: 0.1114500018581748
Episode 170, Total score this episode: 0.9950000150129199, Last 100 average: 0.12145000200718642
Episode 171, Total score this episode: 0.19500000309199095, Last 100 average: 0.12295000202953815
Episode 172, Total score this episode: -0.004999999888241291, Last 100 average: 0.12245000202208757
Episode 173, Total score this episode: 0.6450000097975135, Last 100 average: 0.12795000210404395
Episode 174, Total score this episode: 0.6950000105425715, Last 100 average: 0.1349500022083521
Episode 175, Total score this episode: 0.6450000097975135, Last 100 average: 0.14145000230520963
Episode 176, Total score this episode: -0.004999999888241291, Last 100 average: 0.14145000230520963
Episode 177, Total score this episode: 0.09500000160187483, Last 100 average: 0.1424500023201108
Episode 178, Total score this episode: 0.09500000160187483, Last 100 average: 0.1429500023275614
Episode 179, Total sco
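
The "Last N average" column in the log comes from a `deque(maxlen=100)`. As a check, this sketch recomputes the value printed for episode 6 from the six scores logged above, and then shows that the window never grows beyond 100 entries.

```python
from collections import deque

window = deque(maxlen=100)

# Scores of episodes 1 to 6, copied from the training log above.
logged_scores = [-0.004999999888241291] * 5 + [0.09500000160187483]
for score in logged_scores:
    window.append(score)

average = sum(window) / len(window)
print(len(window), average)

# Once more than 100 scores have been appended, the oldest ones are dropped.
for _ in range(150):
    window.append(0.0)
print(len(window))
```

The recomputed average is about 0.01167, in line with the "Last 6 average" entry of the log.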

# Explanation

This PPO implementation was originally written for the multi-agent Reacher environment, which was not a collaboration or competition setting: each agent performed an independent task.


In the Tennis environment, both agents take an action at each time step and each receives its own new state. Since the rewards are given relative to each agent, the environment can be trained with the same algorithm as the Reacher without modification.

Both agents use the same policy, so each episode yields two trajectories, one per player.
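
Because the policy is shared, the two trajectories simply end up as extra rows in one training batch. The sketch below illustrates the shapes with NumPy and a tiny number of steps; the constants are illustrative, and the concatenation mirrors the `torch.cat(..., dim=0)` call in `calculate_advantages`.

```python
import numpy as np

T, NUM_AGENTS, STATE_SIZE = 5, 2, 24   # tiny T, for illustration only

# One (NUM_AGENTS, STATE_SIZE) observation array per time step; the fill value marks the step.
per_step_states = [np.full((NUM_AGENTS, STATE_SIZE), float(t)) for t in range(T)]

# Stack all steps along the first axis: both agents' samples share one batch.
batch = np.concatenate(per_step_states, axis=0)
print(batch.shape)
print(batch[:4, 0])
```

Rows alternate between the agents (step 0 for agent 0, step 0 for agent 1, step 1 for agent 0, and so on), so with `MAX_T = 1000` a rollout contributes 2000 samples to the update.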

# Test trained models

In [6]:
import time
# Test
def test_agent(env, agent, brain_name, test_episodes=10):
    env_info = env.reset(train_mode = True)[brain_name]
    num_agents = len(env_info.agents)
    scores = np.zeros(num_agents)  # NOTE: accumulated over all test episodes, not reset per episode
    for i in range(test_episodes):
        states = env_info.vector_observations
        while True:
            actions, _, _= agent.act(states)
            env_info = env.step(actions.cpu().detach().numpy())[brain_name]
            next_states = env_info.vector_observations
            rewards = env_info.rewards
            # if np.asarray(rewards).any() != 0:
            # print(f"Rewards: {rewards}")
            dones = env_info.local_done
            scores += env_info.rewards
            states = next_states
            if np.any(dones):
                break
            # time.sleep(1/1000000000)
        env_info = env.reset(train_mode = True)[brain_name]
        print('Rewards: {}. Score (max over agents) from episode {}: {}'.format(rewards, i, np.max(scores)))
    return np.mean(scores)

def load_trained_agent(env, ckpt_file):
    brain_name = env.brain_names[0]
    brain = env.brains[brain_name]    
    env_info = env.reset(train_mode=False)[brain_name]
    num_agents = len(env_info.agents)
    state_size = env_info.vector_observations.shape[1]
    action_size = brain.vector_action_space_size
    agent = Agent(num_agents, state_size, action_size)
    
    # policy_solution = torch.load('../ckpt/ppo_checkpoint_39.050999127142134.pth')
    policy_solution = torch.load(ckpt_file)
    agent.model.load_state_dict(policy_solution)
    # agent.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    
    return agent



In [12]:
ckpt_file = "./ckpt/testing_01/ppo_checkpoint_1.6348500244598836.ckpt"
agent = load_trained_agent(env, ckpt_file)
test_agent(env, agent, brain_name)



Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 0: 0.10000000149011612
Rewards: [-0.009999999776482582, 0.0]. Score (max over agents) from episode 1: 1.6900000255554914
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 2: 3.1800000481307507
Rewards: [-0.009999999776482582, 0.0]. Score (max over agents) from episode 3: 4.780000071972609
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 4: 6.0700000915676355
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 5: 6.860000103712082
Rewards: [-0.009999999776482582, 0.0]. Score (max over agents) from episode 6: 7.860000118613243
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 7: 8.850000133737922
Rewards: [0.0, 0.0]. Score (max over agents) from episode 8: 11.45000017248094
Rewards: [0.0, -0.009999999776482582]. Score (max over agents) from episode 9: 11.570000173524022


11.555000173859298

In [13]:
env.close()

In [None]:
# Training Hyperparameters
EPISODES = 150         # Number of episodes to train for
# MAX_T = 2048          # Max length of trajectory
MAX_T = 1000            # Max length of trajectory
SGD_EPOCHS = 4          # Number of gradient descent steps per batch of experiences
BATCH_SIZE = 32         # minibatch size
BETA = 0.01             # entropy regularization parameter
GRADIENT_CLIP = 5       # gradient clipping parameter

# optimizer parameters
LR = 3e-4               # learning rate
# EPSILON = 1e-5          # optimizer epsilon
WEIGHT_DECAY = 1.e-4      # L2 weight decay


# PPO parameters
GAMMA = 0.99            # Discount factor
TAU = 0.95              # GAE parameter
PPO_CLIP_EPSILON = 0.1  # ppo clip parameter

# max_ts = [200, 500, 1000, 1500]
max_ts = [1000]
# sgd_epochs = [2, 4, 6, 10]
sgd_epochs = [4]
# lrs = [1e-3, 1e-4, 1e-5]
# ppos_clip_epsilons = [0.1, 0.2]
ppos_clip_epsilons = [0.1]
# stds = [0.01, 0.05, 0.1, 0.15]
stds = [0.01]
# LRS = [1e-3, 1e-4, 1e-5]
LRS = [3e-4]



from src.utils import save_training_scores

def test_multiple():
    # NOTE: train() only receives max_t from this grid. sgd_epoch and ppo_clip_epsilon
    # currently end up in the run name only, and Agent() reads the module-level LR.
    for max_t in max_ts:
        for sgd_epoch in sgd_epochs:
            for ppo_clip_epsilon in ppos_clip_epsilons:
                training_name = f"max_t_{max_t}_sgd_epoch_{sgd_epoch}_ppo_clip_epsilon_{ppo_clip_epsilon}"
                print(f"Training with max_t: {max_t}, sgd_epoch: {sgd_epoch}, ppo_clip_epsilon: {ppo_clip_epsilon}")

                EPISODES = 1000         # Number of episodes to train for in this sweep

                # Instantiate the agent
                agent = Agent(num_agents, state_size, action_size)

                # Train the agent
                scores = train(env, brain_name, agent, num_agents, EPISODES, max_t, run_name=training_name)

                # Save data for plotting later
                save_training_scores(scores, f"./training_scores/{training_name}.csv")

    # Close the environment once, after every configuration has been trained
    env.close()


test_multiple()
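
The three nested loops can also be written as a single loop over `itertools.product`, which scales better when more hyperparameters join the grid. This is only a sketch: it builds the run names used above for an example grid and does not start any training; `run_name` is a hypothetical helper.

```python
from itertools import product

# Example grid; the values are taken from the commented-out lists above.
max_ts = [1000]
sgd_epochs = [2, 4]
ppos_clip_epsilons = [0.1, 0.2]


def run_name(max_t, sgd_epoch, ppo_clip_epsilon):
    """Build the run name in the same format as test_multiple()."""
    return f"max_t_{max_t}_sgd_epoch_{sgd_epoch}_ppo_clip_epsilon_{ppo_clip_epsilon}"


run_names = [run_name(*combo) for combo in product(max_ts, sgd_epochs, ppos_clip_epsilons)]
for name in run_names:
    print(name)
```

Each tuple produced by `product` corresponds to one training run, so the body of the innermost loop could be moved into a function that takes the tuple as its arguments.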