# CE-40719: Deep Learning
## HW6 - Deep Reinforcement Learning
(20 points)

#### Name: Seyed Shayan Nazemi
#### Student No.: 98209037

In this assignment we train a simple Actor-Critic model to solve classical control problems. We use a batched version of the standard [gym](https://gym.openai.com/) library, given to you in `multi_env.py`. The only difference between the two versions is that `multi_env.py` runs a batch of environments instead of a single one, so observations have shape `(batch_size, observation_size)`. We focus on the `CartPole-v1` problem, but you can apply the same approach to other problems as well.

## Algorithm

The vanilla actor-critic algorithm is as follows:

1.   Sample a batch $\{(s_i, a_i, r_i, s_{i + 1})\}_i$ under policy $\pi_\theta$.
2.   Fit $V_\phi^{\pi_\theta}(s_i)$ to $r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})$ by minimizing squared error $\|r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V_\phi^{\pi_\theta}(s_i)\|^2$.
3.   Update the policy by solving $\max_{\theta}~ \sum_{i} \log \pi_\theta(a_i|s_i) \left[ r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V^{\pi_\theta}_\phi(s_i) \right]$.
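
As a small worked example of the quantity that appears in steps 2 and 3, the sketch below computes the one-step error $r_i + \gamma V(s_{i+1}) - V(s_i)$ for a hand-made batch. The helper name `td_errors` and all numbers are made up for illustration; they are not part of the assignment code.

```python
def td_errors(rewards, values, next_values, gamma):
    """One-step TD errors r_i + gamma * V(s_{i+1}) - V(s_i) for a batch."""
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]

# Hypothetical batch of three transitions.
rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 0.5]       # V(s_i)
next_values = [1.5, 0.5, 0.0]  # V(s_{i+1})

deltas = td_errors(rewards, values, next_values, gamma=0.5)
print(deltas)  # [-0.25, -0.25, 0.5]

# Step 2 minimizes the sum of squared errors.
value_loss = sum(d ** 2 for d in deltas)
print(value_loss)  # 0.375
```

A positive error means the sampled action did better than the critic expected, so step 3 raises its log-probability; a negative error lowers it.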

We need two parametrized models, one for the value function $V^{\pi_\theta}_\phi$ and one for the stochastic policy $\pi_\theta$. Since both $\pi_\theta$ and $V^{\pi_\theta}_\phi$ are functions of the state $s$, instead of modeling each with a separate neural network, we can model both with a single network with shared parameters. In other words, we train a single network that outputs both $\pi_\theta(a|s)$ and $V^{\pi_\theta}_\phi(s)$. To train this network we combine steps 2 and 3 of the main algorithm and optimize the following objective:
$$\min_{\theta, \phi}~ \sum_{i} \left( -\log \pi_\theta(a_i|s_i) \left[ r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V^{\pi_\theta}_\phi(s_i) \right] + \|r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V_\phi^{\pi_\theta}(s_i)\|^2 \right)$$

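Note that the gradient must be backpropagated only through $\log \pi_\theta(a_i|s_i)$ in the policy term and through $V_\phi^{\pi_\theta}(s_i)$ in the squared error; the bracketed advantage in the policy term and the target $r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})$ are treated as constants. A negative entropy term $-\mathcal{H} (\pi_\theta(a_i|s_i))$ can also be added to the above objective to encourage exploration.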
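
One way to realize the gradient restriction above in PyTorch is to detach the target and the advantage, as in the following minimal sketch. The function name `actor_critic_loss`, the `entropy_coef` weight, and the toy tensors are assumptions made for illustration, not part of the assignment template.

```python
import torch

def actor_critic_loss(log_probs, values, rewards, next_values, gamma,
                      entropy=None, entropy_coef=0.01):
    # Target r + gamma * V(s'): detached, so no gradient reaches V(s_{i+1}).
    targets = (rewards + gamma * next_values).detach()
    td_error = targets - values
    # Policy term: the advantage is a constant weight on log pi(a|s).
    policy_loss = -(log_probs * td_error.detach()).sum()
    # Value term: the gradient flows only through V(s_i).
    value_loss = td_error.pow(2).sum()
    loss = policy_loss + value_loss
    if entropy is not None:
        # Optional exploration bonus (entropy_coef is a hypothetical weight).
        loss = loss - entropy_coef * entropy.sum()
    return loss

# Toy tensors standing in for network outputs.
log_probs = torch.tensor([-0.7, -0.4], requires_grad=True)
values = torch.tensor([2.0, 1.5], requires_grad=True)
next_values = torch.tensor([1.5, 0.5], requires_grad=True)
rewards = torch.tensor([1.0, 1.0])

loss = actor_critic_loss(log_probs, values, rewards, next_values, gamma=0.5)
loss.backward()
print(next_values.grad)  # None: nothing flows back through V(s_{i+1})
```

After `backward()`, `log_probs.grad` equals the negated advantage and `values.grad` comes only from the squared error, which matches the two gradient paths described above.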

## Setup

In [8]:
import gym
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributions as dist

from multi_env import SubprocVecEnv

In [9]:
env_name = 'CartPole-v1'
num_envs = 16

def make_env():
    def _thunk():
        env = gym.make(env_name)
        return env

    return _thunk

envs = [make_env() for i in range(num_envs)]
envs = SubprocVecEnv(envs)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 1. Model (8 Points)

To define a stochastic policy we use the [`torch.distributions`](https://pytorch.org/docs/stable/distributions.html) module. The network's shared parameters are defined in a simple MLP. The network has two heads: one for $V$, which takes the MLP's output and produces a scalar, and one for $\pi$, which takes the MLP's output and produces a categorical distribution over actions.
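
For orientation, here is a short sketch of how a `Categorical` policy head is typically used: build the distribution from raw scores, sample actions, and read off log-probabilities and entropy. The logits below are made-up numbers standing in for the output of the policy head.

```python
import torch
import torch.distributions as dist

# Made-up scores for a batch of 2 states and 2 actions.
logits = torch.tensor([[0.0, 0.0],
                       [2.0, 0.0]])

policy = dist.Categorical(logits=logits)  # softmax is applied internally
actions = policy.sample()                 # shape (2,), one action per state
log_probs = policy.log_prob(actions)      # shape (2,), log pi(a_i | s_i)
entropy = policy.entropy()                # shape (2,), H(pi(. | s_i))

print(policy.probs[0])  # tensor([0.5000, 0.5000])
```

Passing `logits=` is equivalent to passing `probs=F.softmax(logits, dim=-1)`, but avoids computing the softmax explicitly.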

In [10]:
class ActorCritic(nn.Module):
    def __init__(self, state_size, hidden_size, num_actions):
        super(ActorCritic, self).__init__()
        #################################################################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
        # state_size: size of the input state
        # hidden_size: a list containing size of each mlp hidden layer in order
        # num_action: number of actions
        # do not use batch norm for any layer in this network
        #################################################################################
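        # NOTE: this implementation takes hidden_size as a single int (width of the
        # first hidden layer); the second hidden layer is fixed at 64 units.
        self.fc1 = nn.Linear(state_size, hidden_size, bias=False)
        self.fc2 = nn.Linear(hidden_size, 64, bias=False)
        # two heads on top of the shared MLP: action scores and a scalar state value
        self.fc_policy = nn.Linear(64, num_actions, bias=False)
        self.fc_value = nn.Linear(64, 1, bias=False)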
        #################################################################################
        #                                   THE END                                     #
        #################################################################################


    def forward(self, x):
        #################################################################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################

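        # shared MLP trunk
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        # value head: V(s), shape (..., 1)
        value = self.fc_value(x)
        # policy head: categorical distribution over actions (softmax applied to the logits)
        policy = dist.Categorical(logits=self.fc_policy(x))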
        #################################################################################
        #                                   THE END                                     #
        #################################################################################
        return policy, value

In [11]:
def test_model(model):
    env = gym.make(env_name)
    total_reward = 0
    #################################################################################
    #                          COMPLETE THE FOLLOWING SECTION                       #
    #################################################################################
    # run given model for a single episode and compute total reward.
    #################################################################################
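    state = torch.FloatTensor(env.reset()).to(device)
    done = False

    # roll out one full episode without tracking gradients
    with torch.no_grad():
        while not done:
            policy_dist, _ = model(state)
            action = policy_dist.sample()
            # a single (non-batched) env expects a plain int action
            next_state, reward, done, _ = env.step(action.item())
            state = torch.FloatTensor(next_state).to(device)

            total_reward += reward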
    #################################################################################
    #                                   THE END                                     #
    #################################################################################
    return total_reward

## 2. Objective and Training (12 Points)

A single observation is not always enough to determine the state of an environment, so we take the previous `num_state_obs` observations at time $t$ as the state of the environment at time $t$. Initialize the model and train it using the Adam optimizer. You should be able to reach a reward of 500 (the maximum for `CartPole-v1`) in fewer than 20000 iterations.
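
The idea of using the last `num_state_obs` observations as the state can be sketched with a fixed-length buffer, as below. This is only an illustration under assumed shapes (2 environments, 4 features per observation); the helper `stack_obs` is hypothetical and not part of `multi_env.py`.

```python
from collections import deque

import numpy as np

def stack_obs(history):
    """Concatenate the buffered observations along the feature axis."""
    return np.concatenate(list(history), axis=-1)

num_state_obs = 3
num_envs, obs_size = 2, 4  # assumed sizes for the example

# Start by repeating the first observation num_state_obs times.
first_obs = np.zeros((num_envs, obs_size))
history = deque([first_obs] * num_state_obs, maxlen=num_state_obs)

# After each environment step, append the new observation;
# the deque drops the oldest one automatically.
new_obs = np.ones((num_envs, obs_size))
history.append(new_obs)

state = stack_obs(history)
print(state.shape)  # (2, 12)
```

With this layout, the network's input size would be `num_state_obs * obs_size` rather than `obs_size`.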

In [12]:
#################################################################################
#                          COMPLETE THE FOLLOWING SECTION                       #
#################################################################################
# experiment with different parameters and models to get the best result
#################################################################################
num_iterations = 20000
num_state_obs = 10
gamma = 0.9

state_size = envs.observation_space.shape[0]  # 4 for CartPole-v1
num_actions = envs.action_space.n             # 2 for CartPole-v1

model = ActorCritic(state_size, 32, num_actions).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
#################################################################################
#                                   THE END                                     #
#################################################################################

In [13]:
model.to(device)
# Reset once and keep stepping across iterations. This assumes SubprocVecEnv
# resets finished environments automatically (as the OpenAI baselines version does).
state = torch.FloatTensor(envs.reset()).to(device)
for t in range(num_iterations):
    model.train()
    #################################################################################
    #                          COMPLETE THE FOLLOWING SECTION                       #
    #################################################################################
    # implement the algorithm
    #################################################################################
    log_probs = []
    advantages = []
    # collect num_state_obs transitions from every environment
    for j in range(num_state_obs):
        policy_dist, value = model(state)
        action = policy_dist.sample()
        next_state, reward, done, _ = envs.step(action.cpu().numpy())

        next_state = torch.FloatTensor(next_state).to(device)
        reward = torch.FloatTensor(reward).to(device).unsqueeze(1)
        not_done = 1.0 - torch.as_tensor(done, dtype=torch.float32, device=device).unsqueeze(1)

        # TD target: no gradient through V(s_{i+1}), no bootstrapping past an episode end
        with torch.no_grad():
            _, value_next = model(next_state)
        target = reward + gamma * not_done * value_next

        log_probs.append(policy_dist.log_prob(action).unsqueeze(0))
        advantages.append(target - value)
        state = next_state

    advantages = torch.cat(advantages, dim=1).T  # (num_state_obs, num_envs)
    log_probs = torch.cat(log_probs, dim=0)      # (num_state_obs, num_envs)

    # policy term: advantage is a constant weight on log pi; value term: squared TD error
    loss = -torch.sum(advantages.detach() * log_probs) + torch.sum(advantages ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if t % 100 == 99:
        print(model.fc1.weight.grad)
        print('Iteration {} completed'.format(t + 1))
    #################################################################################
    #                                   THE END                                     #
    #################################################################################
    if t % 1000 == 999:
        print('iteration {:5d}: average reward = {:5f}'.format(t + 1, np.mean([test_model(model) for _ in range(10)])))



[Output truncated: two partial dumps of `model.fc1.weight.grad` (four columns per row), the second cut off mid-row.]

KeyboardInterrupt: ignored