# CE-40719: Deep Learning
## HW6 - Deep Reinforcement Learning
(20 points)

#### Name: Seyed Shayan Nazemi
#### Student No.: 98209037

In this assignment we train a simple Actor-Critic model to solve classical control problems. We use a batched version of the standard [gym](https://gym.openai.com/) library, provided in `multi_env.py`. The only difference from standard gym is that `multi_env.py` runs a batch of environments instead of a single one, so observations have shape `(batch_size, observation_size)`. We focus on the `CartPole-v1` problem, but the same approach applies to other problems as well.

## Algorithm

The vanilla actor-critic algorithm is as follows:

1.   Sample a batch $\{(s_i, a_i, r_i, s_{i + 1})\}_i$ under policy $\pi_\theta$.
2.   Fit $V_\phi^{\pi_\theta}(s_i)$ to $r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})$ by minimizing squared error $\|r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V_\phi^{\pi_\theta}(s_i)\|^2$.
3.   Update the policy by solving $\max_{\theta}~ \sum_{i} \log \pi_\theta(a_i|s_i) \left[ r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V^{\pi_\theta}_\phi(s_i) \right]$.
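
As a small numeric illustration of steps 2 and 3, the sketch below computes the one-step target $r_i + \gamma V(s_{i+1})$ and the resulting advantage for a made-up batch of two transitions. The numbers and the helper name `td_targets_and_advantages` are illustrative assumptions, not part of the assignment code.

```python
def td_targets_and_advantages(rewards, values, next_values, gamma):
    """Return (targets, advantages) for a batch of one-step transitions."""
    # Step 2 regression target: r_i + gamma * V(s_{i+1})
    targets = [r + gamma * v_next for r, v_next in zip(rewards, next_values)]
    # Step 3 weight on log pi(a_i | s_i): target minus the current estimate V(s_i)
    advantages = [t - v for t, v in zip(targets, values)]
    return targets, advantages


# Hypothetical batch of two transitions
rewards = [1.0, 1.0]
values = [10.0, 5.0]       # V(s_i)
next_values = [9.0, 0.0]   # V(s_{i+1})

targets, advantages = td_targets_and_advantages(rewards, values, next_values, gamma=0.99)
print(targets)     # roughly [9.91, 1.0]
print(advantages)  # roughly [-0.09, -4.0]
```

Both advantages are negative here, so step 3 would lower the log-probability of the sampled actions, while step 2 would pull $V(s_i)$ down toward the targets.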

We need two parametrized models, one for the value function $V^{\pi_\theta}_\phi$ and one for the stochastic policy $\pi_\theta$. Since both $\pi_\theta$ and $V^{\pi_\theta}_\phi$ are functions of the state $s$, instead of modeling each with a separate neural network, we can model both with a single network with shared parameters. In other words, we train a single network that outputs both $\pi_\theta(a|s)$ and $V^{\pi_\theta}_\phi(s)$. To train this network we combine steps 2 and 3 of the main algorithm and optimize the following objective:
$$\min_{\theta, \phi}~ \sum_{i} \left( -\log \pi_\theta(a_i|s_i) \left[ r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V^{\pi_\theta}_\phi(s_i) \right] + \|r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})- V_\phi^{\pi_\theta}(s_i)\|^2 \right)$$

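Note that the gradient must be backpropagated only through $\log \pi_\theta(a_i|s_i)$ and through $V_\phi^{\pi_\theta}(s_i)$ in the squared error; the bracketed advantage that multiplies the log-probability and the target $r_i + \gamma V_\phi^{\pi_\theta}(s_{i+1})$ are treated as constants. A negative entropy term $-\mathcal{H} (\pi_\theta(a_i|s_i))$ can also be added to the above objective to encourage exploration.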
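
The gradient restrictions above can be sketched in PyTorch with `detach()`. This is a minimal sketch under stated assumptions: the function name `actor_critic_loss`, the `entropy_coef` weight, and the toy numbers are hypothetical, and all tensors are assumed to have shape `(batch, 1)`.

```python
import torch


def actor_critic_loss(log_prob, value, value_next, reward, gamma,
                      entropy=None, entropy_coef=0.0):
    """Combined policy + value loss for a batch of one-step transitions."""
    # The bootstrapped target is treated as a constant (no gradient into V(s_{i+1})).
    target = (reward + gamma * value_next).detach()
    # The advantage only weights the log-probability, so it is detached as well.
    advantage = (target - value).detach()
    policy_loss = -(log_prob * advantage).sum()
    # Here the gradient flows only through V(s_i).
    value_loss = ((target - value) ** 2).sum()
    loss = policy_loss + value_loss
    if entropy is not None:
        # Negative entropy term: lower loss for more exploratory policies.
        loss = loss - entropy_coef * entropy.sum()
    return loss


# Toy check with hand-picked numbers (assumed values, for illustration only)
value = torch.tensor([[10.0], [5.0]], requires_grad=True)
value_next = torch.tensor([[9.0], [0.0]], requires_grad=True)
log_prob = torch.tensor([[-0.7], [-0.2]], requires_grad=True)
reward = torch.ones(2, 1)

loss = actor_critic_loss(log_prob, value, value_next, reward, gamma=0.99)
loss.backward()
print(value_next.grad)  # None: nothing was backpropagated through V(s_{i+1})
print(value.grad)       # gradient of the squared error w.r.t. V(s_i)
print(log_prob.grad)    # equals minus the (constant) advantage
```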

## Setup

In [240]:
import gym
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributions as dist

from multi_env import SubprocVecEnv

In [241]:
env_name = 'CartPole-v1'
num_envs = 16

def make_env():
    def _thunk():
        env = gym.make(env_name)
        return env

    return _thunk

envs = [make_env() for i in range(num_envs)]
envs = SubprocVecEnv(envs)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 1. Model (8 Points)

To define a stochastic policy we use the [`torch.distributions`](https://pytorch.org/docs/stable/distributions.html) module. The network's shared parameters are defined in a simple MLP. The network has two heads: one for $V$, which takes the MLP's output and produces a scalar, and one for $\pi$, which takes the MLP's output and produces a categorical distribution over actions.
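
To make the tensor shapes concrete, here is a small sketch of how a `Categorical` distribution behaves on a batch. The zero-valued `logits` and `value` tensors are stand-ins for the two head outputs (an assumption for illustration), and `logits=` is used here instead of an explicit softmax.

```python
import torch
import torch.distributions as dist

batch_size, num_actions = 4, 2
logits = torch.zeros(batch_size, num_actions)  # stand-in for the policy head output
value = torch.zeros(batch_size, 1)             # stand-in for the value head output

# Categorical normalizes the logits internally (equivalent to a softmax over the last dim).
policy = dist.Categorical(logits=logits)
action = policy.sample()            # one action index per batch element
log_prob = policy.log_prob(action)  # one log-probability per batch element

print(action.shape, log_prob.shape, policy.entropy().shape)

# Shape pitfall: log_prob is (batch,) while the value head is (batch, 1).
bad = log_prob.unsqueeze(0) * value   # (1, 4) * (4, 1) broadcasts to (4, 4)
good = log_prob.unsqueeze(1) * value  # (4, 1) * (4, 1) stays (4, 1)
print(bad.shape, good.shape)
```

Because the policy head yields per-sample vectors of shape `(batch,)` and the value head yields `(batch, 1)`, the two should be aligned explicitly before they are multiplied in a loss.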

In [242]:
class ActorCritic(nn.Module):
    def __init__(self, state_size, hidden_size, num_actions):
        super(ActorCritic, self).__init__()
        #################################################################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################
        # state_size: size of the input state
        # hidden_size: size of the first hidden layer (an int in this solution;
        #              the second hidden layer is fixed at 32 units)
        # num_actions: number of actions
        # do not use batch norm for any layer in this network
        #################################################################################
        # Shared MLP trunk (bias terms are disabled in this solution).
        self.fc1 = nn.Linear(state_size, hidden_size, bias=False)
        self.fc2 = nn.Linear(hidden_size, 32, bias=False)

        # Two heads on top of the trunk: action scores and a scalar state value.
        self.fc_policy = nn.Linear(32, num_actions, bias=False)
        self.fc_value = nn.Linear(32, 1, bias=False)
        #################################################################################
        #                                   THE END                                     #
        #################################################################################


    def forward(self, x):
        #################################################################################
        #                          COMPLETE THE FOLLOWING SECTION                       #
        #################################################################################

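        # Shared trunk.
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        # Value head: one scalar per state, shape (batch, 1).
        value = self.fc_value(x)
        # Policy head: categorical distribution over actions.
        policy = dist.Categorical(F.softmax(self.fc_policy(x), dim=-1))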
        #################################################################################
        #                                   THE END                                     #
        #################################################################################
        return policy, value

In [243]:
def test_model(model):
    env = gym.make(env_name)
    total_reward = 0
    #################################################################################
    #                          COMPLETE THE FOLLOWING SECTION                       #
    #################################################################################
    # run given model for a single episode and compute total reward.
    #################################################################################
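    done = False
    # Fill the observation window with copies of the first observation.
    obs = [torch.FloatTensor(env.reset()).to(device)] * num_state_obs

    while not done:
        # Concatenate the window into a single state of shape (1, state_size).
        state = torch.cat(obs, dim=0).unsqueeze(0)

        # No gradients are needed during evaluation.
        with torch.no_grad():
            policy_dist, _ = model(state)
        action = policy_dist.sample()

        next_state, reward, done, _ = env.step(action.item())
        next_state = torch.FloatTensor(next_state).to(device)
        # Slide the window: drop the oldest observation, append the newest.
        obs = [*obs[1:], next_state]

        total_reward += reward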
    #################################################################################
    #                                   THE END                                     #
    #################################################################################
    return total_reward

## 2. Objective and Training (12 Points)

A single observation is not always enough to determine the state of an environment, so we use the `num_state_obs` most recent observations at time t as the state of the environment at time t. Initialize the model and train it with the Adam optimizer. You should be able to reach a reward of 500 in fewer than 20000 iterations.
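
The observation window described above can be sketched with a fixed-length `deque`. This is only an illustration on plain Python lists: the helper names `make_frame_stack` and `stacked_state` are hypothetical, and the notebook itself keeps a list of tensors instead.

```python
from collections import deque


def make_frame_stack(first_obs, num_state_obs):
    """Start a window by repeating the first observation `num_state_obs` times."""
    return deque([first_obs] * num_state_obs, maxlen=num_state_obs)


def stacked_state(frames):
    """Concatenate the window (oldest first) into one flat state vector."""
    return [x for obs in frames for x in obs]


frames = make_frame_stack([0.0, 0.1], num_state_obs=3)
frames.append([1.0, 1.1])  # the newest observation pushes out the oldest one
state = stacked_state(frames)
print(state)  # [0.0, 0.1, 0.0, 0.1, 1.0, 1.1]
```

With this scheme the network input has `num_state_obs` times the size of a single observation.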

In [244]:
#################################################################################
#                          COMPLETE THE FOLLOWING SECTION                       #
#################################################################################
# experiment with different parameters and models to get the best result
#################################################################################
num_iterations = 40000
num_state_obs = 10
gamma = 0.99

obs_size = 10
state_size = num_state_obs * envs.observation_space.shape[0]
num_actions = envs.action_space.n

model = ActorCritic(state_size, 64, num_actions).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
#################################################################################
#                                   THE END                                     #
#################################################################################

In [245]:
obs = [torch.FloatTensor(envs.reset())] * num_state_obs
for t in range(num_iterations):
    model.train()
    #################################################################################
    #                          COMPLETE THE FOLLOWING SECTION                       #
    #################################################################################
    # implement the algorithm
    #################################################################################
    model.to(device)
    
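    # Stack the last `num_state_obs` observations into one state: (num_envs, state_size).
    state = torch.cat(obs, dim=1).to(device)
    policy_dist, value = model(state)
    action = policy_dist.sample()

    next_state, reward, done, _ = envs.step(action.cpu().numpy())
    next_state = torch.FloatTensor(next_state)

    # Slide the observation window forward by one step.
    obs = [*obs[1:], next_state]

    # Value of the next state; it only enters the target, so no gradient is needed.
    with torch.no_grad():
        _, value_next = model(torch.cat(obs, dim=1).to(device))
    # Zero out the bootstrap term for environments whose episode just ended.
    done_mask = torch.tensor(1 - done, dtype=torch.float, device=device).unsqueeze(1)

    reward = torch.FloatTensor(reward).to(device).unsqueeze(1)

    Q_value = reward + done_mask * (gamma * value_next)
    advantage = Q_value - value

    # log_prob has shape (num_envs,); unsqueeze(1) gives (num_envs, 1) to match the advantage.
    # unsqueeze(0) would instead broadcast the product to (num_envs, num_envs).
    log_prob = policy_dist.log_prob(action).unsqueeze(1)
    policy_loss = -torch.sum(log_prob * advantage.detach())
    value_loss = F.mse_loss(value, Q_value.detach(), reduction='sum')
    loss = policy_loss + value_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically restart all environments and refill the observation window.
    if t % 1000 == 999:
        obs = [torch.FloatTensor(envs.reset())] * num_state_obs
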
    #################################################################################
    #                                   THE END                                     #
    #################################################################################
    if t % 1000 == 999:
        print('iteration {:5d}: average reward = {:5f}'.format(t + 1, np.mean([test_model(model) for _ in range(10)])))

iteration  1000: average reward = 17.100000
iteration  2000: average reward = 20.100000
iteration  3000: average reward = 13.400000
iteration  4000: average reward = 19.900000
iteration  5000: average reward = 39.500000
iteration  6000: average reward = 33.200000
iteration  7000: average reward = 117.600000
iteration  8000: average reward = 46.200000
iteration  9000: average reward = 55.700000
iteration 10000: average reward = 64.500000
iteration 11000: average reward = 88.100000
iteration 12000: average reward = 139.900000
iteration 13000: average reward = 55.400000
iteration 14000: average reward = 36.000000
iteration 15000: average reward = 12.000000
iteration 16000: average reward = 88.300000
iteration 17000: average reward = 500.000000
iteration 18000: average reward = 500.000000
iteration 19000: average reward = 61.800000
iteration 20000: average reward = 500.000000
iteration 21000: average reward = 500.000000
iteration 22000: average reward = 21.400000
iteration 23000: average r