## Curiosity-Driven Exploration in RL

Exploration in RL is still a difficult problem, especially when applied to environments with high dimension state space such as Atari, mario, etc.

Some approaches come from maintaining a [pseudo-count](https://arxiv.org/abs/1606.01868) of state observation and assigning reward bonus for less visited states.

Another interesting approach, and the basis for this notebook implementation, is that of [curiosity driven exploration](https://arxiv.org/abs/1705.05363)

In this work, the agent relies on a compressed (more details below) world model, and assign a reward bonus to exploring states that provide surprise as compared to it's state model. In a very crude approximation of natural curiosity (we are curious when we observe things that don't match with what we expect to happen)

### Test environment

Due to my limited compute, I will be testing the algorithm on the [mountain-car](https://gym.openai.com/envs/MountainCar-v0/) environment. This environment requires a precise set of actions to push the cart to the top, since it requires the agent to use the momentum of yo-yoing between the two side of the the valley to reach the top. 

However one caveat that should be noted is that epsilon greedy exploration can still solve the environment. The good thing is that with curiosity driven exploration, the number of epochs to solve this environment should be much faster than that of say Q-learning, and espeically for on-policy algorithm such as PPO/REINFORCE.

## Curiosity Driven Exploration, ICM Module

At the heart of the paper is the ICM module. I've taken a screen grab of Figure 2 below for reference. With this diagram we can begin the construct the ICM Module.

![picture](img/img1.png)

### components needed

We can that we need to implement 3 components for ICM module. The encoder, forward model and inverse model. Below we will create these modules individually, and then join them together into one called ICM

In [68]:
import torch
from torch.autograd import Variable
from torch import nn
import torch.nn.functional as F
import numpy as np
import gym

class FeatureEncoder(nn.Module):
    """obseravation space encoder.
    
    output_dim should be smaller or equal to obs_dim
    
    """
    def __init__(self, obs_dim, hidden_dim, output_dim, hidden_act=F.tanh):
        super(FeatureEncoder,self).__init__()
        assert(output_dim <= obs_dim), "output_dim should be smaller or equal to input_dim"
        self.hidden_act = hidden_act
        self.w1 = nn.Linear(obs_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        x = self.hidden_act(self.w1(x))
        x = self.w2(x) # linear output to let output to be any range of values
        return x
    
class ForwardModel(nn.Module):
    """ The forward model, which takes action and encoded observation as input and predicts
    the next encoded state.
    
    """
    def __init__(self, embedded_dim, hidden_dim, action_dim, hidden_act=F.tanh):
        super(ForwardModel, self).__init__()
        self.hidden_act = hidden_act
        self.w1 = nn.Linear(embedded_dim+action_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, embedded_dim)
        
    def forward(self, x):
        x = self.hidden_act(self.w1(x))
        x = self.w2(x)
        return x # linear output to let output to be any range of values
        
class InverseModel(nn.Module):
    """The inverse model takes the encoded pre and post states (s_t and s_t+1) and predicts
    and action given these. FOr now we make our model to be for discrete actions only, will
    update it for general later.
    
    """
    def __init__(self, embedded_dim, hidden_dim, action_dim, 
                 action_type='discrete', hidden_act=F.tanh):
        super(InverseModel, self).__init__()
        self.hidden_act = hidden_act
        self.action_type = action_type
        self.w1 = nn.Linear(embedded_dim*2, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, action_dim)
        
    def forward(self, x):
        x = self.hidden_act(self.w1(x))
        x = self.w2(x)
        return x
        
        
class ICM(object):
    """Overall ICM model encoporating all the above.
    
    """
    def __init__(self, obs_dim, embedded_dim, action_dim, 
                 hidden_dim, reward_scale=0.5, hidden_act=F.tanh):
        self.obs_dim = obs_dim
        self.embedded_dim = embedded_dim
        self.action_dim = action_dim
        self.hidden_dim = hidden_dim
        self.reward_scale = reward_scale
        self.encoder = FeatureEncoder(obs_dim, hidden_dim, embedded_dim, hidden_act)
        self.i_model = InverseModel(embedded_dim, hidden_dim, act_dim)
        self.f_model = ForwardModel(embedded_dim, hidden_dim, act_dim)
        
    def predict_reward(self, action, pre_state, post_state):
        """provides a scalar curiosity reward. This reward is the mse between our embedding
        prediction.
        
        Also computes losses and return it with reward, used for loss computation
        
        """
        pre_embed = self.encoder(pre_state)
        post_embed = self.encoder(post_state)
        action_base = Variable(torch.zeros(self.action_dim))
        # get one hot encoding action
        action_base[action] = 1.0
        # concatenate pre_embedding and action to feed to our forward model
        f_model_feed = torch.cat([pre_embed.view(1,-1), action_base.view(1,-1)],dim=1)
        f_model_pre_embed = self.f_model(f_model_feed)
        
        # compute inverse model action, need concatenate pre_embedding and post embedding
        i_model_feed = torch.cat([pre_embed.view(1,-1), post_embed.view(1,-1)], dim=1)
        
        # compute action loss, for discrete action space this is the negative_log_loss/cross entorpy loss
        i_model_action_pred = self.i_model(i_model_feed) # note this is unnormalized prediction
        action_pred = F.softmax(i_model_action_pred, dim=1) # softmax action distribution
        action_pred_log = F.log_softmax(i_model_action_pred, dim=1) # stable log_softmax
        action_loss = F.nll_loss(action_pred, action) # Negative log-likelihood loss
        
        # compute the embedding prediction loss, this is the MSE loss between our forward model
        # prediction of embedding versus actual embedding obtained
        # note in this case we're holding the post embedding as a constant, we can't have both
        # prediction and target in our loss to be tunable.
        embedding_loss = F.smooth_l1_loss(f_model_pre_embed, Variable(post_embed.data))
        
        
        
        
        return embedding_loss, action_loss
        
        
        
        
        

In [69]:
env = gym.make("MountainCar-v0")
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
embedded_dim = 1
icm = ICM(obs_dim, embedded_dim, act_dim, 10)
sample_obs = Variable(torch.FloatTensor(env.observation_space.sample()).view(1,-1))
sample_action = Variable(torch.LongTensor([env.action_space.sample()]))

icm.predict_reward(sample_action, sample_obs, sample_obs)

[2018-04-11 14:47:04,918] Making new env: MountainCar-v0


(Variable containing:
 1.00000e-03 *
   7.9127
 [torch.FloatTensor of size 1], Variable containing:
 -0.3067
 [torch.FloatTensor of size 1])

In [63]:
F.smooth_l1_loss?