# 1. Trading Agent
Class implementing a policy-gradient trading agent, which searches for a policy that maximizes the mean cumulative reward of the portfolio. The agent is a neural network (NN) whose output, the action, is the portfolio weight vector $\vec{w}$.

The class builds the NN agent and evaluates the portfolio equations needed to compute the portfolio value and the reward that the agent maximizes.

## 1.1 Portfolio features:
The portfolio features are the tensors that characterize the portfolio:

* Relative price tensor $Y_t$: composed of the relative price vectors of the 3 features (relative price = closing price ($v_t$) / opening price ($v_{t-1}$)). Its shape is $[Batches, f, m]$, where $f$ is the number of features and $m$ the number of non-cash assets.

* Relative price vector ($y_t$): Relative prices of the closing price (feature 0). Shape $[Batches, 1+m]$. It is a rank 2 tensor, and it can be seen as a vector (rank 1 tensor) for each batch ($n_b$ samples/periods).

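* Future_weight_vec ($w'_t$): the portfolio weight vector at the end of the trading period, after prices have moved. It is the element-wise product of $\vec{y}_t$ and $\vec{w}_{t-1}$, normalized by their dot product (index 0 is cash):
$$w'_t = \frac{\vec{y}_t \odot \vec{w}_{t-1}}{\vec{y}_t \cdot \vec{w}_{t-1}} = \frac{\vec{y}_t \odot \vec{w}_{t-1}}{\sum_{i=0}^m y_{t,i}\, w_{t-1,i}};\; \mathrm{Shape}\; [Batches, 1+m]$$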


* Logarithmic rate of return ($r_t$) or immediate reward: $r_t = \log(\mu_t \vec{y}_t \cdot \vec{w}_{t-1})$, where $\mu_t$ is the transaction remainder factor. Shape $[Batches, 1]$

* Portfolio value vector (__pv_vector): portfolio value for each batch (the value of the portfolio after computing the action with $n_b$ samples)
    - $[Batch]$ rank 1 tensor (vector): there is one value per batch.
    - Portfolio value ($P_f$): the value of the portfolio after $\Delta t = t_f-t_0$ periods. It is a scalar (shape $[]$):
$$P_f = P_0 \exp \left( \sum _{t=1} ^{t_f + 1} r_t \right) = P_0 \prod _{t=1} ^{t_f+1} \mu_t \vec{y}_t \cdot \vec{w}_{t-1}$$

* Cumulative reward function ($R$): the quantity to be maximized. It is the average logarithmic cumulative return:
$$R(s_1, a_1, \dots, s_{t_f}, a_{t_f}, s_{t_f+1}) = \frac{1}{t_f}\log \left(\frac{P_f}{P_0}\right) = \frac{1}{t_f}\sum _{t=1}^{t_f+1}\log (\mu_t\vec{y}_t\cdot \vec{w}_{t-1}) = \frac{1}{t_f}\sum_{t=1}^{t_f+1}r_t; \; \mathrm{Shape}\; [Batches,1]$$
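
The portfolio equations above can be checked numerically with a minimal pure-Python sketch. The price data, the fixed weight vector, and the constant value of $\mu_t$ are hypothetical, chosen only for illustration, and the average is taken over the number of simulated periods rather than over the exact summation limits of the formula.

```python
import math

def portfolio_step(y, w_prev, mu):
    """One trading period: returns the future weight vector w'_t and the reward r_t."""
    growth = sum(y_i * w_i for y_i, w_i in zip(y, w_prev))          # y_t . w_{t-1}
    w_future = [y_i * w_i / growth for y_i, w_i in zip(y, w_prev)]  # (y_t (.) w_{t-1}) / (y_t . w_{t-1})
    r_t = math.log(mu * growth)                                     # r_t = log(mu_t * y_t . w_{t-1})
    return w_future, r_t

# Hypothetical data: cash (index 0) plus two assets, three periods.
relative_prices = [
    [1.0, 1.10, 0.90],
    [1.0, 0.95, 1.05],
    [1.0, 1.02, 1.00],
]
w_prev = [0.2, 0.5, 0.3]  # w_{t-1}, held fixed here for simplicity (assumption)
mu = 0.9975               # transaction remainder factor, assumed constant
P_0 = 1.0

rewards = []
for y in relative_prices:
    w_future, r_t = portfolio_step(y, w_prev, mu)
    rewards.append(r_t)

P_f = P_0 * math.exp(sum(rewards))  # final portfolio value
R = sum(rewards) / len(rewards)     # average logarithmic return
print(w_future, P_f, R)
```

The future weight vector always sums to 1, and `R` equals $\log(P_f/P_0)$ divided by the number of periods, which is the identity that the reward definition relies on.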


## 1.2 Training the trading agent (the NN)

1. Create the agent object, an instance of the CNN class.
2. Feed the batch input tensor, the relative price tensor $Y_t$, the previous weight vector, and the number of batches ($N_b$) into the NN.
3. The policy network is trained against $N_b$ randomly chosen mini-batches drawn from the set of $n$ previous periods. Each mini-batch contains $n_b$ samples/periods of data.
4. A batch starting at period $t_b \le t_0 - n_b$ is picked with a geometrically distributed probability (ReplayBuffer.ipynb).
5. Prices inside a batch must be in time order.
6. __set_loss_function: sets the loss function, also called the objective function. It is the function the agent minimizes in order to update the parameters. It equals minus the reward, because the agent seeks to maximize the reward.
7. init_train: defines the optimizer used to minimize the loss (train_step).
8. train: calls evaluate_tensors, which checks that the tensors fed into the NN contain no NaN values, and trains the NN with the previously defined optimizer.
9. decide_by_history: once the NN outputs the action, this function runs the computational graph defined in the constructor and returns the updated portfolio features.
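
Steps 4 and 5 can be sketched in plain Python as follows. This is not the code from ReplayBuffer.ipynb: the function name `sample_batch_start`, the parameter `beta`, and the numeric values are hypothetical, and the geometric law is truncated by resampling so that the start index never becomes negative.

```python
import random

def sample_batch_start(t0, n_b, beta, rng):
    """Draw a start period t_b <= t0 - n_b, favouring recent periods.

    The offset k from the latest admissible start follows a geometric law,
    P(k) proportional to beta * (1 - beta)**k, truncated to 0 <= k <= t0 - n_b.
    """
    latest_start = t0 - n_b
    if latest_start < 0:
        raise ValueError("not enough history for a batch of n_b periods")
    while True:
        k = 0
        while rng.random() >= beta:  # count failures before the first success
            k += 1
        if k <= latest_start:        # resample if the draw falls before period 0
            return latest_start - k

rng = random.Random(0)               # seeded only to make the sketch reproducible
t0, n_b, beta = 1000, 50, 0.05       # hypothetical values
t_b = sample_batch_start(t0, n_b, beta, rng)
batch_periods = list(range(t_b, t_b + n_b))  # consecutive periods, in time order
```

Because `batch_periods` is a contiguous range, the prices inside the batch stay in time order, as step 5 requires.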


In [None]:
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.layers, tf.contrib)


class DPG_LogReturn:

    def __init__(self, feature_number, num_assets, window_size, sess, optimizer, 
                 trading_cost=trading_cost, interest_rate=interest_rate,):

        # parameters
        self.trading_cost = trading_cost
        self.interest_rate = interest_rate
        self.feature_number = feature_number
        self.window = window_size
        self.num_assets = num_assets
                    
            
        with tf.variable_scope("Inputs"):

            # Tensor of the prices: [Batches, features, assets, window]
            self.X_t = tf.placeholder(tf.float32, [None, self.feature_number, self.num_assets, self.window])
            # Weights at the previous time step: this is the w_t' of the previous period
            self.W_previous = tf.placeholder(tf.float32, [None, self.num_assets + 1])
            # Portfolio value at the previous time step
            self.pf_value_previous = tf.placeholder(tf.float32, [None, 1])
            # Vector of y_t = Open(t+1)/Open(t) fluctuation of prices during session t
            self.dailyReturn_t = tf.placeholder(tf.float32, [None, self.num_assets])
            

        with tf.variable_scope("Policy_Model"):
            # batch size (first dimension of X_t)
            shape_X_t = tf.shape(self.X_t)[0]
           
            with tf.variable_scope("Conv1"):
                # first layer on the X_t tensor
                self.conv1 = tf.layers.conv2d(
                    inputs = tf.transpose(self.X_t, perm= [0, 2, 3, 1]),  # [Batches, assets, periods, features]
                    activation = tf.nn.relu,
                    filters = 2,  
                    strides = (1, 1),
                    kernel_size = (1,2),
                    padding = 'valid')
                
                
            with tf.variable_scope("Conv2"):
                # Remaining window width after Conv1 (self.conv1 is [Batches, assets, window-1, 2])
                filter_cols = int(self.conv1.get_shape()[2])
                # feature maps
                self.conv2 = tf.layers.conv2d(
                    inputs = self.conv1,
                    activation = tf.nn.relu,
                    filters = 10,  # 10 feature maps
                    strides=(1, 1),
                    kernel_size=(1, filter_cols), # To compute just one asset at a time
                    padding='valid')

            
            with tf.variable_scope("Conv3"):
                width = self.conv2.get_shape()[2]     # Window number
                height = self.conv2.get_shape()[1]    # Asset number
                features = self.conv2.get_shape()[3]  # Feature number
                network = tf.reshape(self.conv2, [shape_X_t, int(height), 1, int(width*features)])
                # Previous weights of the non-cash assets only (column 0 of W_previous is cash)
                w = tf.reshape(self.W_previous[:, 1:], [-1, int(height), 1, 1])  # [Batches, assets, 1, 1]
                network = tf.concat([network, w], axis=3)                        # [Batches, assets, 1, network[3]+1]
                network = tf.layers.conv2d(inputs = network,
                                                activation = tf.nn.relu,
                                                filters = 1,
                                                strides = (1, 1),
                                                kernel_size=(1, 1),
                                                kernel_regularizer=tf.contrib.layers.l2_regularizer(5e-8),
                                                padding='valid')
                
                network = network[:, :, 0, 0]
                # bias [1,1]
                cash_bias = tf.get_variable("cash_bias", [1, 1], dtype=tf.float32, initializer=tf.zeros_initializer)
                # self.add_layer_to_dict(layer["type"], network, weights=False)
                cash_bias = tf.tile(cash_bias, [shape_X_t, 1])  # Repeats the bias once per sample in the batch
                network = tf.concat([cash_bias, network], 1)        # concatenates adding cols (the number of rows does not change)
                self.voting = network  # voting scores
                self.action = tf.nn.softmax(network)
                
                
            
            with tf.variable_scope("LogReward"):
                
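                # FUTURE RELATIVE PRICE, RANK-2 TENSOR y: relative prices considering just the closing prices (feature 0) at t
                # Cash column of y_t: assumed to grow by the interest rate (1 + interest_rate) each period
                constant_return = tf.constant(1.0 + self.interest_rate, shape=[1, 1], dtype=tf.float32)
                cash_return = tf.tile(constant_return, [shape_X_t, 1])
                y_t = tf.concat([cash_return, self.dailyReturn_t], axis=1)
                self.__y = y_t  # kept so that the loss functions in set_loss_function can use it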

                # PORTFOLIO VALUE VECTOR: Actually here we are computing P_t/P_{t-1} = exp(r_t)
                # [Batches, ]
                # pv_vector: portfolio value for each batch (portfolio value after computing the action with n_b samples)
                # The operations are computed over all the periods (samples) in a batch
                self.pv_vector = tf.reduce_sum(self.action * y_t, axis=1) * self.compute_mu()

                # PORTFOLIO VALUE: P(t)/P(t-bs) = exp(sum_{t-bs}^{t} r_t) = prod_{t-bs}^{t} mu_t * (w_t . y_t)
                # Result of multiplying all the elements of pv_vector (returns for each sample)
                # Equation of the portfolio value explained above 
                self.portfolio_value = tf.reduce_prod(self.pv_vector) 
                self.mean = tf.reduce_mean(self.pv_vector)            # Mean of the portfolio value vector (through all the batches)
                self.reward = tf.reduce_mean(tf.log(self.pv_vector))  # Cumulated return (eq 22)
                self.loss_function = self.set_loss_function()         # Loss function to train the NN

                ## Evaluate performance
                self.standard_deviation = tf.sqrt(tf.reduce_mean((self.pv_vector - self.mean) ** 2))
                self.sharp_ratio = (self.mean - 1) / self.standard_deviation
                        
                
        # Objective function: the loss is already -reward (plus regularization), so minimizing it maximizes the reward
        self.train_op = optimizer.minimize(self.loss_function)
        
        # some bookkeeping
        self.optimizer = optimizer
        self.sess = sess
        
    # Transaction remainder factor
    def compute_mu(self):
        c = self.trading_cost
        # Starts in [:,1:] to not consider the cash in the calculations
        return 1 - tf.reduce_sum(tf.abs(self.action[:, 1:] - self.W_previous[:, 1:]), axis=1) * c  # [Batches,]
   
    
    # Define the loss function that the agent minimizes (so as to maximize the reward)
    def set_loss_function(self):
        
        # Minimizes minus the cumulated reward (maximizes the reward)
        def loss_function4():
            return -tf.reduce_mean(tf.log(tf.reduce_sum(self.action[:] * self.__y,
                                                        reduction_indices=[1])))
        # Adds regularization (LAMBDA, the regularization strength, is assumed to be defined at module level)
        def loss_function5():
            return -tf.reduce_mean(tf.log(tf.reduce_sum(self.action * self.__y, reduction_indices=[1]))) + \
                   LAMBDA * tf.reduce_mean(tf.reduce_sum(-tf.log(1 + 1e-6 - self.action), reduction_indices=[1]))

        # Minimizes minus the portfolio value (maximizes the portfolio value)
        def loss_function6():
            return -tf.reduce_mean(tf.log(self.pv_vector))

        # Adds regularization
        def loss_function7():
            return -tf.reduce_mean(tf.log(self.pv_vector)) + \
                   LAMBDA * tf.reduce_mean(tf.reduce_sum(-tf.log(1 + 1e-6 - self.action), reduction_indices=[1]))

        # Considers the differences between previous weight vector and the computed one times commission ratio
        def with_last_w():
            return -tf.reduce_mean(tf.log(tf.reduce_sum(self.action[:] * self.__y, reduction_indices=[1])
                                          -tf.reduce_sum(tf.abs(self.action[:, 1:] - self.W_previous[:, 1:])
                                                         *self.trading_cost, reduction_indices=[1])))

        loss_function = loss_function5
        loss_tensor = loss_function()
        regularization_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
        if regularization_losses:
            for regularization_loss in regularization_losses:
                loss_tensor += regularization_loss
        return loss_tensor

    # Compute the agent's action
    def compute_W(self, X_t_, W_previous_):
        return self.sess.run(tf.squeeze(self.action), feed_dict={self.X_t: X_t_, self.W_previous: W_previous_})

    # Train the NN maximizing the reward: each input is a batch of the corresponding values
    def train(self, X_t_, W_previous_, pf_value_previous_, dailyReturn_t_):
     
        self.sess.run(self.train_op, feed_dict={self.X_t: X_t_,                             
                                                self.W_previous: W_previous_,
                                                self.pf_value_previous: pf_value_previous_,
                                                self.dailyReturn_t: dailyReturn_t_})
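

The batch statistics computed in the LogReward scope (portfolio value, mean, reward, standard deviation and the Sharpe-like ratio) can be reproduced outside TensorFlow. The sketch below is plain Python with a hypothetical `pv_vector`; the function name `batch_statistics` is not part of the class and is introduced only for illustration.

```python
import math

def batch_statistics(pv_vector):
    """Mirror the graph quantities for a list of per-sample values P_t / P_{t-1}."""
    n = len(pv_vector)
    portfolio_value = 1.0
    for p in pv_vector:                               # like tf.reduce_prod(pv_vector)
        portfolio_value *= p
    mean = sum(pv_vector) / n                         # like tf.reduce_mean(pv_vector)
    reward = sum(math.log(p) for p in pv_vector) / n  # like tf.reduce_mean(tf.log(pv_vector))
    std = math.sqrt(sum((p - mean) ** 2 for p in pv_vector) / n)
    sharpe = (mean - 1.0) / std                       # undefined when all values are equal
    return portfolio_value, mean, reward, std, sharpe

pv_vector = [1.02, 0.99, 1.01]                        # hypothetical per-sample returns
portfolio_value, mean, reward, std, sharpe = batch_statistics(pv_vector)
```

The reward equals the logarithm of the portfolio value divided by the batch length, so maximizing the mean log return and maximizing the final portfolio value over the batch are the same objective.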