# Introduction

This Jupyter notebook is for participants who have signed up for the competition at NTU.

The goal of the competition is to maximize the cumulative reward given by the environment.
The key difference between RL and traditional supervised learning is that the objective (here, the cumulative reward) cannot be optimized directly, so the networks (agents) are updated somewhat differently.
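
To make "cumulative reward" concrete, here is a minimal sketch of how a discounted sum of rewards can be computed for one episode. The function name `discounted_return` and the example rewards are made up for illustration. Whether the environment itself discounts rewards is not stated here; `gamma=0.99` only mirrors the baseline's default hyperparameter.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    # Walk backwards so that each step is one multiply-add:
    # G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


episode_rewards = [1.0, 0.0, 2.0]  # made-up rewards for a three-step episode
print(discounted_return(episode_rewards, gamma=1.0))  # plain, undiscounted sum
print(discounted_return(episode_rewards, gamma=0.5))
```

With `gamma=1.0` the result is the plain sum of the rewards; a smaller `gamma` gives less weight to rewards that arrive later.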

We either use policy-based methods, where the probability of taking an action and the reward it earns are combined to form the update, or value-based methods, where a judge (a value function) estimates how good a single state or action is.
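
The two families can be contrasted on a toy problem. The sketch below is not part of the baseline: it uses a hypothetical two-armed bandit with fixed rewards (`ARM_REWARD`), a REINFORCE-style update for the policy-based learner and a running value estimate for the value-based learner. The learning rate and step count are arbitrary.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


ARM_REWARD = [0.0, 1.0]  # hypothetical deterministic reward of each arm
rng = random.Random(0)

# Policy-based: raise the log-probability of an action in proportion to its reward.
logits = [0.0, 0.0]
for _ in range(200):
    probs = softmax(logits)
    a = rng.choices([0, 1], weights=probs)[0]
    r = ARM_REWARD[a]
    # d log pi(a) / d logit_k = 1[k == a] - probs[k]
    for k in range(2):
        logits[k] += 0.1 * r * ((1.0 if k == a else 0.0) - probs[k])
policy_probs = softmax(logits)

# Value-based: estimate how good each action is, then act greedily on the estimates.
q = [0.0, 0.0]
for _ in range(200):
    a = rng.randrange(2)  # explore uniformly
    q[a] += 0.1 * (ARM_REWARD[a] - q[a])  # move the estimate toward the observed reward
greedy_action = q.index(max(q))

print(policy_probs, q, greedy_action)
```

Both learners end up preferring the arm with the higher reward; the policy-based one stores that preference in its action probabilities, the value-based one in its value estimates.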

In this competition, I provide an RL environment that an agent can interact with, together with an RL baseline. I will also walk through the core code of the baseline and show how you can build your own agent. You can either build your own agent or tune the hyperparameters of the baseline to achieve the highest cumulative reward.

# Installation

We provide a [video tutorial](https://www.youtube.com/watch?v=7rtqFT9I4uo&t=12s) that shows how to install this project so that you can take part in the competition.

# RL environment & baselines

First, we need to add the project to the system path, because a Jupyter notebook resolves imports a little differently from a .py script.
Then we import the modules we need.

In [8]:
import sys
sys.path.append("..")  # make the project root importable from the notebook
import argparse
import os
import random

import numpy as np
import pandas as pd
import torch
import torch.nn as nn

from agent.EIIE.model import EIIE_con, EIIE_lstm, EIIE_rnn, EIIE_critirc
from agent.EIIE.util import *
from env.PM.portfolio_for_EIIE import Tradingenv
from agent.EIIE.trader import trader


Below you can adjust the default hyperparameters.

In [9]:
parser = argparse.ArgumentParser()
parser.add_argument("--random_seed",
                    type=int,
                    default=12345,
                    help="the random seed")
parser.add_argument(
    "--env_config_path",
    type=str,
    default="config/input_config/env/portfolio/portfolio_for_EIIE/",
    help="the path to the environment configuration")
parser.add_argument(
    "--net_type",
    choices=["conv", "lstm", "rnn"],
    default="conv",
    help="the name of the model",
)
parser.add_argument(
    "--num_hidden_nodes",
    type=int,
    default=32,
    help="the number of hidden nodes in lstm or rnn",
)
parser.add_argument(
    "--num_out_channel",
    type=int,
    default=2,
    help="the number of output channels",
)
parser.add_argument(
    "--gamma",
    type=float,
    default=0.99,
    help="the gamma for DPG",
)
parser.add_argument(
    "--model_path",
    type=str,
    default="result/EIIE/trained_model",
    help="the path for trained model",
)
parser.add_argument(
    "--result_path",
    type=str,
    default="result/EIIE/test_result",
    help="the path for test result",
)
parser.add_argument(
    "--num_epoch",
    type=int,
    default=10,
    help="the number of epoch we train",
)


_StoreAction(option_strings=['--num_epoch'], dest='num_epoch', nargs=None, const=None, default=10, type=<class 'int'>, choices=None, help='the number of epoch we train', metavar=None)

In [10]:
args = parser.parse_args(args=[])
agent=trader(args)
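
Because the arguments are parsed from an explicit list, hyperparameters can be overridden inside the notebook without touching the defaults. The sketch below rebuilds a small parser with three of the options above and shows the pattern; it assumes that `trader` reads these fields from `args`, which is not shown here.

```python
import argparse

# A reduced copy of the parser above, kept small for illustration.
parser = argparse.ArgumentParser()
parser.add_argument("--net_type", choices=["conv", "lstm", "rnn"], default="conv")
parser.add_argument("--num_epoch", type=int, default=10)
parser.add_argument("--gamma", type=float, default=0.99)

# An empty list keeps every default (and ignores the notebook's own argv).
defaults = parser.parse_args(args=[])

# Pass flag/value pairs as strings to override selected options.
custom = parser.parse_args(args=["--net_type", "lstm", "--num_epoch", "20"])

print(defaults.net_type, defaults.num_epoch, defaults.gamma)
print(custom.net_type, custom.num_epoch, custom.gamma)
```

Options that are not listed keep their default values, so `custom.gamma` is still 0.99.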

In [11]:
agent.train_with_valid()
agent.test()


Next, we look at the core code of the agent.

## RL environment

In [12]:
train_env_instance=agent.train_env_instance
valid_env_instance=agent.valid_env_instance
test_env_instance=agent.test_env_instance


# Settings for the RL environment

There are three RL environments in total. If you want to build and train your own agent, train it in train_env_instance, pick the best model using valid_env_instance, and backtest it in test_env_instance.
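
The "pick the best model" step amounts to comparing the total validation reward of each saved checkpoint. A minimal sketch, with a hypothetical helper `pick_best_epoch` and made-up reward values:

```python
def pick_best_epoch(valid_rewards):
    """Return the index of the epoch with the highest validation reward.

    Ties are resolved in favor of the earliest epoch.
    """
    best_epoch = 0
    for epoch, reward in enumerate(valid_rewards):
        if reward > valid_rewards[best_epoch]:
            best_epoch = epoch
    return best_epoch


# Made-up total validation rewards, one per training epoch.
valid_rewards = [0.12, 0.31, 0.27, 0.31]
best = pick_best_epoch(valid_rewards)
print(best)
```

Only the checkpoint chosen this way should then be run on the test environment, so that the test result is not used for model selection.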

The action space of the environment is a 30-dimensional numpy array, which holds the scores for cash plus 29 stocks.
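
One common way to turn raw scores into portfolio weights is a softmax, which is also the last step of the baseline actor networks. The sketch below is an illustration only: how the environment itself converts scores into holdings is not described here, and `scores_to_weights` is a hypothetical helper.

```python
import math


def scores_to_weights(scores):
    """Softmax: positive weights that sum to 1, preserving the order of the scores."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.0] * 30  # one score per asset: cash + 29 stocks
scores[0] = 1.0      # give one asset a higher score
weights = scores_to_weights(scores)
print(len(weights), sum(weights), weights[0], weights[1])
```

The asset with the higher score receives the largest weight, and every asset still keeps a strictly positive share.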

The observation space of the environment is a numpy array of shape (29, 10, 11), where the dimensions are the number of tickers, the number of days in the window, and the number of features, respectively. In other words, there are 29 stocks, each state contains 10 days of daily price information, and each day's price information consists of 11 technical indicators.
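
The sketch below shows how the three axes are indexed, using a plain nested list filled with zeros as a stand-in for a real observation. It assumes that the last entry along the day axis is the most recent day, which is not stated above.

```python
N_STOCKS, N_DAYS, N_FEATURES = 29, 10, 11

# Stand-in observation with the same shape as the environment's state.
obs = [[[0.0 for _ in range(N_FEATURES)] for _ in range(N_DAYS)]
       for _ in range(N_STOCKS)]


def shape(nested):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


stock_window = obs[3]                            # 10 days x 11 features for stock 3
latest_day = obs[3][-1]                          # 11 features (assumed: most recent day)
one_feature_history = [day[0] for day in obs[3]] # feature 0 of stock 3 over 10 days

print(shape(obs), len(latest_day), len(one_feature_history))
```

With a numpy array the same slices are written `obs[3]`, `obs[3, -1]` and `obs[3, :, 0]`.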

In [15]:
# Resetting the environment clears all the history and starts from the beginning.
# It returns the initial state.
# Here is an example of applying a random action to train_env_instance.
s=train_env_instance.reset()
action=np.random.rand(30)  # one score per asset: cash + 29 stocks
done=False
while not done:
    old_state = s
    s, reward, done, _ =train_env_instance.step(action)
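
The same reset/step loop can be tried without the project installed. The sketch below uses `ToyEnv`, a hypothetical stand-in that only imitates the call pattern of the real environment; its states and rewards are placeholders and have nothing to do with the trading logic.

```python
import random


class ToyEnv:
    """Hypothetical environment with the same reset/step call pattern."""

    def __init__(self, n_steps=5, n_actions=30):
        self.n_steps = n_steps
        self.n_actions = n_actions
        self.t = 0

    def reset(self):
        # Clear the history and return the initial state.
        self.t = 0
        return [0.0] * self.n_actions

    def step(self, action):
        if len(action) != self.n_actions:
            raise ValueError("action must have %d entries" % self.n_actions)
        self.t += 1
        reward = sum(action) / len(action)  # placeholder reward
        done = self.t >= self.n_steps
        next_state = [float(self.t)] * self.n_actions
        return next_state, reward, done, {}


rng = random.Random(0)
env = ToyEnv()
state = env.reset()
done = False
total_reward = 0.0
steps = 0
while not done:
    action = [rng.random() for _ in range(env.n_actions)]  # fresh random scores
    state, reward, done, _ = env.step(action)
    total_reward += reward  # accumulate the quantity we want to maximize
    steps += 1

print(steps, total_reward)
```

Replacing the random action with the output of a policy network is the only change needed to evaluate an agent in this loop.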


Next is the core code of EIIE. I will break it down so that you can understand the training process, which makes it easier to build your own agent.

In [None]:
# Define the networks.
# As in supervised learning, we need a network to fit something. EIIE here is an actor-critic RL model:
# an actor generates the policy and a critic judges whether the state (and the action taken in it) is good.

import torch
from torch import nn
import numpy as np


class EIIE_con(torch.nn.Module):
    def __init__(self, in_channels, out_channels, length, kernel_size=3):
        super(EIIE_con, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.length = length
        self.act = torch.nn.ReLU(inplace=False)
        self.con1d = nn.Conv1d(self.in_channels,
                               self.out_channels,
                               kernel_size=self.kernel_size)
        self.con2d = nn.Conv1d(self.out_channels,
                               1,
                               kernel_size=self.length - self.kernel_size + 1)
        self.con3d = nn.Conv1d(1, 1, kernel_size=1)
        self.para = torch.nn.Parameter(torch.ones(1).requires_grad_())

    def forward(self, x):
        x = x.permute(0, 2, 1)
        x = self.con1d(x)
        x = self.act(x)
        x = self.con2d(x)
        x = self.act(x)
        x = self.con3d(x)
        x = x.view(-1)

        # self.linear2 = nn.Linear(len(x), len(x) + 1)
        # x = self.linear2(x)
        x = torch.cat((x, self.para), dim=0)
        x = torch.softmax(x, dim=0)

        return x


class EIIE_lstm(nn.Module):
    def __init__(self, n_features, layer_num, n_hidden):
        super(EIIE_lstm, self).__init__()
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.n_layers = layer_num
        self.lstm = nn.LSTM(input_size=n_features,
                            hidden_size=self.n_hidden,
                            num_layers=self.n_layers,
                            batch_first=True)
        self.linear = nn.Linear(self.n_hidden, 1)
        self.con3d = nn.Conv1d(1, 1, kernel_size=1)
        self.para = torch.nn.Parameter(torch.ones(1).requires_grad_())

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        x = self.linear(lstm_out[:, -1, :]).view(-1, 1, 1)
        x = self.con3d(x)
        x = x.view(-1)
        x = torch.cat((x, self.para), dim=0)
        x = torch.softmax(x, dim=0)
        return x


class EIIE_rnn(nn.Module):
    def __init__(self, n_features, layer_num, n_hidden):
        super(EIIE_rnn, self).__init__()
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.n_layers = layer_num
        self.rnn = nn.RNN(input_size=n_features,
                          hidden_size=self.n_hidden,
                          num_layers=self.n_layers,
                          batch_first=True)
        self.linear = nn.Linear(self.n_hidden, 1)
        self.con3d = nn.Conv1d(1, 1, kernel_size=1)
        self.para = torch.nn.Parameter(torch.ones(1).requires_grad_())

    def forward(self, x):
        lstm_out, _ = self.rnn(x)
        x = self.linear(lstm_out[:, -1, :]).view(-1, 1, 1)
        x = self.con3d(x)
        x = x.view(-1)
        x = torch.cat((x, self.para), dim=0)
        x = torch.softmax(x, dim=0)
        return x


class EIIE_critirc(nn.Module):
    def __init__(self, n_features, layer_num, n_hidden, n_assets=29):
        super(EIIE_critirc, self).__init__()
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.n_layers = layer_num
        self.lstm = nn.LSTM(input_size=n_features,
                            hidden_size=self.n_hidden,
                            num_layers=self.n_layers,
                            batch_first=True)
        self.linear = nn.Linear(self.n_hidden, 1)
        self.con3d = nn.Conv1d(1, 1, kernel_size=1)
        self.para = torch.nn.Parameter(torch.ones(1).requires_grad_())
        # The output layer is built once here so that the optimizer trains its weights.
        # Creating it inside forward() would re-initialize it on every call.
        # Input size: n_assets state scores + 1 cash term + (n_assets + 1) action entries.
        # n_assets=29 is an assumption taken from the observation shape (29, 10, 11).
        self.linear2 = nn.Linear(2 * (n_assets + 1), 1)

    def forward(self, x, a):
        lstm_out, _ = self.lstm(x)
        x = self.linear(lstm_out[:, -1, :]).view(-1, 1, 1)
        x = self.con3d(x)
        x = x.view(-1)
        # Concatenate the per-stock scores, the cash term and the action.
        x = torch.cat((x, self.para, a), dim=0)
        x = torch.nn.ReLU(inplace=False)(x)
        x = self.linear2(x)
        return x
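
The layer sizes in EIIE_con can be checked by hand. The sketch below applies the standard output-length formula for a 1-D convolution to the settings used in this notebook (a 10-day window, kernel size 3, 29 stocks); `conv1d_out_len` is a helper written for this illustration, not a library function.

```python
def conv1d_out_len(length_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution along the time axis."""
    return (length_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


day_length, kernel_size, n_stocks = 10, 3, 29

# con1d: kernel_size=3 shortens the 10-day window.
after_con1d = conv1d_out_len(day_length, kernel_size)
# con2d: its kernel spans everything that is left, giving one value per stock.
after_con2d = conv1d_out_len(after_con1d, day_length - kernel_size + 1)
# con3d: kernel_size=1 keeps the length unchanged.
after_con3d = conv1d_out_len(after_con2d, 1)

# One score per stock plus the extra learnable parameter appended before the softmax.
n_outputs = n_stocks * after_con3d + 1

print(after_con1d, after_con2d, after_con3d, n_outputs)
```

The final count matches the 30-dimensional action space: 29 stock scores plus one additional entry.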

In [16]:
# Here is the code for the trader; the key part is the learn function.
import os
import random
class trader:
    def __init__(self):
        self.num_epoch = 10
        self.GPU_IN_USE = torch.cuda.is_available()
        # The loops below build CPU tensors and call .numpy() directly, so the device stays on the CPU.
        self.device = torch.device('cpu')
        self.model_path = "result/EIIE/trained_model"
        if not os.path.exists(self.model_path):
            os.makedirs(self.model_path)
        self.result_path = "result/EIIE/test_result"
        if not os.path.exists(self.result_path):
            os.makedirs(self.result_path)
        self.train_env_instance = train_env_instance
        self.valid_env_instance = valid_env_instance
        self.test_env_instance = test_env_instance
        self.day_length = 10
        self.input_channel = 11
        self.net = EIIE_con(self.input_channel, 2,
                           self.day_length)
        
        self.critic = EIIE_critirc(self.input_channel, 1,
                                  32)
        self.test_action_memory = []  # to store the
        self.optimizer_actor = torch.optim.Adam(self.net.parameters(), lr=1e-4)
        self.optimizer_critic = torch.optim.Adam(self.critic.parameters(),
                                                 lr=1e-4)
        self.memory_counter = 0
        self.memory_capacity = 1000
        self.s_memory = []
        self.a_memory = []
        self.r_memory = []
        self.sn_memory = []
        self.policy_update_frequency = 500
        self.critic_learn_time = 0
        self.gamma = 0.99
        self.mse_loss = nn.MSELoss()
        self.net = self.net.to(self.device)
        self.critic = self.critic.to(self.device)

    def store_transition(
        self,
        s,
        a,
        r,
        s_,
    ):  # define the memory storage function (the input here is one transition)

        self.memory_counter = self.memory_counter + 1
        if self.memory_counter < self.memory_capacity:
            self.s_memory.append(s)
            self.a_memory.append(a)
            self.r_memory.append(r)
            self.sn_memory.append(s_)
        else:
            number = self.memory_counter % self.memory_capacity
            self.s_memory[number - 1] = s
            self.a_memory[number - 1] = a
            self.r_memory[number - 1] = r
            self.sn_memory[number - 1] = s_

    def compute_single_action(self, state):
        state = torch.from_numpy(state).float().to(self.device)
        action = self.net(state)
        action = action.detach().cpu().numpy()
        return action

    def learn(self):
        # This is the core of the trader: it shows how the updates are computed.
        # We first need some stored transitions (s, a, r, s_).
        length = len(self.s_memory)
        out1 = random.sample(range(length), int(length / 10))
        # random sample
        s_learn = []
        a_learn = []
        r_learn = []
        sn_learn = []
        for number in out1:
            s_learn.append(self.s_memory[number])
            a_learn.append(self.a_memory[number])
            r_learn.append(self.r_memory[number])
            sn_learn.append(self.sn_memory[number])
        self.critic_learn_time = self.critic_learn_time + 1
        # For the transitions we have sampled, we update both the actor and the critic.
        # For the actor, we compute the action and let the critic score it;
        # the actor is updated so that the actions it chooses earn a higher score from the critic.
        # For the critic, we simply minimize the TD error, since the problem is modeled as an MDP.

        for bs, ba, br, bs_ in zip(s_learn, a_learn, r_learn, sn_learn):
            #update actor
            a = self.net(bs)
            q = self.critic(bs, a)
            a_loss = -torch.mean(q)
            self.optimizer_actor.zero_grad()
            a_loss.backward(retain_graph=True)
            self.optimizer_actor.step()
            #update critic
            a_ = self.net(bs_)
            q_ = self.critic(bs_, a_.detach())
            q_target = br + self.gamma * q_
            q_eval = self.critic(bs, ba.detach())
            # print(q_eval)
            # print(q_target)
            td_error = self.mse_loss(q_target.detach(), q_eval)
            # print(td_error)
            self.optimizer_critic.zero_grad()
            td_error.backward()
            self.optimizer_critic.step()

    def train_with_valid(self):
        rewards_list = []
        for i in range(self.num_epoch):
            j = 0
            done = False
            s = self.train_env_instance.reset()
            while not done:

                old_state = s
                action = self.net(torch.from_numpy(s).float())
                s, reward, done, _ = self.train_env_instance.step(
                    action.detach().numpy())
                self.store_transition(
                    torch.from_numpy(old_state).float().to(self.device),
                    action,
                    torch.tensor(reward).float().to(self.device),
                    torch.from_numpy(s).float().to(self.device))
                j = j + 1
                if j % 200 == 1:

                    self.learn()
            all_model_path = self.model_path + "/all_model/"
            best_model_path = self.model_path + "/best_model/"
            if not os.path.exists(all_model_path):
                os.makedirs(all_model_path)
            if not os.path.exists(best_model_path):
                os.makedirs(best_model_path)
            torch.save(self.net,
                       all_model_path + "actor_num_epoch_{}.pth".format(i))
            torch.save(self.critic,
                       all_model_path + "critic_num_epoch_{}.pth".format(i))
            s = self.valid_env_instance.reset()
            done = False
            rewards = 0
            while not done:

                old_state = s
                action = self.net(torch.from_numpy(s).float())
                s, reward, done, _ = self.valid_env_instance.step(
                    action.detach().numpy())
                rewards = rewards + reward
            rewards_list.append(rewards)
        index = rewards_list.index(np.max(rewards_list))
        actor_model_path = all_model_path + "actor_num_epoch_{}.pth".format(
            index)
        critic_model_path = all_model_path + "critic_num_epoch_{}.pth".format(
            index)
        self.net = torch.load(actor_model_path)
        self.critic = torch.load(critic_model_path)
        torch.save(self.net, best_model_path + "actor.pth")
        torch.save(self.critic, best_model_path + "critic.pth")

    def test(self):
        s = self.test_env_instance.reset()
        done = False
        while not done:
            old_state = s
            action = self.net(torch.from_numpy(s).float())
            s, reward, done, _ = self.test_env_instance.step(
                action.detach().numpy())
        df_return = self.test_env_instance.save_portfolio_return_memory()
        df_assets = self.test_env_instance.save_asset_memory()
        assets = df_assets["total assets"].values
        daily_return = df_return.daily_return.values
        df = pd.DataFrame()
        df["daily_return"] = daily_return
        df["total assets"] = assets
        if not os.path.exists(self.result_path):
            os.makedirs(self.result_path)
        df.to_csv(self.result_path + "/result.csv")
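
The arithmetic inside learn can be followed with plain numbers. The sketch below is not the baseline's code: it swaps the hand-written list buffer for `collections.deque` with `maxlen`, samples a small batch, and computes the TD target `r + gamma * Q(s', a')` and the squared TD error for made-up scalar values.

```python
import random
from collections import deque

# A bounded buffer: once full, the oldest transition is dropped automatically.
buffer = deque(maxlen=3)
for step in range(5):
    # (state, action, reward, next_state) with made-up scalar placeholders
    buffer.append((step, 0, float(step), step + 1))

rng = random.Random(0)
batch = rng.sample(list(buffer), 2)  # random minibatch of stored transitions


def td_target(reward, next_q, gamma=0.99):
    """Target for the critic: the reward plus the discounted value of the next step."""
    return reward + gamma * next_q


def squared_td_error(q_eval, reward, next_q, gamma=0.99):
    """Squared gap between the critic's current estimate and the TD target."""
    return (td_target(reward, next_q, gamma) - q_eval) ** 2


target = td_target(reward=1.0, next_q=2.0, gamma=0.5)
loss = squared_td_error(q_eval=1.5, reward=1.0, next_q=2.0, gamma=0.5)
print([t[0] for t in buffer], len(batch), target, loss)
```

In the trader, the target is detached before the loss is computed, so only the critic's current estimate receives gradients; the actor is trained separately by minimizing the negative of the critic's score.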

In [17]:
agent=trader()
agent.train_with_valid()
agent.test()
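
Once result.csv has been written, the two saved columns can be summarized with a few lines of code. The sketch below works on made-up lists instead of the file and assumes that `daily_return` holds simple (not logarithmic) returns. The two metrics are common choices for inspecting a backtest, not the official scoring rule of the competition.

```python
def cumulative_return(daily_returns):
    """Compound simple daily returns into one overall return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def max_drawdown(total_assets):
    """Largest relative drop from a running peak of the asset curve."""
    peak = total_assets[0]
    worst = 0.0
    for value in total_assets:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Made-up values standing in for the two columns of result.csv.
daily_returns = [0.10, -0.50, 0.20]
total_assets = [100.0, 110.0, 55.0, 66.0]

print(cumulative_return(daily_returns))
print(max_drawdown(total_assets))
```

To use the real results, load the file with `pd.read_csv` and pass the `daily_return` and `total assets` columns as lists.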
