<a href="https://colab.research.google.com/github/AI4Finance-Foundation/FinRL-Meta/blob/master/tutorials/1-Introduction/FinRL_PortfolioAllocation_NeurIPS_2020.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Reinforcement Learning for Stock Trading from Scratch: Portfolio Allocation

Tutorials to use OpenAI DRL to perform portfolio allocation in one Jupyter Notebook | Presented at NeurIPS 2020: Deep RL Workshop

* This blog is based on our paper: FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance, presented at NeurIPS 2020: Deep RL Workshop.
* Check out medium blog for detailed explanations: https://towardsdatascience.com/finrl-for-quantitative-finance-tutorial-for-portfolio-allocation-9b417660c7cd
* Please report any issues to our Github: https://github.com/AI4Finance-Foundation/FinRL/issues

ESG-VARIABLES-PENALIZING
* **Pytorch Version**

# Content

* [1. Problem Definition](#0)
* [2. Getting Started - Load Python packages](#1)
    * [2.1. Install Packages](#1.1)    
    * [2.2. Check Additional Packages](#1.2)
    * [2.3. Import Packages](#1.3)
    * [2.4. Create Folders](#1.4)
* [3. Download Data](#2)
* [4. Preprocess Data](#3)        
    * [4.1. Technical Indicators](#3.1)
    * [4.2. Perform Feature Engineering](#3.2)
* [5.Build Environment](#4)  
    * [5.1. Training & Trade Data Split](#4.1)
    * [5.2. User-defined Environment](#4.2)   
    * [5.3. Initialize Environment](#4.3)    
* [6.Implement DRL Algorithms](#5)  
* [7.Backtesting Performance](#6)  
    * [7.1. BackTestStats](#6.1)
    * [7.2. BackTestPlot](#6.2)   
    * [7.3. Baseline Stats](#6.3)   
    * [7.3. Compare to Stock Market Index](#6.4)             

<a id='0'></a>
# Part 1. Problem Definition

This problem is to design an automated trading solution for portfolio alloacation. We model the stock trading process as a Markov Decision Process (MDP). We then formulate our trading goal as a maximization problem.

The algorithm is trained using Deep Reinforcement Learning (DRL) algorithms and the components of the reinforcement learning environment are:


* Action: The action space describes the allowed actions that the agent interacts with the
environment. Normally, a ∈ A represents the weight of a stock in the porfolio: a ∈ (-1,1). Assume our stock pool includes N stocks, we can use a list [a<sub>1</sub>, a<sub>2</sub>, ... , a<sub>N</sub>] to determine the weight for each stock in the porfotlio, where a<sub>i</sub> ∈ (-1,1), a<sub>1</sub>+ a<sub>2</sub>+...+a<sub>N</sub>=1. For example, "The weight of AAPL in the portfolio is 10%." is [0.1 , ...].

* Reward function: r(s, a, s′) is the incentive mechanism for an agent to learn a better action. The change of the portfolio value when action a is taken at state s and arriving at new state s',  i.e., r(s, a, s′) = v′ − v, where v′ and v represent the portfolio
values at state s′ and s, respectively

* State: The state space describes the observations that the agent receives from the environment. Just as a human trader needs to analyze various information before executing a trade, so
our trading agent observes many different features to better learn in an interactive environment.

* Environment: Dow 30 consituents


The data of the single stock that we will be using for this case study is obtained from Yahoo Finance API. The data contains Open-High-Low-Close price and volume.


<a id='1'></a>
# Part 2. Getting Started- Load Python Packages

<a id='1.1'></a>
## 2.1. Install all the packages through FinRL library



<a id='1.2'></a>
## 2.2. Check if the additional packages needed are present, if not install them.
* Yahoo Finance API
* pandas
* numpy
* matplotlib
* stockstats
* OpenAI gym
* stable-baselines
* tensorflow
* pyfolio

<a id='1.3'></a>
## 2.3. Import Packages

In [1]:
!pip install stockstats
!pip install hyperopt
!pip install pyfolio
import stockstats
from hyperopt import fmin, tpe, hp, Trials, space_eval
import pyfolio
from collections import deque

Collecting empyrical>=0.5.0 (from pyfolio)
  Using cached empyrical-0.5.5-py3-none-any.whl
Installing collected packages: empyrical
  Attempting uninstall: empyrical
    Found existing installation: empyrical 0.3.4
    Uninstalling empyrical-0.3.4:
      Successfully uninstalled empyrical-0.3.4
Successfully installed empyrical-0.5.5




<a id='1.4'></a>
## 2.4. FinRL Offline Scripts

<a id='1.4'></a>
## 2.4.1. Yahoo Downloader (from finrl.meta.preprocessor.yahoodownloader import YahooDownloader)

In [2]:
"""Contains methods and classes to collect data from
Yahoo Finance API
"""

from __future__ import annotations

import pandas as pd
import yfinance as yf


class YahooDownloader:
    """Provides methods for retrieving daily stock data from
    Yahoo Finance API

    Attributes
    ----------
        start_date : str
            start date of the data (modified from neofinrl_config.py)
        end_date : str
            end date of the data (modified from neofinrl_config.py)
        ticker_list : list
            a list of stock tickers (modified from neofinrl_config.py)

    Methods
    -------
    fetch_data()
        Fetches data from yahoo API

    """

    def __init__(self, start_date: str, end_date: str, ticker_list: list):
        self.start_date = start_date
        self.end_date = end_date
        self.ticker_list = ticker_list

    def fetch_data(self, proxy=None, auto_adjust=False) -> pd.DataFrame:
        """Fetches data from Yahoo API
        Parameters
        ----------

        Returns
        -------
        `pd.DataFrame`
            7 columns: A date, open, high, low, close, volume and tick symbol
            for the specified stock ticker
        """
        # Download and save the data in a pandas DataFrame:
        data_df = pd.DataFrame()
        num_failures = 0
        for tic in self.ticker_list:
            temp_df = yf.download(
                tic,
                start=self.start_date,
                end=self.end_date,
                proxy=proxy,
                auto_adjust=auto_adjust,
            )
            if temp_df.columns.nlevels != 1:
                temp_df.columns = temp_df.columns.droplevel(1)
            temp_df["tic"] = tic
            if len(temp_df) > 0:
                # data_df = data_df.append(temp_df)
                data_df = pd.concat([data_df, temp_df], axis=0)
            else:
                num_failures += 1
        if num_failures == len(self.ticker_list):
            raise ValueError("no data is fetched.")
        # reset the index, we want to use numbers as index instead of dates
        data_df = data_df.reset_index()
        try:
            # convert the column names to standardized names
            data_df.rename(
                columns={
                    "Date": "date",
                    "Adj Close": "adjcp",
                    "Close": "close",
                    "High": "high",
                    "Low": "low",
                    "Volume": "volume",
                    "Open": "open",
                    "tic": "tic",
                },
                inplace=True,
            )

            # use adjusted close price instead of close price
            data_df["close"] = data_df["adjcp"]
            # drop the adjusted close price column
            data_df = data_df.drop(labels="adjcp", axis=1)
        except NotImplementedError:
            print("the features are not supported currently")
        # create day of the week column (monday = 0)
        data_df["day"] = data_df["date"].dt.dayofweek
        # convert date to standard string format, easy to filter
        data_df["date"] = data_df.date.apply(lambda x: x.strftime("%Y-%m-%d"))
        # drop missing data
        data_df = data_df.dropna()
        data_df = data_df.reset_index(drop=True)
        print("Shape of DataFrame: ", data_df.shape)
        # print("Display DataFrame: ", data_df.head())

        data_df = data_df.sort_values(by=["date", "tic"]).reset_index(drop=True)

        return data_df

    def select_equal_rows_stock(self, df):
        df_check = df.tic.value_counts()
        df_check = pd.DataFrame(df_check).reset_index()
        df_check.columns = ["tic", "counts"]
        mean_df = df_check.counts.mean()
        equal_list = list(df.tic.value_counts() >= mean_df)
        names = df.tic.value_counts().index
        select_stocks_list = list(names[equal_list])
        df = df[df.tic.isin(select_stocks_list)]
        return df

<a id='1.4'></a>
## 2.4.2. Data Split (from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split)

In [3]:
import datetime
import numpy as np
import pandas as pd
from multiprocessing.sharedctypes import Value

import numpy as np
import pandas as pd
from stockstats import StockDataFrame as Sdf

def load_dataset(*, file_name: str) -> pd.DataFrame:
    """
    load csv dataset from path
    :return: (df) pandas dataframe
    """
    # _data = pd.read_csv(f"{config.DATASET_DIR}/{file_name}")
    _data = pd.read_csv(file_name)
    return _data


def data_split(df, start, end, target_date_col="date"):
    """
    split the dataset into training or testing using date
    :param data: (df) pandas dataframe, start, end
    :return: (df) pandas dataframe
    """
    data = df[(df[target_date_col] >= start) & (df[target_date_col] < end)]
    data = data.sort_values([target_date_col, "tic"], ignore_index=True)
    data.index = data[target_date_col].factorize()[0]
    return data


def convert_to_datetime(time):
    time_fmt = "%Y-%m-%dT%H:%M:%S"
    if isinstance(time, str):
        return datetime.datetime.strptime(time, time_fmt)

<a id='1.4'></a>
## 2.4.3. Backtesting Functions (from finrl.plot import backtest_stats, backtest_plot, get_daily_return, get_baseline,convert_daily_return_to_pyfolio_ts)

In [4]:
from __future__ import annotations

import copy
import datetime
from copy import deepcopy
import empyrical as ep

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyfolio
from pyfolio import timeseries
import itertools

# Replacing from pyfolio import timeseries with original codes ##

def gross_lev(positions):
    """
    Calculates the gross leverage of a strategy.

    Parameters
    ----------
    positions : pd.DataFrame
        Daily net position values.
         - See full explanation in tears.create_full_tear_sheet.

    Returns
    -------
    pd.Series
        Gross leverage.
    """

    exposure = positions.drop('cash', axis=1).abs().sum(axis=1)
    return exposure / positions.sum(axis=1)

def get_txn_vol(transactions):
    """
    Extract daily transaction data from set of transaction objects.

    Parameters
    ----------
    transactions : pd.DataFrame
        Time series containing one row per symbol (and potentially
        duplicate datetime indices) and columns for amount and
        price.

    Returns
    -------
    pd.DataFrame
        Daily transaction volume and number of shares.
         - See full explanation in tears.create_full_tear_sheet.
    """

    txn_norm = transactions.copy()
    txn_norm.index = txn_norm.index.normalize()
    amounts = txn_norm.amount.abs()
    prices = txn_norm.price
    values = amounts * prices
    daily_amounts = amounts.groupby(amounts.index).sum()
    daily_values = values.groupby(values.index).sum()
    daily_amounts.name = "txn_shares"
    daily_values.name = "txn_volume"
    return pd.concat([daily_values, daily_amounts], axis=1)

def get_turnover(positions, transactions, denominator='AGB'):
    """
     - Value of purchases and sales divided
    by either the actual gross book or the portfolio value
    for the time step.

    Parameters
    ----------
    positions : pd.DataFrame
        Contains daily position values including cash.
        - See full explanation in tears.create_full_tear_sheet
    transactions : pd.DataFrame
        Prices and amounts of executed trades. One row per trade.
        - See full explanation in tears.create_full_tear_sheet
    denominator : str, optional
        Either 'AGB' or 'portfolio_value', default AGB.
        - AGB (Actual gross book) is the gross market
        value (GMV) of the specific algo being analyzed.
        Swapping out an entire portfolio of stocks for
        another will yield 200% turnover, not 100%, since
        transactions are being made for both sides.
        - We use average of the previous and the current end-of-period
        AGB to avoid singularities when trading only into or
        out of an entire book in one trading period.
        - portfolio_value is the total value of the algo's
        positions end-of-period, including cash.

    Returns
    -------
    turnover_rate : pd.Series
        timeseries of portfolio turnover rates.
    """

    txn_vol = get_txn_vol(transactions)
    traded_value = txn_vol.txn_volume

    if denominator == 'AGB':
        # Actual gross book is the same thing as the algo's GMV
        # We want our denom to be avg(AGB previous, AGB current)
        AGB = positions.drop('cash', axis=1).abs().sum(axis=1)
        denom = AGB.rolling(2).mean()

        # Since the first value of pd.rolling returns NaN, we
        # set our "day 0" AGB to 0.
        denom.iloc[0] = AGB.iloc[0] / 2
    elif denominator == 'portfolio_value':
        denom = positions.sum(axis=1)
    else:
        raise ValueError(
            "Unexpected value for denominator '{}'. The "
            "denominator parameter must be either 'AGB'"
            " or 'portfolio_value'.".format(denominator)
        )

    denom.index = denom.index.normalize()
    turnover = traded_value.div(denom, axis='index')
    turnover = turnover.fillna(0)
    return turnover

SIMPLE_STAT_FUNCS = [
    ep.annual_return,
    ep.cum_returns_final,
    ep.annual_volatility,
    ep.sharpe_ratio,
    ep.calmar_ratio,
    ep.stability_of_timeseries,
    # ep.max_drawdown,
    ep.omega_ratio,
    # ep.sortino_ratio,
    # stats.skew,
    # stats.kurtosis,
    # ep.tail_ratio,
    # value_at_risk
]

FACTOR_STAT_FUNCS = [
    # ep.alpha,
    # ep.beta,
]

STAT_FUNC_NAMES = {
    'annual_return': 'Annual return',
    'cum_returns_final': 'Cumulative returns',
    'annual_volatility': 'Annual volatility',
    'sharpe_ratio': 'Sharpe ratio',
    'calmar_ratio': 'Calmar ratio',
    'stability_of_timeseries': 'Stability',
    # 'max_drawdown': 'Max drawdown',
    'omega_ratio': 'Omega ratio',
    # 'sortino_ratio': 'Sortino ratio',
    # 'skew': 'Skew',
    # 'kurtosis': 'Kurtosis',
    # 'tail_ratio': 'Tail ratio',
    # 'common_sense_ratio': 'Common sense ratio',
    # 'value_at_risk': 'Daily value at risk',
    # 'alpha': 'Alpha',
    # 'beta': 'Beta',
}


def perf_stats(returns, factor_returns=None, positions=None,
               transactions=None, turnover_denom='AGB'):
    """
    Calculates various performance metrics of a strategy, for use in
    plotting.show_perf_stats.

    Parameters
    ----------
    returns : pd.Series
        Daily returns of the strategy, noncumulative.
         - See full explanation in tears.create_full_tear_sheet.
    factor_returns : pd.Series, optional
        Daily noncumulative returns of the benchmark factor to which betas are
        computed. Usually a benchmark such as market returns.
         - This is in the same style as returns.
         - If None, do not compute alpha, beta, and information ratio.
    positions : pd.DataFrame
        Daily net position values.
         - See full explanation in tears.create_full_tear_sheet.
    transactions : pd.DataFrame
        Prices and amounts of executed trades. One row per trade.
        - See full explanation in tears.create_full_tear_sheet.
    turnover_denom : str
        Either AGB or portfolio_value, default AGB.
        - See full explanation in txn.get_turnover.

    Returns
    -------
    pd.Series
        Performance metrics.
    """

    stats = pd.Series()
    for stat_func in SIMPLE_STAT_FUNCS:
        stats[STAT_FUNC_NAMES[stat_func.__name__]] = stat_func(returns)

    if positions is not None:
        stats['Gross leverage'] = gross_lev(positions).mean()
        if transactions is not None:
            stats['Daily turnover'] = get_turnover(positions,
                                                   transactions,
                                                   turnover_denom).mean()
    if factor_returns is not None:
        for stat_func in FACTOR_STAT_FUNCS:
            res = stat_func(returns, factor_returns)
            stats[STAT_FUNC_NAMES[stat_func.__name__]] = res

    return stats
#######################
def date2str(dat: datetime.date) -> str:
    return datetime.date.strftime(dat, "%Y-%m-%d")

def str2date(dat: str) -> datetime.date:
    return datetime.datetime.strptime(dat, "%Y-%m-%d").date()

def get_daily_return(df, value_col_name="account_value"):
    df = deepcopy(df)
    df["daily_return"] = df[value_col_name].pct_change(1)
    df["date"] = pd.to_datetime(df["date"])
    df.set_index("date", inplace=True, drop=True)
    df.index = df.index.tz_localize("UTC")
    return pd.Series(df["daily_return"], index=df.index)


def convert_daily_return_to_pyfolio_ts(df):
    strategy_ret = df.copy()
    strategy_ret["date"] = pd.to_datetime(strategy_ret["date"])
    strategy_ret.set_index("date", drop=False, inplace=True)
    strategy_ret.index = strategy_ret.index.tz_localize("UTC")
    del strategy_ret["date"]
    return pd.Series(strategy_ret["daily_return"].values, index=strategy_ret.index)


# def backtest_stats(account_value, value_col_name="account_value"):
#     dr_test = get_daily_return(account_value, value_col_name=value_col_name)
#     perf_stats_all = timeseries.perf_stats(
#         returns=dr_test,
#         positions=None,
#         transactions=None,
#         turnover_denom="AGB",
#     )
#     print(perf_stats_all)
#     return perf_stats_all

def backtest_stats(account_value, value_col_name="account_value"):
    dr_test = get_daily_return(account_value, value_col_name=value_col_name)
    perf_stats_all = perf_stats(
        returns=dr_test,
        positions=None,
        transactions=None,
        turnover_denom="AGB",
    )
    print(perf_stats_all)
    return perf_stats_all


# def backtest_plot(
#     account_value,
#     baseline_start=TRADE_START_DATE,
#     baseline_end=TRADE_END_DATE,
#     baseline_ticker="^DJI",
#     value_col_name="account_value",
# ):
#     df = deepcopy(account_value)
#     df["date"] = pd.to_datetime(df["date"])
#     test_returns = get_daily_return(df, value_col_name=value_col_name)

#     baseline_df = get_baseline(
#         ticker=baseline_ticker, start=baseline_start, end=baseline_end
#     )

#     baseline_df["date"] = pd.to_datetime(baseline_df["date"], format="%Y-%m-%d")
#     baseline_df = pd.merge(df[["date"]], baseline_df, how="left", on="date")
#     baseline_df = baseline_df.fillna(method="ffill").fillna(method="bfill")
#     baseline_returns = get_daily_return(baseline_df, value_col_name="close")

#     with pyfolio.plotting.plotting_context(font_scale=1.1):
#         pyfolio.create_full_tear_sheet(
#             returns=test_returns, benchmark_rets=baseline_returns, set_context=False
#         )


def get_baseline(ticker, start, end):
    return YahooDownloader(
        start_date=start, end_date=end, ticker_list=[ticker]
    ).fetch_data()


def trx_plot(df_trade, df_actions, ticker_list):
    df_trx = pd.DataFrame(np.array(df_actions["transactions"].to_list()))
    df_trx.columns = ticker_list
    df_trx.index = df_actions["date"]
    df_trx.index.name = ""

    for i in range(df_trx.shape[1]):
        df_trx_temp = df_trx.iloc[:, i]
        df_trx_temp_sign = np.sign(df_trx_temp)
        buying_signal = df_trx_temp_sign.apply(lambda x: x > 0)
        selling_signal = df_trx_temp_sign.apply(lambda x: x < 0)

        tic_plot = df_trade[
            (df_trade["tic"] == df_trx_temp.name)
            & (df_trade["date"].isin(df_trx.index))
        ]["close"]
        tic_plot.index = df_trx_temp.index

        plt.figure(figsize=(10, 8))
        plt.plot(tic_plot, color="g", lw=2.0)
        plt.plot(
            tic_plot,
            "^",
            markersize=10,
            color="m",
            label="buying signal",
            markevery=buying_signal,
        )
        plt.plot(
            tic_plot,
            "v",
            markersize=10,
            color="k",
            label="selling signal",
            markevery=selling_signal,
        )
        plt.title(
            f"{df_trx_temp.name} Num Transactions: {len(buying_signal[buying_signal == True]) + len(selling_signal[selling_signal == True])}"
        )
        plt.legend()
        plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=25))
        plt.xticks(rotation=45, ha="right")
        plt.show()


# 2022-01-15 -> 01/15/2022
def transfer_date(str_dat):
    return datetime.datetime.strptime(str_dat, "%Y-%m-%d").date().strftime("%m/%d/%Y")


def plot_result_from_csv(
    csv_file: str,
    column_as_x: str,
    savefig_filename: str = "fig/result.png",
    xlabel: str = "Date",
    ylabel: str = "Result",
    num_days_xticks: int = 20,
    xrotation: int = 0,
):
    result = pd.read_csv(csv_file)
    plot_result(
        result,
        column_as_x,
        savefig_filename,
        xlabel,
        ylabel,
        num_days_xticks,
        xrotation,
    )


# select_start_date: included
# select_end_date: included
# is if_need_calc_return is True, it is account_value, and then transfer it to return
# it is better that column_as_x is the first column, and the other columns are strategies
# xrotation: the rotation of xlabel, may be used in dates. Default=0 (adaptive adjustment)
def plot_result(
    result: pd.DataFrame(),
    column_as_x: str,
    savefig_filename: str = "fig/result.png",
    xlabel: str = "Date",
    ylabel: str = "Result",
    num_days_xticks: int = 20,
    xrotation: int = 0,
):
    columns = result.columns
    columns_strtegy = []
    for i in range(len(columns)):
        col = columns[i]
        if "Unnamed" not in col and col != column_as_x:
            columns_strtegy.append(col)

    result.reindex()

    x = result[column_as_x].values.tolist()
    plt.rcParams["figure.figsize"] = (15, 6)
    # plt.figure()

    fig, ax = plt.subplots()
    colors = [
        "black",
        "red",
        "green",
        "blue",
        "cyan",
        "magenta",
        "yellow",
        "aliceblue",
        "coral",
        "darksalmon",
        "firebrick",
        "honeydew",
    ]
    for i in range(len(columns_strtegy)):
        col = columns_strtegy[i]
        ax.plot(
            x,
            result[col],
            color=colors[i],
            linewidth=1,
            linestyle="-",
        )

    plt.title("", fontsize=20)
    plt.xlabel(xlabel, fontsize=20)
    plt.ylabel(ylabel, fontsize=20)

    plt.legend(labels=columns_strtegy, loc="best", fontsize=16)

    # set grid
    plt.grid()

    plt.xticks(size=22)  # 设置刻度大小
    plt.yticks(size=22)  # 设置刻度大小

    # #设置每隔多少距离⼀个刻度
    # plt.xticks(x[::60])

    # # 设置每月定位符
    # if if_set_x_monthlocator:
    #     ax.xaxis.set_major_locator(mdates.MonthLocator())  # interval = 1

    # 设置每隔多少距离⼀个刻度
    plt.xticks(x[::num_days_xticks])

    plt.setp(ax.get_xticklabels(), rotation=xrotation, horizontalalignment="center")

    # 为防止x轴label重叠，自动调整label旋转角度
    if xrotation == 0:
        if_overlap = get_if_overlap(fig, ax)

        if if_overlap == True:
            plt.gcf().autofmt_xdate(ha="right")  # ⾃动旋转⽇期标记

    plt.tight_layout()  # 自动调整子图间距

    plt.savefig(savefig_filename)

    plt.show()


def get_if_overlap(fig, ax):
    fig.canvas.draw()
    # 获取日期标签的边界框
    bboxes = [label.get_window_extent() for label in ax.get_xticklabels()]
    # 计算日期标签之间的距离
    distances = [bboxes[i + 1].x0 - bboxes[i].x1 for i in range(len(bboxes) - 1)]
    # 如果有任何距离小于0，说明有重叠
    if any(distance < 0 for distance in distances):
        if_overlap = True
    else:
        if_overlap = False

    return if_overlap


def plot_return(
    result: pd.DataFrame(),
    column_as_x: str,
    if_need_calc_return: bool,
    savefig_filename: str = "fig/result.png",
    xlabel: str = "Date",
    ylabel: str = "Return",
    if_transfer_date: bool = True,
    select_start_date: str = None,
    select_end_date: str = None,
    num_days_xticks: int = 20,
    xrotation: int = 0,
):
    if select_start_date is None:
        select_start_date: str = result[column_as_x].iloc[0]
        select_end_date: str = result[column_as_x].iloc[-1]
    # calc returns if if_need_calc_return is True, so that result stores returns
    select_start_date_index = result[column_as_x].tolist().index(select_start_date)
    columns = result.columns
    columns_strtegy = []
    column_as_x_index = None
    for i in range(len(columns)):
        col = columns[i]
        if col == column_as_x:
            column_as_x_index = i
        elif "Unnamed" not in col:
            columns_strtegy.append(col)
            if if_need_calc_return:
                result[col] = result[col] / result[col][select_start_date_index] - 1

    # select the result between select_start_date and select_end_date
    # if date is 2020-01-15, transfer it to 01/15/2020
    num_rows, num_cols = result.shape
    tmp_result = copy.deepcopy(result)
    result = pd.DataFrame()
    if_first_row = True
    columns = []
    for i in range(num_rows):
        if (
            str2date(select_start_date)
            <= str2date(tmp_result[column_as_x][i])
            <= str2date(select_end_date)
        ):
            if "-" in tmp_result.iloc[i][column_as_x] and if_transfer_date:
                new_date = transfer_date(tmp_result.iloc[i][column_as_x])
            else:
                new_date = tmp_result.iloc[i][column_as_x]
            tmp_result.iloc[i, column_as_x_index] = new_date
            # print("tmp_result.iloc[i]: ", tmp_result.iloc[i])
            # result = result.append(tmp_result.iloc[i])
            if if_first_row:
                columns = tmp_result.iloc[i].index.tolist()
                result = pd.DataFrame(columns=columns)
                # result = pd.concat([result, tmp_result.iloc[i]], axis=1)
                # result = pd.DataFrame(tmp_result.iloc[i])
                # result.columns = tmp_result.iloc[i].index.tolist()
                if_first_row = False
            row = pd.DataFrame([tmp_result.iloc[i].tolist()], columns=columns)
            result = pd.concat([result, row], axis=0)

    # print final return of each strategy
    final_return = {}
    for col in columns_strtegy:
        final_return[col] = result.iloc[-1][col]
    print("final return: ", final_return)

    result.reindex()

    plot_result(
        result=result,
        column_as_x=column_as_x,
        savefig_filename=savefig_filename,
        xlabel=xlabel,
        ylabel=ylabel,
        num_days_xticks=num_days_xticks,
        xrotation=xrotation,
    )


def plot_return_from_csv(
    csv_file: str,
    column_as_x: str,
    if_need_calc_return: bool,
    savefig_filename: str = "fig/result.png",
    xlabel: str = "Date",
    ylabel: str = "Return",
    if_transfer_date: bool = True,
    select_start_date: str = None,
    select_end_date: str = None,
    num_days_xticks: int = 20,
    xrotation: int = 0,
):
    result = pd.read_csv(csv_file)
    plot_return(
        result,
        column_as_x,
        if_need_calc_return,
        savefig_filename,
        xlabel,
        ylabel,
        if_transfer_date,
        select_start_date,
        select_end_date,
        num_days_xticks,
        xrotation,
    )

In [5]:
import copy
import datetime
import os
from datetime import date
from datetime import timedelta
from typing import List
from typing import Tuple

import numpy as np
import pandas as pd

<a id='2'></a>
# Part 3. Download Data
Yahoo Finance is a website that provides stock data, financial news, financial reports, etc. All the data provided by Yahoo Finance is free.
* FinRL uses a class **YahooDownloader** to fetch data from Yahoo Finance API
* Call Limit: Using the Public API (without authentication), you are limited to 2,000 requests per hour per IP (or up to a total of 48,000 requests a day).


In [6]:
# Indian Sensex 33
sensex_ticker = ["ASIANPAINT.NS", "AXISBANK.NS", "BAJFINANCE.NS", "BAJAJFINSV.NS", "BHARTIARTL.NS", "HCLTECH.NS", "HDFCBANK.NS",
                 "HINDUNILVR.NS", "ICICIBANK.NS", "INDUSINDBK.NS", "INFY.NS", "ITC.NS", "JSWSTEEL.NS", "KOTAKBANK.NS", "LT.NS",
                 "M&M.NS", "MARUTI.NS", "NESTLEIND.NS", "NTPC.NS", "POWERGRID.NS", "RELIANCE.NS", "SBIN.NS", "SUNPHARMA.NS",
                 "TATAMOTORS.NS", "TATASTEEL.NS", "TCS.NS", "TECHM.NS", "TITAN.NS", "ULTRACEMCO.NS", "WIPRO.NS"]

Nifty_50 = ['ADANIENT.NS', 'ADANIPORTS.NS', 'APOLLOHOSP.NS', 'ASIANPAINT.NS',
       'AXISBANK.NS', 'BAJAJ-AUTO.NS', 'BAJAJFINSV.NS', 'BAJFINANCE.NS',
       'BHARTIARTL.NS', 'BPCL.NS', 'BRITANNIA.NS', 'CIPLA.NS', 'COALINDIA.NS',
       'DIVISLAB.NS', 'DRREDDY.NS', 'EICHERMOT.NS', 'GRASIM.NS', 'HCLTECH.NS',
       'HDFCBANK.NS', 'HEROMOTOCO.NS', 'HINDALCO.NS', 'HINDUNILVR.NS',
       'ICICIBANK.NS', 'INDUSINDBK.NS', 'ITC.NS', 'JSWSTEEL.NS',
       'KOTAKBANK.NS', 'LT.NS', 'M&M.NS', 'MARUTI.NS', 'NESTLEIND.NS',
       'NTPC.NS', 'ONGC.NS', 'POWERGRID.NS', 'RELIANCE.NS', 'SBIN.NS',
       'SUNPHARMA.NS', 'TATACONSUM.NS', 'TATAMOTORS.NS', 'TATASTEEL.NS',
       'TCS.NS', 'TECHM.NS', 'TITAN.NS', 'UPL.NS', 'WIPRO.NS', "INFY.NS", "ULTRACEMCO.NS"]

In [7]:
!pip install setuptools==66
!pip install stockstats
!pip install hyperopt
# !pip install pyfolio
import stockstats
from hyperopt import fmin, tpe, hp, Trials, space_eval
# import pyfolio
from collections import deque



In [8]:
"""Contains methods and classes to collect data from
Yahoo Finance API
"""

from __future__ import annotations

import pandas as pd
import yfinance as yf


class YahooDownloader:
    """Provides methods for retrieving daily stock data from
    Yahoo Finance API

    Attributes
    ----------
        start_date : str
            start date of the data (modified from neofinrl_config.py)
        end_date : str
            end date of the data (modified from neofinrl_config.py)
        ticker_list : list
            a list of stock tickers (modified from neofinrl_config.py)

    Methods
    -------
    fetch_data()
        Fetches data from yahoo API

    """

    def __init__(self, start_date: str, end_date: str, ticker_list: list):
        self.start_date = start_date
        self.end_date = end_date
        self.ticker_list = ticker_list

    def fetch_data(self, proxy=None, auto_adjust=False) -> pd.DataFrame:
        """Fetches data from Yahoo API
        Parameters
        ----------

        Returns
        -------
        `pd.DataFrame`
            7 columns: A date, open, high, low, close, volume and tick symbol
            for the specified stock ticker
        """
        # Download and save the data in a pandas DataFrame:
        data_df = pd.DataFrame()
        num_failures = 0
        for tic in self.ticker_list:
            temp_df = yf.download(
                tic,
                start=self.start_date,
                end=self.end_date,
                proxy=proxy,
                auto_adjust=auto_adjust,
            )
            if temp_df.columns.nlevels != 1:
                temp_df.columns = temp_df.columns.droplevel(1)
            temp_df["tic"] = tic
            if len(temp_df) > 0:
                # data_df = data_df.append(temp_df)
                data_df = pd.concat([data_df, temp_df], axis=0)
            else:
                num_failures = num_failures+ 1
        if num_failures == len(self.ticker_list):
            raise ValueError("no data is fetched.")
        # reset the index, we want to use numbers as index instead of dates
        data_df = data_df.reset_index()
        try:
            # convert the column names to standardized names
            data_df.rename(
                columns={
                    "Date": "date",
                    "Adj Close": "adjcp",
                    "Close": "close",
                    "High": "high",
                    "Low": "low",
                    "Volume": "volume",
                    "Open": "open",
                    "tic": "tic",
                },
                inplace=True,
            )

            # use adjusted close price instead of close price
            data_df["close"] = data_df["adjcp"]
            # drop the adjusted close price column
            data_df = data_df.drop(labels="adjcp", axis=1)
        except NotImplementedError:
            print("the features are not supported currently")
        # create day of the week column (monday = 0)
        data_df["day"] = data_df["date"].dt.dayofweek
        # convert date to standard string format, easy to filter
        data_df["date"] = data_df.date.apply(lambda x: x.strftime("%Y-%m-%d"))
        # drop missing data
        data_df = data_df.dropna()
        data_df = data_df.reset_index(drop=True)
        print("Shape of DataFrame: ", data_df.shape)
        # print("Display DataFrame: ", data_df.head())

        data_df = data_df.sort_values(by=["date", "tic"]).reset_index(drop=True)

        return data_df

    def select_equal_rows_stock(self, df):
        df_check = df.tic.value_counts()
        df_check = pd.DataFrame(df_check).reset_index()
        df_check.columns = ["tic", "counts"]
        mean_df = df_check.counts.mean()
        equal_list = list(df.tic.value_counts() >= mean_df)
        names = df.tic.value_counts().index
        select_stocks_list = list(names[equal_list])
        df = df[df.tic.isin(select_stocks_list)]
        return df

In [9]:
import datetime
import numpy as np
import pandas as pd
from multiprocessing.sharedctypes import Value

import numpy as np
import pandas as pd
from stockstats import StockDataFrame as Sdf

def load_dataset(*, file_name: str) -> pd.DataFrame:
    """
    load csv dataset from path
    :return: (df) pandas dataframe
    """
    # _data = pd.read_csv(f"{config.DATASET_DIR}/{file_name}")
    _data = pd.read_csv(file_name)
    return _data


def data_split(df, start, end, target_date_col="date"):
    """
    split the dataset into training or testing using date
    :param data: (df) pandas dataframe, start, end
    :return: (df) pandas dataframe
    """
    data = df[(df[target_date_col] >= start) & (df[target_date_col] < end)]
    data = data.sort_values([target_date_col, "tic"], ignore_index=True)
    data.index = data[target_date_col].factorize()[0]
    return data


def convert_to_datetime(time):
    time_fmt = "%Y-%m-%dT%H:%M:%S"
    if isinstance(time, str):
        return datetime.datetime.strptime(time, time_fmt)

In [10]:
import copy
import datetime
import os
from datetime import date
from datetime import timedelta
from typing import List
from typing import Tuple

import numpy as np
import pandas as pd

<a id='1.1'></a>
## 2.1. Install all the packages through FinRL library



In [11]:
!pip install hyperopt
from hyperopt import fmin, tpe, hp, Trials, space_eval



<a id='2'></a>
# Part 3. Download Data
Yahoo Finance is a website that provides stock data, financial news, financial reports, etc. All the data provided by Yahoo Finance is free.
* FinRL uses a class **YahooDownloader** to fetch data from Yahoo Finance API
* Call Limit: Using the Public API (without authentication), you are limited to 2,000 requests per hour per IP (or up to a total of 48,000 requests a day).


In [12]:
Market = 'Nifty_50'
Reward = 'LSTM'
BL = '^NSEI'
#BL = '^CNX100'

In [13]:
# Download and save the data in a pandas DataFrame:
df = YahooDownloader(start_date = '2011-01-01',
                     end_date = '2025-03-31',
                     ticker_list = Nifty_50).fetch_data()

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%********

Shape of DataFrame:  (165064, 8)


In [14]:
df.shape
df

Price,date,close,high,low,open,volume,tic,day
0,2011-01-03,83.078514,101.130882,98.862411,100.395164,877846,ADANIENT.NS,0
1,2011-01-03,134.131119,146.399994,143.050003,145.550003,487210,ADANIPORTS.NS,0
2,2011-01-03,437.000702,469.950012,458.000000,459.000000,51291,APOLLOHOSP.NS,0
3,2011-01-03,256.154358,293.739990,286.005005,289.799988,454300,ASIANPAINT.NS,0
4,2011-01-03,251.461945,274.399994,268.459991,273.000000,5266100,AXISBANK.NS,0
...,...,...,...,...,...,...,...,...
165059,2025-03-28,1418.250000,1430.000000,1408.550049,1415.099976,1642693,TECHM.NS,4
165060,2025-03-28,3063.350098,3111.800049,3051.050049,3094.000000,778417,TITAN.NS,4
165061,2025-03-28,11509.549805,11699.000000,11458.650391,11585.500000,280960,ULTRACEMCO.NS,4
165062,2025-03-28,636.250000,660.400024,624.000000,656.650024,4421109,UPL.NS,4


## Our Feature Engineering

In [15]:
from stockstats import StockDataFrame as Sdf

def add_tech(data, INDICATORS):
  df = data.copy()
  df = df.sort_values(by=["tic", "date"])
  stock = Sdf.retype(df.copy())
  unique_ticker = stock.tic.unique()

  for indicator in INDICATORS:
      indicator_df = pd.DataFrame()
      for i in range(len(unique_ticker)):
          try:
              temp_indicator = stock[stock.tic == unique_ticker[i]][indicator]
              temp_indicator = pd.DataFrame(temp_indicator)
              temp_indicator["tic"] = unique_ticker[i]
              temp_indicator["date"] = df[df.tic == unique_ticker[i]][
                  "date"
              ].to_list()
              # indicator_df = indicator_df.append(
              #     temp_indicator, ignore_index=True
              # )
              indicator_df = pd.concat(
                  [indicator_df, temp_indicator], axis=0, ignore_index=True
              )
          except Exception as e:
              print(e)
      df = df.merge(
          indicator_df[["tic", "date", indicator]], on=["tic", "date"], how="left"
      )

  df = df.sort_values(by=["date", "tic"])
  return df

In [16]:
INDICATORS = ['macd', 'boll_ub', 'boll_lb', 'rsi_30', 'cci_30', 'dx_30', 'close_30_sma', 'close_60_sma']
df = add_tech(df, INDICATORS)
df = df.ffill().bfill()

In [18]:
# add covariance matrix as states
df=df.sort_values(['date','tic'],ignore_index=True)
df.index = df.date.factorize()[0]

cov_list = []
return_list = []

# look back is one year
lookback=252
for i in range(lookback,len(df.index.unique())):
  data_lookback = df.loc[i-lookback:i,:]
  price_lookback=data_lookback.pivot_table(index = 'date',columns = 'tic', values = 'close')
  return_lookback = price_lookback.pct_change().dropna()
  return_list.append(return_lookback)

  covs = return_lookback.cov().values
  cov_list.append(covs)


df_cov = pd.DataFrame({'date':df.date.unique()[lookback:],'cov_list':cov_list,'return_list':return_list})
df = df.merge(df_cov, on='date')
df = df.sort_values(['date','tic']).reset_index(drop=True)

In [19]:
df.head(5)

Unnamed: 0,date,close,high,low,open,volume,tic,day,macd,boll_ub,boll_lb,rsi_30,cci_30,dx_30,close_30_sma,close_60_sma,cov_list,return_list
0,2012-01-10,36.641663,43.760029,41.185009,41.185009,14022646,ADANIENT.NS,1,-2.567197,44.168874,33.781454,34.120872,-128.817245,39.740656,39.926283,47.257094,"[[0.0008295954814270387, 0.0002227298398581159...",tic ADANIENT.NS ADANIPORTS.NS APOLLO...
1,2012-01-10,127.016884,137.0,132.850006,133.850006,869679,ADANIPORTS.NS,1,-0.575367,125.065236,106.958192,51.531531,192.37252,8.68298,117.216412,127.663853,"[[0.0008295954814270387, 0.0002227298398581159...",tic ADANIENT.NS ADANIPORTS.NS APOLLO...
2,2012-01-10,557.239258,597.900024,574.0,575.099976,201647,APOLLOHOSP.NS,1,1.530863,581.333663,469.007817,53.572836,7.650272,5.583804,548.080473,528.434672,"[[0.0008295954814270387, 0.0002227298398581159...",tic ADANIENT.NS ADANIPORTS.NS APOLLO...
3,2012-01-10,239.22226,269.5,264.5,268.855011,338040,ASIANPAINT.NS,1,-4.970677,250.422287,230.752893,40.195494,-53.408598,30.113608,245.87075,261.232402,"[[0.0008295954814270387, 0.0002227298398581159...",tic ADANIENT.NS ADANIPORTS.NS APOLLO...
4,2012-01-10,166.384033,179.399994,173.220001,173.800003,9827090,AXISBANK.NS,1,-5.285489,175.857171,144.370972,44.932195,-19.455753,18.67729,168.103823,183.434647,"[[0.0008295954814270387, 0.0002227298398581159...",tic ADANIENT.NS ADANIPORTS.NS APOLLO...


In [20]:
print(df.shape)

hist_vol=[]
for i in range(len(df['return_list'])):
  returns = df['return_list'].values[i].std()
  hist_vol.append(returns)
print(len(hist_vol))

hist_vol= np.array(hist_vol)
# print(hist_vol.shape)
# print(hist_vol)
hist_vol= pd.DataFrame(hist_vol, df['date'])
# print(hist_vol.shape)
# print(df)

(153220, 18)
153220


<a id='4'></a>
# Part 5. Design Environment
Considering the stochastic and interactive nature of the automated stock trading tasks, a financial task is modeled as a **Markov Decision Process (MDP)** problem. The training process involves observing stock price change, taking an action and reward's calculation to have the agent adjusting its strategy accordingly. By interacting with the environment, the trading agent will derive a trading strategy with the maximized rewards as time proceeds.

Our trading environments, based on OpenAI Gym framework, simulate live stock markets with real market data according to the principle of time-driven simulation.


In [21]:
# %%capture
!pip install shimmy
!pip install stable_baselines3
!pip install gym



In [22]:
import numpy as np
import pandas as pd
from gym.utils import seeding
import gym
from gym import spaces
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from stable_baselines3.common.vec_env import DummyVecEnv


## Training data split: 2009-01-01 to 2020-07-01

In [23]:
TRAIN_START_DATE = '2011-01-01'
TRAIN_END_DATE = '2021-12-31'

# TRAIN_END_DATE = '2012-12-01'

Val_START_DATE = '2022-01-01'
VAL_END_DATE =  '2022-12-31'
TRADE_START_DATE = '2023-01-01'
TRADE_END_DATE = '2025-03-31'
# print(df[30:])
# hist_vol = hist_vol.reset_index(drop=True)

train = data_split(df, TRAIN_START_DATE,TRAIN_END_DATE)
hist_vol_train = hist_vol[TRAIN_START_DATE : TRAIN_END_DATE]

val = data_split(df, Val_START_DATE, VAL_END_DATE)
hist_vol_val=hist_vol[Val_START_DATE :VAL_END_DATE]

full_train = data_split(df, TRAIN_START_DATE, VAL_END_DATE)
hist_vol_full_train= hist_vol[TRAIN_START_DATE :VAL_END_DATE]


# full_train = data_split(df, TRAIN_START_DATE,TRAIN_END_DATE)
# hist_vol_full_train= hist_vol[TRAIN_START_DATE :TRAIN_END_DATE]

trade = data_split(df, TRADE_START_DATE,TRADE_END_DATE)
hist_vol_trade= hist_vol[TRADE_START_DATE  : TRADE_END_DATE]

print(full_train.shape)

(127229, 18)


In [24]:
from collections import Counter

values = val.index
set(Counter(values).values())

{47}

In [25]:
# Working Stocks
train.loc[0,:]['return_list'].values[0].columns

Index(['ADANIENT.NS', 'ADANIPORTS.NS', 'APOLLOHOSP.NS', 'ASIANPAINT.NS',
       'AXISBANK.NS', 'BAJAJ-AUTO.NS', 'BAJAJFINSV.NS', 'BAJFINANCE.NS',
       'BHARTIARTL.NS', 'BPCL.NS', 'BRITANNIA.NS', 'CIPLA.NS', 'COALINDIA.NS',
       'DIVISLAB.NS', 'DRREDDY.NS', 'EICHERMOT.NS', 'GRASIM.NS', 'HCLTECH.NS',
       'HDFCBANK.NS', 'HEROMOTOCO.NS', 'HINDALCO.NS', 'HINDUNILVR.NS',
       'ICICIBANK.NS', 'INDUSINDBK.NS', 'INFY.NS', 'ITC.NS', 'JSWSTEEL.NS',
       'KOTAKBANK.NS', 'LT.NS', 'M&M.NS', 'MARUTI.NS', 'NESTLEIND.NS',
       'NTPC.NS', 'ONGC.NS', 'POWERGRID.NS', 'RELIANCE.NS', 'SBIN.NS',
       'SUNPHARMA.NS', 'TATACONSUM.NS', 'TATAMOTORS.NS', 'TATASTEEL.NS',
       'TCS.NS', 'TECHM.NS', 'TITAN.NS', 'ULTRACEMCO.NS', 'UPL.NS',
       'WIPRO.NS'],
      dtype='object', name='tic')

In [26]:
len(train.loc[0,:]['return_list'].values[0].columns)

47

Here is the definition of the environment.

In [27]:
class StockPortfolioEnv(gym.Env):
    """A single stock trading environment for OpenAI gym

    Attributes
    ----------
        df: DataFrame
            input data
        stock_dim : int
            number of unique stocks
        hmax : int
            maximum number of shares to trade
        initial_amount : int
            start money
        transaction_cost_pct: float
            transaction cost percentage per trade
        reward_scaling: float
            scaling factor for reward, good for training
        state_space: int
            the dimension of input features
        action_space: int
            equals stock dimension
        tech_indicator_list: list
            a list of technical indicator names
        turbulence_threshold: int
            a threshold to control risk aversion
        day: int
            an increment number to control date

    Methods
    -------
    _sell_stock()
        perform sell action based on the sign of the action
    _buy_stock()
        perform buy action based on the sign of the action
    step()
        at each step the agent will return actions, then
        we will calculate the reward, and return the next observation.
    reset()
        reset the environment
    render()
        use render to return other functions
    save_asset_memory()
        return account value at each time step
    save_action_memory()
        return actions/positions at each time step


    """
    metadata = {'render.modes': ['human']}

    def __init__(self,
                df,
                stock_dim,
                hmax,
                initial_amount,
                transaction_cost_pct,
                reward_scaling,
                state_space,
                action_space,
                tech_indicator_list,
                Rebalance=False,
                turbulence_threshold=None,
                lookback=252,
                day = 0, hist_vol = None ):
        #super(StockEnv, self).__init__()
        #money = 10 , scope = 1
        self.day = day
        self.lookback=lookback
        self.df = df
        self.stock_dim = stock_dim
        self.hmax = hmax
        self.initial_amount = initial_amount
        self.transaction_cost_pct = transaction_cost_pct
        self.reward_scaling = reward_scaling
        self.state_space = state_space
        self.action_space = action_space
        self.tech_indicator_list = tech_indicator_list
        self.rebalance = Rebalance
        self.DSR_A = 0.0
        self.DSR_B = 0.0
        self.Return_queue = deque([0]*50, maxlen=50)
        self.hist_vol= hist_vol
        # action_space normalization and shape is self.stock_dim
        self.action_space = spaces.Box(low = 0, high = 1,shape = (self.action_space,))
        # Shape = (34, 30)
        # covariance matrix + technical indicators + ESG (4). Ojo, no funciona meter aqui el shape bueno. Esto puede causar problemas.

        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape = (self.state_space + 1 +len(self.tech_indicator_list), self.state_space))

        # load data from a pandas dataframe
        self.data = self.df.loc[self.day,:]
        self.covs = self.data['cov_list'].values[0]

        self.state = np.append(np.array(self.covs), [self.data[tech].values.tolist() for tech in self.tech_indicator_list ], axis=0)
        hist_vollll= self.hist_vol.values[self.day,:]
        self.state = np.concatenate([self.state, hist_vollll.reshape(1,-1) ], axis=0)


        self.terminal = False
        self.turbulence_threshold = turbulence_threshold
        # initalize state: inital portfolio return + individual stock return + individual weights
        self.portfolio_value = self.initial_amount

        # memorize portfolio value each step
        self.asset_memory = [self.initial_amount]
        # memorize portfolio return each step
        self.portfolio_return_memory = [0]
        self.actions_memory=[[1/self.stock_dim]*self.stock_dim]
        self.date_memory=[self.data.date.unique()[0]]


    def step(self, actions):
        # print(f" the len of the df is  {len(self.df.index.unique())}  and the current day is :  {self.day } and  if  terminal is  : { self.day >= len(self.df.index.unique()) - 1 }")

        self.terminal = self.day >= len(self.df.index.unique())-1
        # print(actions)

        if self.terminal:
            df = pd.DataFrame(self.portfolio_return_memory)
            df.columns = ['daily_return']
            # plt.plot(df.daily_return.cumsum(),'r')
            # plt.savefig('results/cumulative_reward.png')
            # plt.close()

            # plt.plot(self.portfolio_return_memory,'r')
            # plt.savefig('results/rewards.png')
            # plt.close()

            print("=================================")
            print("begin_total_asset:{}".format(self.asset_memory[0]))
            print("end_total_asset:{}".format(self.portfolio_value))

            df_daily_return = pd.DataFrame(self.portfolio_return_memory)
            df_daily_return.columns = ['daily_return']
            if df_daily_return['daily_return'].std() !=0:
              sharpe = (252**0.5)*df_daily_return['daily_return'].mean()/ \
                       df_daily_return['daily_return'].std()
              print("Sharpe: ",sharpe)
            print("=================================")

            return self.state, self.reward, self.terminal,{}

        else:
            #print("Model actions: ",actions)
            # actions are the portfolio weight
            # normalize to sum of 1
            #if (np.array(actions) - np.array(actions).min()).sum() != 0:
            #  norm_actions = (np.array(actions) - np.array(actions).min()) / (np.array(actions) - np.array(actions).min()).sum()
            #else:
            #  norm_actions = actions
            weights = self.softmax_normalization(actions)

            ## Repair Mechanism
            # weights = self.repair_portfolio(actions, 15, 0.01, 1)
            # print('Weights', weights)
            # print('Weights',np.sum(weights))
            # ## Rebalancing of the Portfolio Weights
            # print('Actions', actions)
            # weights = self.softmax_normalization(actions)
            # print('Weights',weights)
            # if self.rebalance == True:
            #   if self.actions_memory:
            #       w_old = self.actions_memory[-1]
            #   else:
            #       w_old = np.zeros(self.stock_dim)

            #   rebalance_threshold = 0.01
            #   del_w = weights - w_old
            #   del_w[np.abs(del_w) < rebalance_threshold] = 0
            #   new_w = w_old + del_w

            #   different_mask = w_old != new_w
            #   sum_same = np.sum(new_w[~different_mask])
            #   rem_balance = 1 - sum_same

            #   different_entries = new_w[different_mask]
            #   sum_diff = np.sum(different_entries)
            #   if sum_diff != 0:
            #       normalized_diff = different_entries / sum_diff
            #   else:
            #       normalized_diff = different_entries

            #   new_w[different_mask] = normalized_diff*rem_balance

            #   weights = new_w

            #print("Weights: ",weights)

            #print("Weights: ",weights
            #print("Normalized actions: ", weights)
            self.actions_memory.append(weights)
            last_day_memory = self.data

            #load next state
            self.day = self.day +  1
            self.data = self.df.loc[self.day,:]
            self.covs = self.data['cov_list'].values[0]
            self.state =  np.append(np.array(self.covs), [self.data[tech].values.tolist() for tech in self.tech_indicator_list ], axis=0)
            #print(self.state)
            hist_volll = self.hist_vol.values[self.day,:]
            self.state = np.concatenate([self.state, hist_volll.reshape(1,-1) ], axis=0)

            #print(self.state)
            portfolio_return = sum(((self.data.close.values / last_day_memory.close.values)-1)*weights)
            self.Return_queue.appendleft(portfolio_return)

            #...Weights tbc by investor´s preference
            # portfolio_return = sum(((self.data.close.values / last_day_memory.close.values)-1)*weights)
            # update portfolio value
            if self.rebalance == True:
              new_portfolio_value = self.portfolio_value*(1+portfolio_return) - np.sum(self.transaction_cost_pct*self.portfolio_value*del_w)
            else:
              new_portfolio_value = self.portfolio_value*(1+portfolio_return) - np.sum(self.transaction_cost_pct*self.portfolio_value*weights)

            rew = 0
            for i in range(self.covs.shape[0]):
              for j in range(self.covs.shape[1]):
                rew = rew + weights[i]*weights[j]*self.covs[i][j]

            #Aqui es donde hay que ponderar el ESG.
            self.portfolio_value = new_portfolio_value
            old_portfolio_value = self.asset_memory[-1]

            # save into memory
            self.portfolio_return_memory.append(portfolio_return)
            self.date_memory.append(self.data.date.unique()[0])
            self.asset_memory.append(new_portfolio_value)
            # Calculate Transaction Fee
            phi = 0.0025  # 0.25% transaction cost
            # # Reshape portfolio_value to match dimensions of other arrays
            # portfolio_value_reshaped = np.repeat(self.portfolio_value, len(weights))
            transaction_fee = phi * sum(
                abs(weights * new_portfolio_value * last_day_memory.close.values / self.data.close.values
                    - self.actions_memory[-2] * portfolio_value_reshaped)  # Use portfolio_value_reshaped
            )

            # the reward is the new portfolio value or end portfolo value
            self.reward = new_portfolio_value - old_portfolio_value - transaction_fee
            # self.reward = new_portfolio_value - transaction_fee   # normal portfolio return
            # self.reward = np.log(new_portfolio_value/ old_portfolio_value)      # log return of portfolio
            # self.reward = self.calculate_DSR(new_portfolio_value)             # Differential Sharpe ratio
            # self.reward = -((max(self.asset_memory) - new_portfolio_value)/ (max(self.asset_memory) + 1e-7)) * 100          # drawdown ( try to bring the gap between highest and current value to 0)
            # self.reward = self.calculate_MDDR(new_portfolio_value)             # MDD with return
            # self.reward = -rew*1000
            #print("Step reward: ", self.reward)
            #self.reward = self.reward*self.reward_scaling
            # r1 = self.calculate_DSR(new_portfolio_value)
            # r2 = np.log(new_portfolio_value/ old_portfolio_value)
            # r3 = ((max(self.asset_memory) - new_portfolio_value)/ (max(self.asset_memory) + 1e-7)) * 100
            # self.reward = 0.25*r1 + 0.415*r2 - 0.335*r3

        return self.state, self.reward, self.terminal, {}

    def reset(self):
        self.asset_memory = [self.initial_amount]
        self.day = 0
        self.data = self.df.loc[self.day,:]
        # load states
        self.covs = self.data['cov_list'].values[0]
        self.state =  np.append(np.array(self.covs), [self.data[tech].values.tolist() for tech in self.tech_indicator_list ], axis=0)

        hist_voll = self.hist_vol.values[self.day,:]
        self.state = np.concatenate([self.state, hist_voll.reshape(1,-1) ], axis=0)

        self.portfolio_value = self.initial_amount
        #self.cost = 0
        #self.trades = 0
        self.DSR_A = 0.0
        self.DSR_B = 0.0
        self.terminal = False
        self.portfolio_return_memory = [0]
        self.actions_memory=[[1/self.stock_dim]*self.stock_dim]
        self.date_memory=[self.data.date.unique()[0]]
        return self.state

    def render(self, mode='human'):
        return self.state

    def softmax_normalization(self, actions):
        numerator = np.exp(actions)
        denominator = np.sum(np.exp(actions))
        softmax_output = numerator/denominator
        return softmax_output


    def save_asset_memory(self):
        date_list = self.date_memory
        portfolio_return = self.portfolio_return_memory
        #print(len(date_list))
        #print(len(asset_list))
        df_account_value = pd.DataFrame({'date':date_list,'daily_return':portfolio_return})
        return df_account_value

    def save_action_memory(self):
        # date and close price length must match actions length
        date_list = self.date_memory
        df_date = pd.DataFrame(date_list)
        df_date.columns = ['date']

        action_list = self.actions_memory
        df_actions = pd.DataFrame(action_list)
        df_actions.columns = self.data.tic.values
        df_actions.index = df_date.date
        #df_actions = pd.DataFrame({'date':date_list,'actions':action_list})
        return df_actions

    def _seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def get_sb_env(self):
        e = DummyVecEnv([lambda: self])
        obs = e.reset()
        return e, obs

    def calculate_DSR(self, R):
      eta = 0.004
      delta_A = R - self.DSR_A
      delta_B = R**2 - self.DSR_B
      Dt = (self.DSR_B*delta_A - 0.5*self.DSR_A*delta_B) / ((self.DSR_B-self.DSR_A**2)**(3/2) + 1e-6)
      self.DSR_A = self.DSR_A + eta*delta_A
      self.DSR_B = self.DSR_B + eta*delta_B
      return(Dt)

    def calculate_MDDR(self, R):
      k = 1
      a = 3
      mdd = ((max(self.asset_memory) - R)/ (max(self.asset_memory) + 1e-7)) * 100
      mddr = (1 / (1 + np.exp(-R))) * (-np.exp(mdd) + np.exp(a))
      return(mddr)

    def repair_portfolio(self, actions, K, l, u):
      """
      Parameters:
          actions (numpy array): Raw action vector from DRL agent.
          K (int): Maximum number of selected assets (cardinality constraint).
          lower_bounds (numpy array): Minimum allowed weights (l_i).
          upper_bounds (numpy array): Maximum allowed weights (u_i).

      Returns:
          numpy array: Adjusted and valid portfolio weights.
      """

      # Step 1: Get indices of Top-K values (sorted in descending order)
      top_k_indices = np.argsort(actions)[::-1][:K]  # Get indices of top K largest values

      # Step 2: Extract Top-K values using these indices
      weights = actions[top_k_indices]
      weights = np.maximum(weights, 0)  # Replace negative values with 0
      # Step 3: Apply a mathematical operation (Softmax normalization)
      sum_weights = np.sum(weights)

      # Step 4: Apply a mathematical operation (Repair mechanism)
      if sum_weights > 1:
          # If sum of weights is greater than 1, reduce values proportionally
          modified_values = []
          for i in weights:
              numerator = (i - 0.01)
              denominator = np.sum(weights - l)  # Summation over j
              repaired_value = 0.01 + (numerator / denominator) * (1 - len(weights) * 0.01)
              modified_values.append(repaired_value)

      elif sum_weights < 1:
          # If sum of weights is less than 1, increase values proportionally
          modified_values = []
          for i in weights:
              numerator = (1 - i)
              denominator = np.sum(u - weights)  # Summation over j
              repaired_value = 1 - (numerator / denominator) * (len(weights) * 1 - 1)
              modified_values.append(repaired_value)

      else:
          # If sum of weights is exactly 1, no modification is needed
          modified_values = weights

      # Convert to NumPy array
      modified_values = np.array(modified_values)
      modified_values = np.clip(modified_values, 0.01, 1) # Bound on K assets
      # Step 4: Create a new array and assign 0 to all positions
      result_arr = np.full_like(actions, 0)  # Initialize everything with 0

      # Step 5: Assign modified Softmax values to the Top-K indices
      result_arr[top_k_indices] = modified_values  # Assign only to top K elements

      # Step 6: Identify fixed values (0)
      fixed_mask = (result_arr == 0.01)  # Boolean mask where values are exactly 0
      fixed_sum = np.sum(result_arr[fixed_mask])  # Sum of fixed values

      # Step 7: Compute current sum and required adjustment
      total_sum = np.sum(result_arr)
      adjustable_sum = total_sum - fixed_sum  # Sum of changeable values
      target_sum = 1 - fixed_sum  # Required sum for adjustable values

      # Step 8: Rescale only the non-fixed values
      result_arr[~fixed_mask] *= (target_sum / adjustable_sum)  # Scale non-fixed values

      return result_arr

In [28]:
stock_dimension = len(train.tic.unique())
state_space = stock_dimension
print(f"Stock Dimension: {stock_dimension}, State Space: {state_space}")

Stock Dimension: 47, State Space: 47


In [29]:
turbulence_threshold= 0.03
env_kwargs = {
    "hmax": 100,
    "initial_amount": 1000000,
    "transaction_cost_pct": 0.001,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "Rebalance":False,
    "action_space": stock_dimension,
    "reward_scaling": 1e-4,
    "turbulence_threshold": turbulence_threshold,
    "hist_vol": hist_vol_train
}

In [30]:
trade_env_kwargs = {
    "hmax": 100,
    "initial_amount": 1000000,
    "transaction_cost_pct": 0.001,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "Rebalance":False,
    "action_space": stock_dimension,
    "reward_scaling": 1e-4,
    "turbulence_threshold": turbulence_threshold,
    "hist_vol": hist_vol_trade
}

In [31]:
# # Uncomment you want results without Transection Cost
# env_kwargs = {
#     "hmax": 100,
#     "initial_amount": 1000000,
#     "transaction_cost_pct": 0,
#     "state_space": state_space,
#     "stock_dim": stock_dimension,
#     "tech_indicator_list": INDICATORS,
#     "Rebalance":False,
#     "action_space": stock_dimension,
#     "reward_scaling": 1e-4,
# }

# trade_env_kwargs = {
#     "hmax": 100,
#     "initial_amount": 1000000,
#     "transaction_cost_pct": 0,
#     "state_space": state_space,
#     "stock_dim": stock_dimension,
#     "tech_indicator_list": INDICATORS,
#     "Rebalance":True,
#     "action_space": stock_dimension,
#     "reward_scaling": 1e-4,
# }

In [32]:
e_train_gym = StockPortfolioEnv(df = train, **env_kwargs)
env_train, _ = e_train_gym.get_sb_env()

e_val_gym = StockPortfolioEnv(df = val, **trade_env_kwargs)
env_val, _ = e_val_gym.get_sb_env()

e_train_full_gym = StockPortfolioEnv(df = full_train, **env_kwargs)
env_full_train, _ = e_train_full_gym.get_sb_env()

e_trade_gym = StockPortfolioEnv(df = trade, **trade_env_kwargs)
env_trade, _ = e_trade_gym.get_sb_env()



<a id='5'></a>
# Part 6: Implement DRL Algorithms
* ## PPO

In [33]:
import torch
import  torch.nn as nn
import torch.optim as optim
from torch.nn import Flatten
from scipy.stats import multivariate_normal
from torch.autograd import grad

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim, num_layers, act_fn, dr):
        super(Actor, self).__init__()
        self.action_dim = action_dim  # Store action dimension
        layers = []

        if act_fn == 'relu': activation_fn = nn.ReLU()
        if act_fn == 'tanh': activation_fn = nn.Tanh()
        if act_fn == 'sigmoid': activation_fn = nn.Sigmoid()

        # Add input layer
        layers.append(nn.Flatten())
        layers.append(nn.Linear(state_dim, hidden_dim))
        layers.append(activation_fn)
        layers.append(nn.Dropout(p=dr))

        # Add hidden layers
        for _ in range(num_layers - 2):  # -2 because we already added the input and output layers
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(activation_fn)
            layers.append(nn.Dropout(p=dr))

        # Add output layer
        layers.append(nn.Linear(hidden_dim, action_dim))

        # Create the sequential model
        self.model = nn.Sequential(*layers)

        logstds_param = nn.Parameter(torch.full((action_dim,), -0.5))
        self.register_parameter("logstds", logstds_param)

    def forward(self, state):
        # print(" Actor forward ((((((((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))))))))")
        # 🔍 Print debug info to check tensor shapes
        # print("state :: ", type(state) , state.shape)

        x = self.model(torch.tensor(state))
        # print(" x from model:: ", type(x) , x.shape)

        means = torch.tanh(x)
        stds = torch.clamp(self.logstds.exp(), 1e-3, 0.5)
        # cov_mat = torch.diag_embed(stds)
        cov_mat = torch.diag_embed(stds) + 1e-6 * torch.eye(self.action_dim) # Add this line
        return torch.distributions.MultivariateNormal(means, cov_mat)

    # def forward(self, state):
    #   x = self.model(state)
    #   means = torch.tanh(x)
    #   if torch.isnan(means).any() or torch.isinf(means).any():
    #     print("Warning: NaN detected in means! Clamping values.")
    #     means = torch.clamp(means, -1e3, 1e3)
    #   stds = torch.clamp(self.logstds.exp(), 1e-3, 0.5)
    #   # cov_mat = torch.diag_embed(stds)
    #   cov_mat = torch.diag_embed(stds) + 1e-6 * torch.eye(self.action_dim) # Add this line
    #   return torch.distributions.MultivariateNormal(means, cov_mat)




class Critic(nn.Module):
    def __init__(self, state_dim,action_dim, hidden_dim, num_layers, act_fn, dr):
        super(Critic, self).__init__()

        layers = []

        if act_fn == 'relu': activation_fn = nn.ReLU()
        if act_fn == 'tanh': activation_fn = nn.Tanh()
        if act_fn == 'sigmoid': activation_fn = nn.Sigmoid()

        # Add input layer
        hidden_dim = int(hidden_dim)
        num_layers = int(num_layers)
        action_dim = int(action_dim)
        state_dim = int(state_dim)



        layers.append(nn.Linear(state_dim + action_dim, hidden_dim))
        layers.append(activation_fn)
        layers.append(nn.Dropout(p=dr))


        # Add hidden layers
        for _ in range(num_layers - 2):  # -2 because we already added the input and output layers
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(activation_fn)
            layers.append(nn.Dropout(p=dr))

        # Add output layer
        layers.append(nn.Linear(hidden_dim, 1))

        # Create the sequential model
        self.model = nn.Sequential(*layers)

    def forward(self, state, action):
        state = state.reshape(state.shape[0], -1)
        state = torch.FloatTensor(state)
        action = torch.FloatTensor(action)
        # print(" Critic forward ((((((((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))))))))")
        # # 🔍 Print debug info to check tensor shapes
        # print("state critc fwd :: ", type(state) , state.shape)
        # print("action critic fwd :: ", type(action) , action.shape)

        x = torch.cat([state, action], dim=1)
        # x = self.model(x)
        # return x

        # # 🔄 Flatten state if it has more than 2 dimensions (CNN case)
        # if state.dim() > 2:
        #     state = state.view(state.shape[0], -1)  # Convert to (batch_size, features)

        # # 🔄 Ensure action is 2D
        # if action.dim() > 2:
        #     action = action.view(action.shape[0], -1)  # Convert to (batch_size, action_dim)

        # # 🔍 Print final shapes
        # # print(f"State shape after reshape: {state.shape}, Action shape after reshape: {action.shape}")

        # # ✅ Now both state and action are 2D → Safe to concatenate
        # x = torch.cat([state, action], dim=1)
        # print(" return  x  critic fwd:: ", x.shape)
        # Forward pass through Critic layers
        x = self.model(x)
        # print(" return  x  critic fwd after passing to model :: ", x.shape)
        return x



class CostNetwork(nn.Module):
    """
    Neural network for estimating portfolio risk (cost).
    """
    def __init__(self, state_dim, action_dim, hidden_dim,num_layers, act_fn, dr):
        super(CostNetwork, self).__init__()

        state_dim=int(state_dim)
        action_dim=int(action_dim)
        hidden_dim=int(hidden_dim)
        # print("state_dim:", state_dim)
        # print("action_dim:", action_dim)
        # print("hidden_dim:", hidden_dim)

        self.model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Outputs cost estimate
        )

    def forward(self, state, action):
        """
        Forward pass for the cost network.

        Computes:
        c_wv(s, a) = E[VaR(s, a)]  (Eq. 19 in the paper)

        Args:
        - state (torch.Tensor): State tensor with shape [batch_size, *]
        - action (torch.Tensor): Action tensor with shape [batch_size, action_dim]

        Returns:
        - Cost estimation (torch.Tensor)
        """

        # print(" cost network forward ((((((((((((((((((((((((((((((((((((((()))))))))))))))))))))))))))))))))))))))")
        # 🔍 Print debug info to check tensor shapes
        # print("state :: ", type(state) , state.shape)
        # print("action :: ", type(action) , action.shape)
        # 🔄 Flatten state if it has more than 2 dimensions
        if state.dim() > 2:
            state = state.view(state.shape[0], -1)  # Reshape to [batch_size, flattened_features]

        # 🔄 Ensure action is 2D
        if action.dim() > 2:
            action = action.view(action.shape[0], -1)  # Reshape to [batch_size, action_dim]

        # 🔍 Print final shapes
        # print(f"State shape after reshape: {state.shape}, Action shape after reshape: {action.shape}")

        # ✅ Now both state and action are 2D → Safe to concatenate
        x = torch.cat([state, action], dim=1)
        # Forward pass through the Cost network
        return self.model(x)





In [34]:
class PPOagent:
  def __init__(self, env, params, eps_clip=0.2):
    # Params
    self.num_states = env.observation_space.shape[0]*env.observation_space.shape[1]
    self.num_actions = env.action_space.shape[0]
    self.gamma = params['gamma']
    self.env = env
    # self.PPO_epochs = int(params['PPO_epochs'])
    self.PPO_epochs = 10
    self.value_coeff = params['val_coeff']
    self.ent_coeff = params['ent_coeff']
    self.eps_clip = eps_clip

    # constraint params--
    self.rho = 0.01
    self.violations= 0
    self.zeta= env.envs[0].turbulence_threshold
    self.lambda_ = torch.tensor(0.01, requires_grad=False).to(device)

    # Networks
    self.policy = Actor(self.num_states, self.num_actions, int(params['Ahidden_dim']), int(params['Anum_layers']), params['Aact_fn'], params['Adr']).to(device)
    self.critic = Critic(self.num_states, self.num_actions, int(params['Chidden_dim']), int(params['Cnum_layers']), params['Cact_fn'], params['Cdr']).to(device)
    self.cost_network = CostNetwork(self.num_states, self.num_actions, params['Costhidden_dim'], int (params['Costnum_layers']),params['Costact_fn'], params['Cdr']  ).to(device)
    self.cost_target = CostNetwork(self.num_states, self.num_actions, int(params['Costhidden_dim']), int (params['Costnum_layers']),params['Costact_fn'], params['Costdr']).to(device)
    self.critic_criterion = nn.MSELoss()
    self.cost_criterion = nn.MSELoss()

    # Training
    self.optimizer = torch.optim.Adam([
                        {'params': self.policy.parameters(), 'lr': params['alr']},
                        {'params': self.critic.parameters(), 'lr': params['clr']},
                        {'params': self.cost_network.parameters(), 'lr': params['costlr']}])


    self.actor_optimizer = optim.Adam(self.policy.parameters(), lr=params['alr'])
    self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=params['clr'])
    self.cost_optimizer = optim.Adam(self.cost_network.parameters(), lr=params['clr'])
    self.MseLoss = nn.MSELoss().to(device)


  def get_action(self, state):
    # state_tensor = torch.FloatTensor(state).to(device)
    norm_dists = self.policy(state)
    action = torch.tanh(norm_dists.sample())
    logs_probs = norm_dists.log_prob(action)
    entropy = norm_dists.entropy()
    return action, logs_probs, entropy


  def calculate_returns(self, rewards, discount_factor):
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + R * discount_factor
        returns.insert(0, R)

    return torch.tensor(returns)

  def VaR(self, states, actions, confidence_level=0.95):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        actions = actions.to(device)
        states = states.to(device)  # assume actions is already on the correct device

        batch_size = states.shape[0]  # ✅ Do NOT use `.to(device)` here
        num_assets = 30

        states = states.squeeze(1).to(device)  # [batch_size, 38, 30]
        states_n = states  # already squeezed

        cov_matrix = states[:, :num_assets, :].to(device)  # [batch_size, 30, 30]
        hist_volatility = states_n[:, -1, :].to(device)  # [batch_size, 30]

        z_score = torch.tensor(1.645, device=device)  # ✅ place tensor on the same device
        individual_VaR = z_score * hist_volatility  # [batch_size, 30]

        VaR_portfolio = torch.zeros(batch_size, device=device)  # ✅ directly initialize on device

        for i in range(num_assets):
            for j in range(num_assets):
                VaR_portfolio = VaR_portfolio + (
                    actions[:, i] * individual_VaR[:, i] *
                    actions[:, j] * individual_VaR[:, j] * cov_matrix[:, i, j]
                )

        return VaR_portfolio


  def compute_cost_target(self, states, actions, next_states ):
        """
        Compute the target cost using the Bellman equation.

        Equation (20):
        c_{w_v}(s, a) = VaR(s, a) + \eta (1 - d) c'_{w_v'}(s', a')
        """

        next_actions = self.policy.forward(next_states)  # π'(s')
        next_actions = next_actions.sample().detach()

        # print(" in compute_cost_target  next_actions.shape :: ", next_actions.shape)
        #print(" in compute_cost_target  next_actions.shape :: ", next_actions.shape)

        next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')
        # print(" in compute_cost_target  nect cost .shape  ::  ", next_cost.shape)

        cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost
        # print(" in compute_cost_target  cost_target.shape :: ", cost_target.shape)
        return cost_target


  def update_policy(self, states, actions, log_prob_actions, advantages, returns):
    old_states = states.detach().to(device)
    old_actions = actions.detach().to(device)
    old_log_probs = log_prob_actions.detach().to(device)
    advantages = advantages.detach().to(device)
    returns = returns.detach().to(device)


    # print("old_states ::" , old_states.shape)
    # print("old_actions ::" , old_actions.shape)
    # print("old_log_probs ::" , old_log_probs.shape)
    # print("advantages ::" , advantages.shape)
    # print("returns ::" , returns.shape)



    for _ in range(self.PPO_epochs):
      #get new log prob of old actions for all input states
      action, log_probs_new, entropy = self.get_action(old_states)
      next_state, reward, done, _ = self.env.step(action.detach().numpy())

      # print("action in loop ::" , action.shape)
      # print("log_probs_new in loop::" , log_probs_new.shape)
      # print("entropy in loop ::" , entropy.shape)

      value_pred = self.critic.forward(old_states, action)

      # print("next state in loop :: " , value_pred.shape)



      cost_target = self.compute_cost_target(old_states, old_actions, next_state ).detach()
      # print("cost_tgt in loop  :: ", cost_target.shape)

      # violations --
      violations_count = (cost_target > self.zeta).sum().item()  # Count how many elements violate the constraint
      print(" violations ::: " , violations_count)
      # Update the number of violations
      self.violations  =  self.violations + violations_count

      # C= c(si, ai) − ζ, if c(si, ai) >=  ζ else 0
      constraint_penalty = torch.where(
          cost_target <= self.zeta,
          torch.tensor(0.0, device=cost_target.device, dtype=cost_target.dtype),
          cost_target - self.zeta
      )
      quadratic_penalty = (self.rho / 2) * (constraint_penalty ** 2).mean().clone()
      constraint_penalty =(self.lambda_) * constraint_penalty.mean().clone()
      final_loss= constraint_penalty + quadratic_penalty


      policy_ratio = (log_probs_new - old_log_probs).exp()
      policy_loss_1 = policy_ratio * advantages
      policy_loss_2 = torch.clamp(policy_ratio, min=1.0 - self.eps_clip, max=1.0 + self.eps_clip) * advantages

      policy_loss = - torch.min(policy_loss_1, policy_loss_2).mean()

      value_loss = self.MseLoss(self.critic.forward(old_states, action), returns)

      loss = policy_loss + self.value_coeff * value_loss - self.ent_coeff * entropy.mean()
      loss= -1*loss + final_loss


      # cost loss back --

      cost_pred = self.cost_network.forward(states, actions)
      cost_loss = self.cost_criterion(cost_pred, cost_target)

      #critic loss back --

      next_actions = self.policy.forward(next_state)  # π'(s')
      next_actions = next_actions.sample().detach()
      # Q_target = reward  + self.gamma  * self.critic.forward(torch.tensor(next_state), torch.tensor(next_actions))
      next_state_tensor = torch.tensor(next_state, dtype=torch.float32).to(device)
      next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)
      # print(" Q'  forward critic::  and reward type:: ", type(reward) )
      Q_target = torch.tensor(reward) + self.gamma * self.critic.forward(next_state_tensor, next_action_tensor)

      critic_loss = self.critic_criterion(value_pred, Q_target.detach())



      # take gradient step
      self.actor_optimizer.zero_grad()
      loss.backward()
      self.actor_optimizer.step()

      # take gradient step
      # self.policy_optimizer.zero_grad()
      # loss.backward()
      # self.policy_optimizer.step()

      self.critic_optimizer.zero_grad()
      critic_loss.backward()
      self.critic_optimizer.step()

      self.cost_optimizer.zero_grad()
      cost_loss.backward()
      self.cost_optimizer.step()

  def trade(self, val_env, e_val_gym, n_steps):
    Reward = []
    state = val_env.reset()
    if n_steps == None:
      n_steps = len(e_val_gym.df.index.unique())
    for i in range(n_steps):
      state_tensor = torch.FloatTensor(state).to(device)
      action, logs_probs, _ = self.get_action(state_tensor)
      action = action.cpu()
      next_obs, reward, done, _ = val_env.step(action.detach().numpy())
      Reward.append(reward)

      if i == (n_steps - 2):
          account_memory = val_env.env_method(method_name="save_asset_memory")
          actions_memory = val_env.env_method(method_name="save_action_memory")

      if done[0]:
        print("hit end!")
        # account_memory = val_env.env_method(method_name="save_asset_memory")
        # actions_memory = val_env.env_method(method_name="save_action_memory")
        break
      state = next_obs

    # account_memory = val_env.env_method(method_name="save_asset_memory")
    # actions_memory = val_env.env_method(method_name="save_action_memory")
    return account_memory, actions_memory, sum(Reward)




In [35]:
#device = 'cpu'
# Set the device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [36]:
#Calculate the Sharpe ratio
#This is our objective for tuning
def calculate_sharpe(df):
  #df['daily_return'] = df['account_value'].pct_change(1)
  if df['daily_return'].std() !=0:
    sharpe = (252**0.5)*df['daily_return'].mean()/ \
          df['daily_return'].std()
    return sharpe
  else:
    return 0

In [37]:
space = {
    'Ahidden_dim': hp.quniform('Ahidden_dim', 2, 256,1),
    'Anum_layers': hp.quniform('Anum_layers', 1, 8,1),
    'Chidden_dim': hp.quniform('Chidden_dim', 2, 256, 1),
    'Cnum_layers': hp.quniform('Cnum_layers', 1, 8,1),
    'Costhidden_dim': hp.quniform('Costhidden_dim', 2, 256, 1),
    'Costnum_layers': hp.quniform('Costnum_layers', 1, 8,1),
    'alr': hp.loguniform('alr', -8, -1),
    'clr': hp.loguniform('clr', -8, -1),
    'costlr': hp.loguniform('costlr', -8, -1),

    'gamma': hp.uniform('gamma', 0.9, 0.99),
    'PPO_epochs': hp.quniform('PPO_epochs', 5, 50, 5),
    'val_coeff': hp.uniform('val_coeff', 0.5, 1.0),
    'ent_coeff': hp.uniform('ent_coeff', 0.01, 0.1),
    'Aact_fn': hp.choice('Aact_fn', ['relu', 'tanh', 'sigmoid']),
    'Adr': hp.uniform('Adr', 0, 0.5),
    'Cact_fn': hp.choice('Cact_fn', ['relu', 'tanh', 'sigmoid']),
    'Cdr': hp.uniform('Cdr', 0, 0.5),
    'Costdr': hp.uniform('Costdr', 0, 0.5),
    'Costact_fn': hp.choice('Costact_fn', ['relu', 'tanh', 'sigmoid'])

}

def objective(params):
    print(params)
    model = PPOagent(env_train, params)
    num_steps = 512
    Actions = []
    States = []
    Rewards = []
    Log_probs = []
    Values = []

    state = env_train.reset()
    done = False
    episode_reward = 0

    for i in range(num_steps):
      state_tensor = torch.FloatTensor(state).to(device)



      action, logs_probs, entropy = model.get_action(state_tensor)
      state_value = model.critic(state_tensor, action).to(device)
      States.append(state_tensor)
      Values.append(state_value)

      action = action.cpu()
      next_state, reward, done, _ = env_train.step(action.detach().numpy())

      Actions.append(action)
      Rewards.append(reward)
      Log_probs.append(logs_probs)
      state = next_state

      if done:
        break

    actions = torch.cat(Actions).to(device)
    states = torch.cat(States).to(device)
    values = torch.cat(Values).to(device)
    log_prob_old = torch.cat(Log_probs).to(device)
    returns = (model.calculate_returns(Rewards, model.gamma)).to(device)
    advantages = (returns - values).to(device)
    advantages = ((advantages - advantages.mean()) / (advantages.std() + 1e-5)).to(device)

    model.update_policy(states, actions, log_prob_old, advantages, returns)

    account_memory, actions_memory, rewardd = model.trade(env_val, e_val_gym, None)
    print(rewardd)

    sharpe = calculate_sharpe(account_memory[0])
    return -sharpe

In [38]:
import torch
print(torch.__version__)

2.6.0+cu124


In [39]:
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=Trials())  #max_evals = 500

{'Aact_fn': 'sigmoid', 'Adr': 0.3241205788469952, 'Ahidden_dim': 135.0, 'Anum_layers': 7.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.13219977903530916, 'Chidden_dim': 133.0, 'Cnum_layers': 2.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.1093469078699354, 'Costhidden_dim': 188.0, 'Costnum_layers': 6.0, 'PPO_epochs': 50.0, 'alr': 0.005026871573910045, 'clr': 0.0005067721429633308, 'costlr': 0.0037648676279506553, 'ent_coeff': 0.026693278670735933, 'gamma': 0.9780876184665823, 'val_coeff': 0.7636126944750488}
  0%|          | 0/50 [00:00<?, ?trial/s, best loss=?]

  x = self.model(torch.tensor(state))



 violations ::: 
1
  0%|          | 0/50 [00:05<?, ?trial/s, best loss=?]

  return torch.tensor(returns)

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:872892.8324520792
Sharpe: 
0.7625783125689682
hit end!
[2.2621973e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.47292107748590106, 'Ahidden_dim': 38.0, 'Anum_layers': 6.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.3896814115962629, 'Chidden_dim': 133.0, 'Cnum_layers': 1.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.3379092619063623, 'Costhidden_dim': 19.0, 'Costnum_layers': 1.0, 'PPO_epochs': 40.0, 'alr': 0.3507324145972415, 'clr': 0.17687903765589538, 'costlr': 0.000641959962672039, 'ent_coeff': 0.08245241652106765, 'gamma': 0.9179059463981806, 'val_coeff': 0.5024142023580147}
  2%|▏         | 1/50 [00:09<07:56,  9.72s/trial, best loss: -0.7625783125689682]

  x = self.model(torch.tensor(state))



 violations ::: 
0
  2%|▏         | 1/50 [00:12<07:56,  9.72s/trial, best loss: -0.7625783125689682]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:838729.722590527
Sharpe: 
0.518764460370446
hit end!
[2.2113381e+08]
{'Aact_fn': 'tanh', 'Adr': 0.3234030607996087, 'Ahidden_dim': 148.0, 'Anum_layers': 6.0, 'Cact_fn': 'tanh', 'Cdr': 0.2911838674297453, 'Chidden_dim': 75.0, 'Cnum_layers': 8.0, 'Costact_fn': 'relu', 'Costdr': 0.18244681181102124, 'Costhidden_dim': 44.0, 'Costnum_layers': 5.0, 'PPO_epochs': 20.0, 'alr': 0.0005686430067617193, 'clr': 0.22096300881653866, 'costlr': 0.001734762117575035, 'ent_coeff': 0.039703334973830794, 'gamma': 0.9327859104728524, 'val_coeff': 0.767135163865454}
  4%|▍         | 2/50 [00:15<06:02,  7.56s/trial, best loss: -0.7625783125689682]

  x = self.model(torch.tensor(state))



 violations ::: 
1
  4%|▍         | 2/50 [00:20<06:02,  7.56s/trial, best loss: -0.7625783125689682]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:866355.0293069895
Sharpe: 
0.6996872530359829
hit end!
[2.2591979e+08]
{'Aact_fn': 'relu', 'Adr': 0.37902554156694385, 'Ahidden_dim': 169.0, 'Anum_layers': 3.0, 'Cact_fn': 'relu', 'Cdr': 0.4110830217616581, 'Chidden_dim': 46.0, 'Cnum_layers': 4.0, 'Costact_fn': 'relu', 'Costdr': 0.31749413641819657, 'Costhidden_dim': 23.0, 'Costnum_layers': 1.0, 'PPO_epochs': 25.0, 'alr': 0.003528025475848866, 'clr': 0.006457868222046496, 'costlr': 0.008908896202075156, 'ent_coeff': 0.0494745397419007, 'gamma': 0.924097573805341, 'val_coeff': 0.5029076501000036}
  6%|▌         | 3/50 [00:23<05:58,  7.63s/trial, best loss: -0.7625783125689682]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 violations ::: 
1
  6%|▌         | 3/50 [00:26<05:58,  7.63s/trial, best loss: -0.7625783125689682]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:844602.5318569294
Sharpe: 
0.5531667569629377
hit end!
[2.22537e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.3653868423748056, 'Ahidden_dim': 81.0, 'Anum_layers': 4.0, 'Cact_fn': 'tanh', 'Cdr': 0.11378470370590793, 'Chidden_dim': 197.0, 'Cnum_layers': 3.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.07966431433596177, 'Costhidden_dim': 129.0, 'Costnum_layers': 3.0, 'PPO_epochs': 45.0, 'alr': 0.2255807538132718, 'clr': 0.07863940957968876, 'costlr': 0.05559443615301191, 'ent_coeff': 0.033614156145590934, 'gamma': 0.9590821028148496, 'val_coeff': 0.8469597189012572}
  8%|▊         | 4/50 [00:29<05:18,  6.93s/trial, best loss: -0.7625783125689682]

  x = self.model(torch.tensor(state))



 violations ::: 
1
  8%|▊         | 4/50 [00:33<05:18,  6.93s/trial, best loss: -0.7625783125689682]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:883138.8406473844
Sharpe: 
0.8167760760814715
hit end!
[2.276708e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.2645220560016892, 'Ahidden_dim': 5.0, 'Anum_layers': 5.0, 'Cact_fn': 'relu', 'Cdr': 0.46262060427535867, 'Chidden_dim': 236.0, 'Cnum_layers': 7.0, 'Costact_fn': 'relu', 'Costdr': 0.13215857778049772, 'Costhidden_dim': 64.0, 'Costnum_layers': 4.0, 'PPO_epochs': 25.0, 'alr': 0.004086145886087199, 'clr': 0.22069739893851711, 'costlr': 0.007186427413739048, 'ent_coeff': 0.09271603064682014, 'gamma': 0.9254829714214561, 'val_coeff': 0.6694191765815811}
 10%|█         | 5/50 [00:37<05:27,  7.27s/trial, best loss: -0.8167760760814715]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 10%|█         | 5/50 [00:40<05:27,  7.27s/trial, best loss: -0.8167760760814715]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:841105.1456064438
Sharpe: 
0.5260642561294967
hit end!
[2.2273795e+08]
 12%|█▏        | 6/50 [00:44<05:24,  7.38s/trial, best loss: -0.8167760760814715]

  x = self.model(torch.tensor(state))



{'Aact_fn': 'relu', 'Adr': 0.43484798337181907, 'Ahidden_dim': 7.0, 'Anum_layers': 3.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.46511555605038307, 'Chidden_dim': 209.0, 'Cnum_layers': 4.0, 'Costact_fn': 'tanh', 'Costdr': 0.3485523460577522, 'Costhidden_dim': 128.0, 'Costnum_layers': 2.0, 'PPO_epochs': 30.0, 'alr': 0.0004958642091982442, 'clr': 0.11077945555113615, 'costlr': 0.03490359913593958, 'ent_coeff': 0.09395651655253555, 'gamma': 0.9644234749476328, 'val_coeff': 0.9112215655595758}
 violations ::: 
1
 12%|█▏        | 6/50 [00:49<05:24,  7.38s/trial, best loss: -0.8167760760814715]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:877689.1922010196
Sharpe: 
0.7898764242908118
hit end!
[2.263475e+08]
{'Aact_fn': 'tanh', 'Adr': 0.40212401464374437, 'Ahidden_dim': 192.0, 'Anum_layers': 2.0, 'Cact_fn': 'tanh', 'Cdr': 0.16013884410623525, 'Chidden_dim': 9.0, 'Cnum_layers': 2.0, 'Costact_fn': 'tanh', 'Costdr': 0.24963176809250465, 'Costhidden_dim': 152.0, 'Costnum_layers': 8.0, 'PPO_epochs': 40.0, 'alr': 0.03910122988886434, 'clr': 0.236829954548374, 'costlr': 0.004732838621995647, 'ent_coeff': 0.022678650575779884, 'gamma': 0.9735582956783164, 'val_coeff': 0.7537551492083526}
 14%|█▍        | 7/50 [00:52<05:26,  7.58s/trial, best loss: -0.8167760760814715]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 violations ::: 


  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:886703.6276943127
Sharpe: 
0.8319072800469514
hit end!
[2.2734424e+08]
{'Aact_fn': 'relu', 'Adr': 0.09944732672715084, 'Ahidden_dim': 158.0, 'Anum_layers': 5.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.3043923363064086, 'Chidden_dim': 238.0, 'Cnum_layers': 5.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.28524656731664044, 'Costhidden_dim': 93.0, 'Costnum_layers': 4.0, 'PPO_epochs': 25.0, 'alr': 0.07710495429110925, 'clr': 0.1640003501626101, 'costlr': 0.010717412614178815, 'ent_coeff': 0.08160446621111962, 'gamma': 0.9635133870186947, 'val_coeff': 0.7252612526217972}
 16%|█▌        | 8/50 [00:59<05:09,  7.36s/trial, best loss: -0.8319072800469514]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 16%|█▌        | 8/50 [01:03<05:09,  7.36s/trial, best loss: -0.8319072800469514]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:869854.5144190907
Sharpe: 
0.7287326930753478
hit end!
[2.2330966e+08]
{'Aact_fn': 'relu', 'Adr': 0.3853113131523138, 'Ahidden_dim': 227.0, 'Anum_layers': 8.0, 'Cact_fn': 'tanh', 'Cdr': 0.2935937349912493, 'Chidden_dim': 41.0, 'Cnum_layers': 1.0, 'Costact_fn': 'tanh', 'Costdr': 0.08059992990822046, 'Costhidden_dim': 253.0, 'Costnum_layers': 1.0, 'PPO_epochs': 30.0, 'alr': 0.00818742007644044, 'clr': 0.03844438232877545, 'costlr': 0.10193514457530074, 'ent_coeff': 0.01722844927963187, 'gamma': 0.9181191153964339, 'val_coeff': 0.7185245099282742}
 18%|█▊        | 9/50 [01:08<05:20,  7.83s/trial, best loss: -0.8319072800469514]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 18%|█▊        | 9/50 [01:13<05:20,  7.83s/trial, best loss: -0.8319072800469514]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:830519.5856554566
Sharpe: 
0.45282482354811643
hit end!
[2.2042571e+08]
{'Aact_fn': 'tanh', 'Adr': 0.1686570985793598, 'Ahidden_dim': 245.0, 'Anum_layers': 1.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.1474500688910822, 'Chidden_dim': 221.0, 'Cnum_layers': 4.0, 'Costact_fn': 'tanh', 'Costdr': 0.10057116797104282, 'Costhidden_dim': 106.0, 'Costnum_layers': 7.0, 'PPO_epochs': 15.0, 'alr': 0.016910619688876857, 'clr': 0.08980742439317468, 'costlr': 0.1271359202385507, 'ent_coeff': 0.09946501316678398, 'gamma': 0.957307212467019, 'val_coeff': 0.7708236835278963}
 20%|██        | 10/50 [01:16<05:15,  7.88s/trial, best loss: -0.8319072800469514]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 20%|██        | 10/50 [01:19<05:15,  7.88s/trial, best loss: -0.8319072800469514]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:844372.9015653364
Sharpe: 
0.5592247728040408
hit end!
[2.2237547e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.21895554964119768, 'Ahidden_dim': 17.0, 'Anum_layers': 2.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.3301679655647624, 'Chidden_dim': 164.0, 'Cnum_layers': 2.0, 'Costact_fn': 'tanh', 'Costdr': 0.46090807889576824, 'Costhidden_dim': 81.0, 'Costnum_layers': 4.0, 'PPO_epochs': 50.0, 'alr': 0.10278711294478157, 'clr': 0.001554446330867912, 'costlr': 0.04378905474748026, 'ent_coeff': 0.03212478755802551, 'gamma': 0.9077026026696231, 'val_coeff': 0.9380078405738391}
 22%|██▏       | 11/50 [01:25<05:24,  8.32s/trial, best loss: -0.8319072800469514]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 22%|██▏       | 11/50 [01:28<05:24,  8.32s/trial, best loss: -0.8319072800469514]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:839709.985312272
Sharpe: 
0.5070745196966623
hit end!
[2.198979e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.05319006904257745, 'Ahidden_dim': 156.0, 'Anum_layers': 7.0, 'Cact_fn': 'tanh', 'Cdr': 0.3784458257111509, 'Chidden_dim': 176.0, 'Cnum_layers': 7.0, 'Costact_fn': 'relu', 'Costdr': 0.13949881825878613, 'Costhidden_dim': 236.0, 'Costnum_layers': 1.0, 'PPO_epochs': 40.0, 'alr': 0.0007506952132429582, 'clr': 0.0033908999366316257, 'costlr': 0.006082079927628982, 'ent_coeff': 0.06341431693232702, 'gamma': 0.9576785540423449, 'val_coeff': 0.6825211075938309}
 24%|██▍       | 12/50 [01:31<04:49,  7.61s/trial, best loss: -0.8319072800469514]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 24%|██▍       | 12/50 [01:36<04:49,  7.61s/trial, best loss: -0.8319072800469514]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:892507.9273144087
Sharpe: 
0.8800923337843639
hit end!
[2.2900838e+08]
{'Aact_fn': 'relu', 'Adr': 0.34279097230586797, 'Ahidden_dim': 115.0, 'Anum_layers': 5.0, 'Cact_fn': 'tanh', 'Cdr': 0.08179257700211834, 'Chidden_dim': 139.0, 'Cnum_layers': 6.0, 'Costact_fn': 'relu', 'Costdr': 0.24808155760173445, 'Costhidden_dim': 212.0, 'Costnum_layers': 7.0, 'PPO_epochs': 15.0, 'alr': 0.013527243533873648, 'clr': 0.011187173205424684, 'costlr': 0.0018543399680713958, 'ent_coeff': 0.05518965833800499, 'gamma': 0.9383595093675922, 'val_coeff': 0.7501013357386768}
 26%|██▌       | 13/50 [01:41<05:04,  8.24s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 26%|██▌       | 13/50 [01:45<05:04,  8.24s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:842224.835152609
Sharpe: 
0.5291032932132562
hit end!
[2.2343192e+08]
{'Aact_fn': 'tanh', 'Adr': 0.4168957607047693, 'Ahidden_dim': 246.0, 'Anum_layers': 6.0, 'Cact_fn': 'relu', 'Cdr': 0.2015377704095908, 'Chidden_dim': 176.0, 'Cnum_layers': 5.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.034135612418260686, 'Costhidden_dim': 135.0, 'Costnum_layers': 4.0, 'PPO_epochs': 20.0, 'alr': 0.006959120413873, 'clr': 0.02053534734292499, 'costlr': 0.004269184512209411, 'ent_coeff': 0.06704398887122226, 'gamma': 0.9469098322116992, 'val_coeff': 0.7193002801822872}
 28%|██▊       | 14/50 [01:48<04:46,  7.97s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 28%|██▊       | 14/50 [01:53<04:46,  7.97s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:844778.3254287271
Sharpe: 
0.5487484983210834
hit end!
[2.2077726e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.27889880383226817, 'Ahidden_dim': 205.0, 'Anum_layers': 5.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.46764406020244476, 'Chidden_dim': 145.0, 'Cnum_layers': 4.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.1982502741428523, 'Costhidden_dim': 84.0, 'Costnum_layers': 2.0, 'PPO_epochs': 10.0, 'alr': 0.016806059629392855, 'clr': 0.22661229321752507, 'costlr': 0.001377384626091138, 'ent_coeff': 0.06880817250299527, 'gamma': 0.9201184546099339, 'val_coeff': 0.7706800931489128}
 30%|███       | 15/50 [01:56<04:37,  7.94s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 30%|███       | 15/50 [02:00<04:37,  7.94s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:859770.4435624287
Sharpe: 
0.6534910082014399
hit end!
[2.2324542e+08]
{'Aact_fn': 'tanh', 'Adr': 0.2496018878861448, 'Ahidden_dim': 111.0, 'Anum_layers': 5.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.15803213039171066, 'Chidden_dim': 15.0, 'Cnum_layers': 2.0, 'Costact_fn': 'relu', 'Costdr': 0.039292815535570846, 'Costhidden_dim': 100.0, 'Costnum_layers': 7.0, 'PPO_epochs': 40.0, 'alr': 0.055208879690684894, 'clr': 0.0836998322355585, 'costlr': 0.01998298715034901, 'ent_coeff': 0.03909593488288565, 'gamma': 0.9670260423718355, 'val_coeff': 0.6965636370083746}
 32%|███▏      | 16/50 [02:05<04:38,  8.20s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 violations ::: 
0
 32%|███▏      | 16/50 [02:08<04:38,  8.20s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:856334.7200794902
Sharpe: 
0.6200385441502504
hit end!
[2.2224926e+08]
{'Aact_fn': 'relu', 'Adr': 0.381489133220854, 'Ahidden_dim': 50.0, 'Anum_layers': 5.0, 'Cact_fn': 'relu', 'Cdr': 0.16554651132371645, 'Chidden_dim': 77.0, 'Cnum_layers': 2.0, 'Costact_fn': 'tanh', 'Costdr': 0.4954672921708609, 'Costhidden_dim': 171.0, 'Costnum_layers': 6.0, 'PPO_epochs': 40.0, 'alr': 0.0010909007867083295, 'clr': 0.021525634165090974, 'costlr': 0.02130606841327478, 'ent_coeff': 0.06014605236700256, 'gamma': 0.9773985793799449, 'val_coeff': 0.9119116772685536}
 34%|███▍      | 17/50 [02:11<04:08,  7.52s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 34%|███▍      | 17/50 [02:15<04:08,  7.52s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:859349.1660851578
Sharpe: 
0.6448798082759072
hit end!
[2.251414e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.07412994955407687, 'Ahidden_dim': 11.0, 'Anum_layers': 8.0, 'Cact_fn': 'tanh', 'Cdr': 0.19749045864049003, 'Chidden_dim': 230.0, 'Cnum_layers': 2.0, 'Costact_fn': 'tanh', 'Costdr': 0.1204509894339148, 'Costhidden_dim': 111.0, 'Costnum_layers': 4.0, 'PPO_epochs': 20.0, 'alr': 0.0549790004328749, 'clr': 0.0016398434610126306, 'costlr': 0.01248817435009648, 'ent_coeff': 0.03429986875976132, 'gamma': 0.9525229328849771, 'val_coeff': 0.9055515128479272}
 36%|███▌      | 18/50 [02:18<03:55,  7.37s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 36%|███▌      | 18/50 [02:22<03:55,  7.37s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:861874.855619494
Sharpe: 
0.6713107071998771
hit end!
[2.2485208e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.22111484401621145, 'Ahidden_dim': 64.0, 'Anum_layers': 7.0, 'Cact_fn': 'relu', 'Cdr': 0.14381631856749272, 'Chidden_dim': 170.0, 'Cnum_layers': 3.0, 'Costact_fn': 'relu', 'Costdr': 0.41102846626803974, 'Costhidden_dim': 162.0, 'Costnum_layers': 3.0, 'PPO_epochs': 35.0, 'alr': 0.003777586694306078, 'clr': 0.30642783058649076, 'costlr': 0.008529909864046538, 'ent_coeff': 0.028466499456428422, 'gamma': 0.9798241163777094, 'val_coeff': 0.8429605683583059}
 38%|███▊      | 19/50 [02:25<03:41,  7.14s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 38%|███▊      | 19/50 [02:29<03:41,  7.14s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:887048.2137400723
Sharpe: 
0.8394135358703833
hit end!
[2.2618837e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.03829028094551842, 'Ahidden_dim': 85.0, 'Anum_layers': 7.0, 'Cact_fn': 'relu', 'Cdr': 0.040511625956086206, 'Chidden_dim': 105.0, 'Cnum_layers': 8.0, 'Costact_fn': 'relu', 'Costdr': 0.42022762973150113, 'Costhidden_dim': 248.0, 'Costnum_layers': 2.0, 'PPO_epochs': 35.0, 'alr': 0.0014656456402906877, 'clr': 0.0032987898469708705, 'costlr': 0.0011507816646184233, 'ent_coeff': 0.013834762489514124, 'gamma': 0.9880541343986962, 'val_coeff': 0.6192508097817462}
 40%|████      | 20/50 [02:32<03:39,  7.32s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 40%|████      | 20/50 [02:36<03:39,  7.32s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:859674.20872132
Sharpe: 
0.664308984999411
hit end!
[2.2549907e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.13595656708544598, 'Ahidden_dim': 77.0, 'Anum_layers': 7.0, 'Cact_fn': 'relu', 'Cdr': 0.004502571874019312, 'Chidden_dim': 179.0, 'Cnum_layers': 7.0, 'Costact_fn': 'relu', 'Costdr': 0.3879549063448917, 'Costhidden_dim': 215.0, 'Costnum_layers': 3.0, 'PPO_epochs': 35.0, 'alr': 0.0023415298039192347, 'clr': 0.00044472025691256805, 'costlr': 0.000399545260137225, 'ent_coeff': 0.04665467176292072, 'gamma': 0.9887298415306042, 'val_coeff': 0.575658893273147}
 42%|████▏     | 21/50 [02:41<03:44,  7.73s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 42%|████▏     | 21/50 [02:44<03:44,  7.73s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:880032.2286711881
Sharpe: 
0.7969740466649389
hit end!
[2.2733677e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.024154243059955866, 'Ahidden_dim': 182.0, 'Anum_layers': 8.0, 'Cact_fn': 'relu', 'Cdr': 0.23023541252719418, 'Chidden_dim': 106.0, 'Cnum_layers': 3.0, 'Costact_fn': 'relu', 'Costdr': 0.17935574580728603, 'Costhidden_dim': 226.0, 'Costnum_layers': 3.0, 'PPO_epochs': 45.0, 'alr': 0.0009652288698903267, 'clr': 0.001066942969962437, 'costlr': 0.28908707947593093, 'ent_coeff': 0.07711093626428767, 'gamma': 0.9824396590234076, 'val_coeff': 0.9953131166534643}
 44%|████▍     | 22/50 [02:48<03:33,  7.62s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 44%|████▍     | 22/50 [02:52<03:33,  7.62s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:881586.1902649538
Sharpe: 
0.811334676927172
hit end!
[2.275127e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.18488789850347376, 'Ahidden_dim': 105.0, 'Anum_layers': 7.0, 'Cact_fn': 'tanh', 'Cdr': 0.3813000136511133, 'Chidden_dim': 254.0, 'Cnum_layers': 6.0, 'Costact_fn': 'relu', 'Costdr': 0.007179249883953381, 'Costhidden_dim': 187.0, 'Costnum_layers': 2.0, 'PPO_epochs': 35.0, 'alr': 0.00036839409816840886, 'clr': 0.004748613156939573, 'costlr': 0.002631748429646276, 'ent_coeff': 0.06341199714658041, 'gamma': 0.9462858527899447, 'val_coeff': 0.8376013230830308}
 46%|████▌     | 23/50 [02:57<03:36,  8.00s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 46%|████▌     | 23/50 [03:01<03:36,  8.00s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:842592.232242696
Sharpe: 
0.5312243558709604
hit end!
[2.2187326e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.00448901488376427, 'Ahidden_dim': 53.0, 'Anum_layers': 6.0, 'Cact_fn': 'relu', 'Cdr': 0.3573644222724226, 'Chidden_dim': 195.0, 'Cnum_layers': 3.0, 'Costact_fn': 'relu', 'Costdr': 0.497421792463012, 'Costhidden_dim': 237.0, 'Costnum_layers': 3.0, 'PPO_epochs': 45.0, 'alr': 0.0017450410847565088, 'clr': 0.011385636690569443, 'costlr': 0.020821451960414485, 'ent_coeff': 0.047561986662818675, 'gamma': 0.97342264211318, 'val_coeff': 0.8357091658517036}
 48%|████▊     | 24/50 [03:06<03:36,  8.34s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 48%|████▊     | 24/50 [03:10<03:36,  8.34s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:843029.2909780913
Sharpe: 
0.5447650793485441
hit end!
[2.2190493e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.11663387339529287, 'Ahidden_dim': 129.0, 'Anum_layers': 8.0, 'Cact_fn': 'tanh', 'Cdr': 0.2552610664240856, 'Chidden_dim': 163.0, 'Cnum_layers': 6.0, 'Costact_fn': 'relu', 'Costdr': 0.21038198017894025, 'Costhidden_dim': 194.0, 'Costnum_layers': 5.0, 'PPO_epochs': 30.0, 'alr': 0.0007677162542832253, 'clr': 0.0027477828445451285, 'costlr': 0.0007191678255160823, 'ent_coeff': 0.07277918143210255, 'gamma': 0.9392586191824127, 'val_coeff': 0.6331110590388895}
 50%|█████     | 25/50 [03:14<03:19,  8.00s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 50%|█████     | 25/50 [03:18<03:19,  8.00s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:828084.047942271
Sharpe: 
0.43042072293325817
hit end!
[2.1931938e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.053963326512333404, 'Ahidden_dim': 215.0, 'Anum_layers': 7.0, 'Cact_fn': 'relu', 'Cdr': 0.0848096017298273, 'Chidden_dim': 106.0, 'Cnum_layers': 7.0, 'Costact_fn': 'relu', 'Costdr': 0.4154190998806205, 'Costhidden_dim': 160.0, 'Costnum_layers': 1.0, 'PPO_epochs': 35.0, 'alr': 0.0020106836786045468, 'clr': 0.0007021489516326697, 'costlr': 0.006116462381094554, 'ent_coeff': 0.05548931151640703, 'gamma': 0.9522502747143612, 'val_coeff': 0.5419255250736243}
 52%|█████▏    | 26/50 [03:23<03:19,  8.32s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 52%|█████▏    | 26/50 [03:26<03:19,  8.32s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:845751.8865301943
Sharpe: 
0.5635428146771769
hit end!
[2.2304699e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.1644451602782674, 'Ahidden_dim': 65.0, 'Anum_layers': 4.0, 'Cact_fn': 'tanh', 'Cdr': 0.43706707250420124, 'Chidden_dim': 254.0, 'Cnum_layers': 5.0, 'Costact_fn': 'relu', 'Costdr': 0.1455131085626063, 'Costhidden_dim': 206.0, 'Costnum_layers': 2.0, 'PPO_epochs': 45.0, 'alr': 0.027473309726076766, 'clr': 0.006696477136603977, 'costlr': 0.013071386639800762, 'ent_coeff': 0.010129632202963743, 'gamma': 0.9851123014073374, 'val_coeff': 0.8094750030617207}
 54%|█████▍    | 27/50 [03:32<03:19,  8.65s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 54%|█████▍    | 27/50 [03:36<03:19,  8.65s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:880253.2935837547
Sharpe: 
0.7979909654877367
hit end!
[2.2681674e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.30113856718732734, 'Ahidden_dim': 28.0, 'Anum_layers': 6.0, 'Cact_fn': 'relu', 'Cdr': 0.25974975787645116, 'Chidden_dim': 151.0, 'Cnum_layers': 3.0, 'Costact_fn': 'relu', 'Costdr': 0.2871941678699816, 'Costhidden_dim': 176.0, 'Costnum_layers': 5.0, 'PPO_epochs': 50.0, 'alr': 0.003670487454345124, 'clr': 0.04323675678213453, 'costlr': 0.0035737673422994057, 'ent_coeff': 0.02840293418609416, 'gamma': 0.9705327764993812, 'val_coeff': 0.9918330482964844}
 56%|█████▌    | 28/50 [03:39<03:00,  8.22s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 56%|█████▌    | 28/50 [03:43<03:00,  8.22s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:850769.6690314991
Sharpe: 
0.5995480768221556
hit end!
[2.2363725e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.2166096159885391, 'Ahidden_dim': 141.0, 'Anum_layers': 7.0, 'Cact_fn': 'tanh', 'Cdr': 0.010466760479416304, 'Chidden_dim': 118.0, 'Cnum_layers': 8.0, 'Costact_fn': 'relu', 'Costdr': 0.37471129879334963, 'Costhidden_dim': 230.0, 'Costnum_layers': 3.0, 'PPO_epochs': 35.0, 'alr': 0.0003857567435374549, 'clr': 0.0007096657987156593, 'costlr': 0.0739548245791528, 'ent_coeff': 0.04108954241607278, 'gamma': 0.980782032590796, 'val_coeff': 0.6601211908996809}
 58%|█████▊    | 29/50 [03:47<02:47,  7.97s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 58%|█████▊    | 29/50 [03:50<02:47,  7.97s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:869264.5053462073
Sharpe: 
0.7263165474376765
hit end!
[2.2587474e+08]
 60%|██████    | 30/50 [03:56<02:45,  8.26s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



{'Aact_fn': 'sigmoid', 'Adr': 0.4814644364473647, 'Ahidden_dim': 97.0, 'Anum_layers': 8.0, 'Cact_fn': 'tanh', 'Cdr': 0.3423356114570213, 'Chidden_dim': 180.0, 'Cnum_layers': 6.0, 'Costact_fn': 'relu', 'Costdr': 0.22060642071192244, 'Costhidden_dim': 147.0, 'Costnum_layers': 2.0, 'PPO_epochs': 40.0, 'alr': 0.007019343763931229, 'clr': 0.3491258138867833, 'costlr': 0.24278662948817303, 'ent_coeff': 0.022340236496607744, 'gamma': 0.9013501498911909, 'val_coeff': 0.800091429288754}
 violations ::: 
0
 60%|██████    | 30/50 [04:00<02:45,  8.26s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:879037.5534858562
Sharpe: 
0.7918805234565961
hit end!
[2.2667301e+08]
{'Aact_fn': 'tanh', 'Adr': 0.21326682277810155, 'Ahidden_dim': 121.0, 'Anum_layers': 6.0, 'Cact_fn': 'relu', 'Cdr': 0.21667092473251512, 'Chidden_dim': 124.0, 'Cnum_layers': 1.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.15468127712000773, 'Costhidden_dim': 255.0, 'Costnum_layers': 1.0, 'PPO_epochs': 50.0, 'alr': 0.0025572342307846217, 'clr': 0.0025503719545758867, 'costlr': 0.002591138881068579, 'ent_coeff': 0.0884447620484409, 'gamma': 0.9581022342353276, 'val_coeff': 0.8748722584089157}
 62%|██████▏   | 31/50 [04:04<02:34,  8.15s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 62%|██████▏   | 31/50 [04:07<02:34,  8.15s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:851785.5409905588
Sharpe: 
0.5955249550459483
hit end!
[2.2197306e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.08136929994816228, 'Ahidden_dim': 164.0, 'Anum_layers': 7.0, 'Cact_fn': 'tanh', 'Cdr': 0.4954358375440171, 'Chidden_dim': 81.0, 'Cnum_layers': 5.0, 'Costact_fn': 'relu', 'Costdr': 0.4563711257532941, 'Costhidden_dim': 3.0, 'Costnum_layers': 1.0, 'PPO_epochs': 30.0, 'alr': 0.0007755789744154422, 'clr': 0.019221567916872857, 'costlr': 0.0008363261027493654, 'ent_coeff': 0.06064355813562899, 'gamma': 0.9392065438162083, 'val_coeff': 0.9575189220227637}
 64%|██████▍   | 32/50 [04:11<02:22,  7.92s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 64%|██████▍   | 32/50 [04:14<02:22,  7.92s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:880361.4098824718
Sharpe: 
0.7902201514280235
hit end!
[2.2745573e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.11912816078487788, 'Ahidden_dim': 144.0, 'Anum_layers': 4.0, 'Cact_fn': 'relu', 'Cdr': 0.10914297872004798, 'Chidden_dim': 203.0, 'Cnum_layers': 3.0, 'Costact_fn': 'relu', 'Costdr': 0.301244089290744, 'Costhidden_dim': 194.0, 'Costnum_layers': 6.0, 'PPO_epochs': 40.0, 'alr': 0.0013127811135331922, 'clr': 0.04704057280546318, 'costlr': 0.00044281775019075796, 'ent_coeff': 0.052186026200085375, 'gamma': 0.9785085389886231, 'val_coeff': 0.5876847810681975}
 66%|██████▌   | 33/50 [04:19<02:16,  8.02s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 66%|██████▌   | 33/50 [04:24<02:16,  8.02s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:858795.5471070705
Sharpe: 
0.6407608556985657
hit end!
[2.2529792e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.0031727475322377297, 'Ahidden_dim': 39.0, 'Anum_layers': 6.0, 'Cact_fn': 'tanh', 'Cdr': 0.4054107209135403, 'Chidden_dim': 164.0, 'Cnum_layers': 1.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.3354388008084854, 'Costhidden_dim': 58.0, 'Costnum_layers': 5.0, 'PPO_epochs': 45.0, 'alr': 0.0051267049580492295, 'clr': 0.00663881047940142, 'costlr': 0.006413485908465571, 'ent_coeff': 0.04383294208069613, 'gamma': 0.9313202715195267, 'val_coeff': 0.8736524819081819}
 68%|██████▊   | 34/50 [04:28<02:09,  8.09s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 68%|██████▊   | 34/50 [04:31<02:09,  8.09s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:875738.9789339314
Sharpe: 
0.7809531978378624
hit end!
[2.2652296e+08]
{'Aact_fn': 'relu', 'Adr': 0.30682940421402033, 'Ahidden_dim': 96.0, 'Anum_layers': 3.0, 'Cact_fn': 'relu', 'Cdr': 0.049726389949970154, 'Chidden_dim': 193.0, 'Cnum_layers': 7.0, 'Costact_fn': 'relu', 'Costdr': 0.17124519679037073, 'Costhidden_dim': 33.0, 'Costnum_layers': 3.0, 'PPO_epochs': 25.0, 'alr': 0.0005158679640389777, 'clr': 0.0016690132533975928, 'costlr': 0.026083353929251814, 'ent_coeff': 0.07343079562086786, 'gamma': 0.9684597688516442, 'val_coeff': 0.8033154447782831}
 70%|███████   | 35/50 [04:35<01:57,  7.86s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 70%|███████   | 35/50 [04:38<01:57,  7.86s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:865209.0833034955
Sharpe: 
0.6916657119846588
hit end!
[2.2437237e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.45416712193190034, 'Ahidden_dim': 192.0, 'Anum_layers': 7.0, 'Cact_fn': 'tanh', 'Cdr': 0.2749272409237608, 'Chidden_dim': 210.0, 'Cnum_layers': 4.0, 'Costact_fn': 'relu', 'Costdr': 0.04615867284125548, 'Costhidden_dim': 165.0, 'Costnum_layers': 1.0, 'PPO_epochs': 5.0, 'alr': 0.246304749924112, 'clr': 0.015591634899443518, 'costlr': 0.014488714605481854, 'ent_coeff': 0.037728722227422046, 'gamma': 0.9522146923813087, 'val_coeff': 0.5276860162090399}
 72%|███████▏  | 36/50 [04:41<01:44,  7.49s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 72%|███████▏  | 36/50 [04:46<01:44,  7.49s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:853144.143537551
Sharpe: 
0.601831306224786
hit end!
[2.2242387e+08]
{'Aact_fn': 'tanh', 'Adr': 0.24022294740662517, 'Ahidden_dim': 175.0, 'Anum_layers': 8.0, 'Cact_fn': 'relu', 'Cdr': 0.3136306729510787, 'Chidden_dim': 90.0, 'Cnum_layers': 8.0, 'Costact_fn': 'relu', 'Costdr': 0.26773188335533543, 'Costhidden_dim': 141.0, 'Costnum_layers': 2.0, 'PPO_epochs': 30.0, 'alr': 0.0026693340355279957, 'clr': 0.004429127238702639, 'costlr': 0.0027650246756467834, 'ent_coeff': 0.0860780739658362, 'gamma': 0.9754712442824166, 'val_coeff': 0.6794064088949686}
 74%|███████▍  | 37/50 [04:51<01:44,  8.08s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 74%|███████▍  | 37/50 [04:55<01:44,  8.08s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:843176.0465954454
Sharpe: 
0.5389320288968363
hit end!
[2.2298925e+08]
 76%|███████▌  | 38/50 [04:59<01:36,  8.06s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



{'Aact_fn': 'relu', 'Adr': 0.14398784649310198, 'Ahidden_dim': 69.0, 'Anum_layers': 6.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.38017815223497764, 'Chidden_dim': 129.0, 'Cnum_layers': 3.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.0004171974985628657, 'Costhidden_dim': 120.0, 'Costnum_layers': 3.0, 'PPO_epochs': 45.0, 'alr': 0.009885397676250168, 'clr': 0.0003381082511567805, 'costlr': 0.007436935067782817, 'ent_coeff': 0.024008111363455938, 'gamma': 0.9655050437016424, 'val_coeff': 0.6313276082107493}
 violations ::: 
1
 76%|███████▌  | 38/50 [05:02<01:36,  8.06s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:817905.8492271116
Sharpe: 
0.36562986051671814
hit end!
[2.1871003e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.33681504175407023, 'Ahidden_dim': 155.0, 'Anum_layers': 4.0, 'Cact_fn': 'tanh', 'Cdr': 0.4164586249563645, 'Chidden_dim': 223.0, 'Cnum_layers': 4.0, 'Costact_fn': 'relu', 'Costdr': 0.08831738947623452, 'Costhidden_dim': 181.0, 'Costnum_layers': 1.0, 'PPO_epochs': 40.0, 'alr': 0.0050855351658645066, 'clr': 0.3653309006905451, 'costlr': 0.03840688506696176, 'ent_coeff': 0.017613970243967533, 'gamma': 0.962317253929928, 'val_coeff': 0.8629570531713908}
 78%|███████▊  | 39/50 [05:06<01:24,  7.67s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 78%|███████▊  | 39/50 [05:09<01:24,  7.67s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:872428.4752400889
Sharpe: 
0.7446823108109517
hit end!
[2.2677456e+08]
{'Aact_fn': 'tanh', 'Adr': 0.1847753298173017, 'Ahidden_dim': 131.0, 'Anum_layers': 8.0, 'Cact_fn': 'relu', 'Cdr': 0.1827542891361897, 'Chidden_dim': 52.0, 'Cnum_layers': 7.0, 'Costact_fn': 'tanh', 'Costdr': 0.35882546933147574, 'Costhidden_dim': 204.0, 'Costnum_layers': 4.0, 'PPO_epochs': 25.0, 'alr': 0.026416465303715532, 'clr': 0.15215028370320235, 'costlr': 0.15878612568906228, 'ent_coeff': 0.0799264202153017, 'gamma': 0.9328900407545841, 'val_coeff': 0.6996599018909796}
 80%|████████  | 40/50 [05:14<01:19,  7.91s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 80%|████████  | 40/50 [05:18<01:19,  7.91s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:851255.5402360081
Sharpe: 
0.6005967673806942
hit end!
[2.233294e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.06480036563049023, 'Ahidden_dim': 26.0, 'Anum_layers': 2.0, 'Cact_fn': 'tanh', 'Cdr': 0.12936728648768828, 'Chidden_dim': 153.0, 'Cnum_layers': 5.0, 'Costact_fn': 'relu', 'Costdr': 0.22957793357377457, 'Costhidden_dim': 242.0, 'Costnum_layers': 2.0, 'PPO_epochs': 50.0, 'alr': 0.0031062145941959385, 'clr': 0.030659863088941804, 'costlr': 0.009214063892581122, 'ent_coeff': 0.052390944007989035, 'gamma': 0.9600950869204277, 'val_coeff': 0.7807850584774108}
 82%|████████▏ | 41/50 [05:21<01:08,  7.65s/trial, best loss: -0.8800923337843639]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 82%|████████▏ | 41/50 [05:25<01:08,  7.65s/trial, best loss: -0.8800923337843639]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:898420.6474906552
Sharpe: 
0.9275925440937247
hit end!
[2.3103158e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.06307886263543938, 'Ahidden_dim': 236.0, 'Anum_layers': 2.0, 'Cact_fn': 'tanh', 'Cdr': 0.49578220143364465, 'Chidden_dim': 243.0, 'Cnum_layers': 6.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.2328134778627613, 'Costhidden_dim': 240.0, 'Costnum_layers': 8.0, 'PPO_epochs': 50.0, 'alr': 0.00044589564678289376, 'clr': 0.033195284576825405, 'costlr': 0.0017160569440880153, 'ent_coeff': 0.06453685732586678, 'gamma': 0.955181407381568, 'val_coeff': 0.7366834833556547}
 84%|████████▍ | 42/50 [05:29<01:00,  7.62s/trial, best loss: -0.9275925440937247]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 84%|████████▍ | 42/50 [05:32<01:00,  7.62s/trial, best loss: -0.9275925440937247]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:877102.3433139422
Sharpe: 
0.7659766452322465
hit end!
[2.2520379e+08]
{'Aact_fn': 'relu', 'Adr': 0.09167715237700624, 'Ahidden_dim': 209.0, 'Anum_layers': 1.0, 'Cact_fn': 'tanh', 'Cdr': 0.10366020507295554, 'Chidden_dim': 154.0, 'Cnum_layers': 5.0, 'Costact_fn': 'tanh', 'Costdr': 0.06417408738168762, 'Costhidden_dim': 221.0, 'Costnum_layers': 1.0, 'PPO_epochs': 45.0, 'alr': 0.0006307374557838481, 'clr': 0.009214955766907052, 'costlr': 0.005022049021160749, 'ent_coeff': 0.051912514030726234, 'gamma': 0.9420460149707713, 'val_coeff': 0.7851299120548124}
 86%|████████▌ | 43/50 [05:39<00:57,  8.28s/trial, best loss: -0.9275925440937247]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 86%|████████▌ | 43/50 [05:42<00:57,  8.28s/trial, best loss: -0.9275925440937247]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:894021.7654501856
Sharpe: 
0.8700709129016648
hit end!
[2.3007986e+08]
{'Aact_fn': 'tanh', 'Adr': 0.03389577239357504, 'Ahidden_dim': 255.0, 'Anum_layers': 3.0, 'Cact_fn': 'tanh', 'Cdr': 0.23224599740723187, 'Chidden_dim': 61.0, 'Cnum_layers': 7.0, 'Costact_fn': 'relu', 'Costdr': 0.11098588346723373, 'Costhidden_dim': 245.0, 'Costnum_layers': 2.0, 'PPO_epochs': 50.0, 'alr': 0.013117907356735558, 'clr': 0.06273254210826128, 'costlr': 0.05898007358521277, 'ent_coeff': 0.09447001017120288, 'gamma': 0.9259331580011981, 'val_coeff': 0.5993824189733821}
 88%|████████▊ | 44/50 [05:45<00:47,  7.86s/trial, best loss: -0.9275925440937247]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 88%|████████▊ | 44/50 [05:50<00:47,  7.86s/trial, best loss: -0.9275925440937247]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:852269.9489587676
Sharpe: 
0.6003622095890317
hit end!
[2.2421442e+08]
{'Aact_fn': 'relu', 'Adr': 0.017787929038546993, 'Ahidden_dim': 192.0, 'Anum_layers': 3.0, 'Cact_fn': 'tanh', 'Cdr': 0.3593546679984997, 'Chidden_dim': 188.0, 'Cnum_layers': 8.0, 'Costact_fn': 'relu', 'Costdr': 0.31241345056951253, 'Costhidden_dim': 254.0, 'Costnum_layers': 2.0, 'PPO_epochs': 50.0, 'alr': 0.0011305309222913254, 'clr': 0.02933998613633101, 'costlr': 0.0035580657819749293, 'ent_coeff': 0.05832141636560958, 'gamma': 0.9612382159152294, 'val_coeff': 0.6544436401831679}
 90%|█████████ | 45/50 [05:53<00:39,  7.86s/trial, best loss: -0.9275925440937247]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 90%|█████████ | 45/50 [05:57<00:39,  7.86s/trial, best loss: -0.9275925440937247]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:877753.1889071495
Sharpe: 
0.7858027555022303
hit end!
[2.2663611e+08]
                                                                                  

  x = self.model(torch.tensor(state))



{'Aact_fn': 'sigmoid', 'Adr': 0.10598965419707754, 'Ahidden_dim': 151.0, 'Anum_layers': 2.0, 'Cact_fn': 'tanh', 'Cdr': 0.03984552564434608, 'Chidden_dim': 27.0, 'Cnum_layers': 6.0, 'Costact_fn': 'tanh', 'Costdr': 0.13952884473792337, 'Costhidden_dim': 231.0, 'Costnum_layers': 1.0, 'PPO_epochs': 15.0, 'alr': 0.0007918147406022531, 'clr': 0.12066742280611496, 'costlr': 0.015427269753709867, 'ent_coeff': 0.0711516469905413, 'gamma': 0.9087694525926583, 'val_coeff': 0.6947360062838909}
 violations ::: 
0
 92%|█████████▏| 46/50 [06:05<00:31,  7.85s/trial, best loss: -0.9275925440937247]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:841695.0044823284
Sharpe: 
0.5317358142773084
hit end!
[2.2353965e+08]
{'Aact_fn': 'tanh', 'Adr': 0.1622224066883894, 'Ahidden_dim': 26.0, 'Anum_layers': 1.0, 'Cact_fn': 'sigmoid', 'Cdr': 0.12530836443803742, 'Chidden_dim': 138.0, 'Cnum_layers': 5.0, 'Costact_fn': 'sigmoid', 'Costdr': 0.26519058859873856, 'Costhidden_dim': 200.0, 'Costnum_layers': 1.0, 'PPO_epochs': 45.0, 'alr': 0.0016147661161936655, 'clr': 0.05761074676270496, 'costlr': 0.009760988126160287, 'ent_coeff': 0.09809673414768885, 'gamma': 0.9485107942516954, 'val_coeff': 0.7441186921425519}
 94%|█████████▍| 47/50 [06:09<00:23,  7.91s/trial, best loss: -0.9275925440937247]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 94%|█████████▍| 47/50 [06:12<00:23,  7.91s/trial, best loss: -0.9275925440937247]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:837885.0967945751
Sharpe: 
0.49896097612301626
hit end!
[2.1974589e+08]
{'Aact_fn': 'sigmoid', 'Adr': 0.13379860719986741, 'Ahidden_dim': 120.0, 'Anum_layers': 2.0, 'Cact_fn': 'tanh', 'Cdr': 0.4423151005009963, 'Chidden_dim': 211.0, 'Cnum_layers': 4.0, 'Costact_fn': 'relu', 'Costdr': 0.18888184681366904, 'Costhidden_dim': 124.0, 'Costnum_layers': 5.0, 'PPO_epochs': 40.0, 'alr': 0.12123420394587027, 'clr': 0.0257811152405657, 'costlr': 0.08485145332591788, 'ent_coeff': 0.08687266196653003, 'gamma': 0.9336714487567987, 'val_coeff': 0.5645870252713818}
 96%|█████████▌| 48/50 [06:17<00:15,  7.80s/trial, best loss: -0.9275925440937247]

  x = self.model(torch.tensor(state))



 violations ::: 
1
 96%|█████████▌| 48/50 [06:20<00:15,  7.80s/trial, best loss: -0.9275925440937247]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
 violations ::: 
1
begin_total_asset:1000000
end_total_asset:886016.776798709
Sharpe: 
0.8337459999224861
hit end!
[2.2886163e+08]
{'Aact_fn': 'relu', 'Adr': 0.05649737481932736, 'Ahidden_dim': 2.0, 'Anum_layers': 4.0, 'Cact_fn': 'tanh', 'Cdr': 0.3179091385026155, 'Chidden_dim': 92.0, 'Cnum_layers': 7.0, 'Costact_fn': 'tanh', 'Costdr': 0.23107470496832477, 'Costhidden_dim': 69.0, 'Costnum_layers': 4.0, 'PPO_epochs': 5.0, 'alr': 0.02116783510994095, 'clr': 0.015742317903241945, 'costlr': 0.02744008021438653, 'ent_coeff': 0.0766420865197262, 'gamma': 0.9711755334766529, 'val_coeff': 0.7168914437962098}
 98%|█████████▊| 49/50 [06:24<00:07,  7.61s/trial, best loss: -0.9275925440937247]

  x = self.model(torch.tensor(state))



 violations ::: 
0
 98%|█████████▊| 49/50 [06:28<00:07,  7.61s/trial, best loss: -0.9275925440937247]

  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')

  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost

  return F.mse_loss(input, target, reduction=self.reduction)

  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)



 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
 violations ::: 
0
begin_total_asset:1000000
end_total_asset:865212.2391142743
Sharpe: 
0.6920900317004034
hit end!
[2.2519674e+08]
100%|██████████| 50/50 [06:31<00:00,  7.83s/trial, best loss: -0.9275925440937247]


  x = self.model(torch.tensor(state))



In [40]:
best['Aact_fn'] = ['relu', 'tanh', 'sigmoid'][best['Aact_fn']]
best['Cact_fn'] = ['relu', 'tanh', 'sigmoid'][best['Cact_fn']]
best = {k: v.item() if isinstance(v, (np.floating, np.integer)) else v for k, v in best.items()}
best

{'Aact_fn': 'sigmoid',
 'Adr': 0.06480036563049023,
 'Ahidden_dim': 26.0,
 'Anum_layers': 2.0,
 'Cact_fn': 'tanh',
 'Cdr': 0.12936728648768828,
 'Chidden_dim': 153.0,
 'Cnum_layers': 5.0,
 'Costact_fn': 0,
 'Costdr': 0.22957793357377457,
 'Costhidden_dim': 242.0,
 'Costnum_layers': 2.0,
 'PPO_epochs': 50.0,
 'alr': 0.0031062145941959385,
 'clr': 0.030659863088941804,
 'costlr': 0.009214063892581122,
 'ent_coeff': 0.052390944007989035,
 'gamma': 0.9600950869204277,
 'val_coeff': 0.7807850584774108}

In [41]:
# best = {'Aact_fn': 'tanh',
#  'Adr': 0.005606730178782146,
#  'Ahidden_dim': 110.0,
#  'Anum_layers': 8.0,
#  'Cact_fn': 'tanh',
#  'Cdr': 0.19701572801150505,
#  'Chidden_dim': 208.0,
#  'Cnum_layers': 2.0,
#  'Costact_fn': 2,
#  'Costdr': 0.20250081379537282,
#  'Costhidden_dim': 8.0,
#  'Costnum_layers': 5.0,
#  'PPO_epochs': 10.0,
#  'alr': 0.0003383195571517458,
#  'clr': 0.01752598526985451,
#  'costlr': 0.015521085366598176,
#  'ent_coeff': 0.07534438311556774,
#  'gamma': 0.952657004986277,
#  'val_coeff': 0.9939667210432088}

In [42]:
agent = PPOagent(env_full_train, best)

In [None]:
env = env_full_train
Episode_rewards = []
Avg_rewards = []
num_episodes = 100

for episode in range(num_episodes):
  Actions = []
  States = []
  Rewards = []
  Log_probs = []
  Values = []

  state = env.reset()
  done = False
  episode_reward = 0

  while not done:
    state_tensor = torch.FloatTensor(state).to(device)


    action, logs_probs, entropy = agent.get_action(state_tensor)
    state_value = agent.critic.forward(state_tensor, action).to(device)
    States.append(state_tensor)
    Values.append(state_value)
    action = action.cpu()
    next_state, reward, done, _ = env.step(action.detach().numpy())

    Actions.append(action)
    Rewards.append(reward)
    Log_probs.append(logs_probs)

    state = next_state
    episode_reward += reward

  actions = torch.cat(Actions).to(device)
  states = torch.cat(States).to(device)
  values = torch.cat(Values).to(device)   #.squeeze(-1)
  log_prob_old = torch.cat(Log_probs).to(device)
  returns = (agent.calculate_returns(Rewards, agent.gamma)).to(device)
  advantages = (returns - values).to(device)
  advantages = ((advantages - advantages.mean()) / (advantages.std() + 1e-5)).to(device)

  agent.update_policy(states, actions, log_prob_old, advantages, returns)

  ## update lambda and rho --
  agent.lambda_ = agent.lambda_ + agent.rho * agent.cost_network.forward(
        state_tensor,
        action
    ).mean().detach().to(device)

  agent.rho= agent.rho * 1.008

  Episode_rewards.append(episode_reward)
  Avg_rewards.append(np.mean(Episode_rewards[-10:]))

  print(f"Episode: {episode+1}, Episode Reward: {episode_reward}")



  x = self.model(torch.tensor(state))


begin_total_asset:1000000
end_total_asset:647872.6343169255
Sharpe:  1.3588472142963606


  next_cost = self.cost_target.forward(torch.tensor(next_states) ,torch.tensor(next_actions))  # c'_wv'(s', a')
  cost_target = self.VaR(torch.tensor(next_states) ,torch.tensor(next_actions) ) + self.gamma  * next_cost


 violations :::  1


  return F.mse_loss(input, target, reduction=self.reduction)
  next_action_tensor = torch.tensor(next_actions, dtype=torch.float32).to(device)


 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
Episode: 1, Episode Reward: [2.3982333e+09]
begin_total_asset:1000000
end_total_asset:571516.4027879458
Sharpe:  1.271902577900852
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
Episode: 2, Episode Reward: [2.2693092e+09]
begin_total_asset:1000000
end_total_asset:603028.4112971505
Sharpe:  1.3078042499374052
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
 violations :::  1
Episode: 3, Episode Reward: [2.2369224e+09]
begin_total_asset:1000000
end_total_asset:647179.8457689072
Sharpe:  1.3423864443613152
 violations :::  1
 violations :::  1
 violations ::: 

In [None]:
%matplotlib inline
plt.plot(Episode_rewards)
plt.plot(Avg_rewards)
plt.plot()
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.show();

## Trading
Assume that we have $1,000,000 initial capital at 2019-01-01. We use the A2C model to trade Dow jones 30 stocks.

In [None]:
e_trade_gym = StockPortfolioEnv(df = trade, **trade_env_kwargs)
test_env, test_obs = e_trade_gym.get_sb_env()

In [None]:
account_memory, actions_memory, rewardd = agent.trade(env_trade, e_trade_gym, None)
print(rewardd)

In [None]:
calculate_sharpe(account_memory[0])

In [None]:
from google.colab import files
account_memory[0].to_csv('df_daily_return_'+ Market +'_' + Reward +'.csv')
files.download('df_daily_return_'+ Market +'_' + Reward +'.csv')

In [None]:
actions_memory[0].head()

In [None]:
actions_memory[0].to_csv('df_actions_Trade_'+ Market +'_' + Reward +'.csv')
files.download('df_actions_Trade_'+ Market +'_' + Reward +'.csv')

In [None]:
df_daily_return = account_memory[0]

In [None]:
e_trade_gym = StockPortfolioEnv(df = trade, **env_kwargs)
test_env, test_obs = e_trade_gym.get_sb_env()

In [None]:
account_memory, actions_memory, rewardd = agent.trade(env_trade, e_trade_gym, None)
print(rewardd)

In [None]:
df_daily_return_T = account_memory[0]

In [None]:
from google.colab import files
df_daily_return_T.to_csv('df_daily_return_WT '+ Market +'_' + Reward +'.csv')
files.download('df_daily_return_WT '+ Market +'_' + Reward +'.csv')

<a id='6'></a>
# Part 7: Backtest Our Strategy
Backtesting plays a key role in evaluating the performance of a trading strategy. Automated backtesting tool is preferred because it reduces the human error. We usually use the Quantopian pyfolio package to backtest our trading strategies. It is easy to use and consists of various individual plots that provide a comprehensive image of the performance of a trading strategy.

<a id='6.1'></a>
## 7.1 BackTestStats
pass in df_account_value, this information is stored in env class
# Nifty 50 -- ^NSEI
# Sensex 30 -- ^BSESN
# Dow 30 -- ^DJI
# DAX 40 -- ^GDAXI
# HSI 30 -- ^HSI
# TIRKEY -- XU100.IS
# Nikeei -- ^N225
# IBEX Spain -- ^IBEX
# Tiwan -- ^TWII
# Nifty 100 -- ^CNX100

In [None]:
%%capture
!pip install numpy==1.26.4

In [None]:
from pyfolio import timeseries
DRL_strat = convert_daily_return_to_pyfolio_ts(df_daily_return_T)
perf_func = perf_stats
perf_stats_all = perf_func( returns=DRL_strat,
                              factor_returns=DRL_strat,
                                positions=None, transactions=None, turnover_denom="AGB")

In [None]:
print("==============DRL Strategy Stats===========")
print(perf_stats_all)

In [None]:
#baseline stats
print("==============Get Baseline Stats===========")
baseline_df = get_baseline(
        ticker= BL,
        start = df_daily_return.loc[0,'date'],
        end = df_daily_return.loc[len(df_daily_return)-1,'date'])

stats = backtest_stats(baseline_df, value_col_name = 'close')
print(stats)

<a id='6.2'></a>
## 7.2 BackTestPlot

In [None]:
pip install empyrical==0.3.4

In [None]:
import pyfolio
%matplotlib inline

baseline_df = get_baseline(
        ticker=BL, start=df_daily_return.loc[0,'date'], end='2025-02-28')

baseline_returns = get_daily_return(baseline_df, value_col_name="close")

# with pyfolio.plotting.plotting_context(font_scale=1.1):
#         pyfolio.create_full_tear_sheet(returns = DRL_strat,
#                                        benchmark_rets=baseline_returns, set_context=False)

In [None]:
DRL_strat

In [None]:
baseline_returns

In [None]:
baseline_returns.to_csv('Baseline_Daily_Return_'+ Market +'.csv')
files.download('Baseline_Daily_Return_'+ Market +'.csv')

## Min-Variance Portfolio Allocation

In [None]:
%pip install PyPortfolioOpt

In [None]:
from pypfopt.efficient_frontier import EfficientFrontier
from pypfopt import risk_models

In [None]:
unique_tic = trade.tic.unique()
unique_trade_date = trade.date.unique()

In [None]:
#calculate_portfolio_minimum_variance
portfolio = pd.DataFrame(index = range(1), columns = unique_trade_date)
initial_capital = 1000000
portfolio.loc[0,unique_trade_date[0]] = initial_capital

for i in range(len( unique_trade_date)-1):
    df_temp = df[df.date==unique_trade_date[i]].reset_index(drop=True)
    df_temp_next = df[df.date==unique_trade_date[i+1]].reset_index(drop=True)
    #Sigma = risk_models.sample_cov(df_temp.return_list[0])
    #calculate covariance matrix
    Sigma = df_temp.return_list[0].cov()
    #portfolio allocation
    ef_min_var = EfficientFrontier(None, Sigma,weight_bounds=(0, 0.1))
    #minimum variance
    raw_weights_min_var = ef_min_var.min_volatility()
    #get weights
    cleaned_weights_min_var = ef_min_var.clean_weights()

    #current capital
    cap = portfolio.iloc[0, i]
    #current cash invested for each stock
    current_cash = [element * cap for element in list(cleaned_weights_min_var.values())]
    # current held shares
    current_shares = list(np.array(current_cash)
                                      / np.array(df_temp.close))
    # next time period price
    next_price = np.array(df_temp_next.close)
    ##next_price * current share to calculate next total account value
    portfolio.iloc[0, i+1] = np.dot(current_shares, next_price)

portfolio=portfolio.T
portfolio.columns = ['account_value']

# Markowitz's with Transection Cost

In [None]:
#calculate_portfolio_minimum_variance
portfolio = pd.DataFrame(index = range(1), columns = unique_trade_date)
initial_capital = 1000000
portfolio.loc[0,unique_trade_date[0]] = initial_capital

# Define transaction cost rate
transaction_cost_rate = 0.005

for i in range(len( unique_trade_date)-1):
    df_temp = df[df.date==unique_trade_date[i]].reset_index(drop=True)
    df_temp_next = df[df.date==unique_trade_date[i+1]].reset_index(drop=True)
    #Sigma = risk_models.sample_cov(df_temp.return_list[0])
    #calculate covariance matrix
    Sigma = df_temp.return_list[0].cov()
    #portfolio allocation
    ef_min_var = EfficientFrontier(None, Sigma,weight_bounds=(0, 0.1))
    #minimum variance
    raw_weights_min_var = ef_min_var.min_volatility()
    #get weights
    cleaned_weights_min_var = ef_min_var.clean_weights()

    #current capital
    cap = portfolio.iloc[0, i]
    #current cash invested for each stock
    current_cash = [element * cap for element in list(cleaned_weights_min_var.values())]
    # current held shares
    current_shares = list(np.array(current_cash)
                                      / np.array(df_temp.close))
    # next time period price
    next_price = np.array(df_temp_next.close)

    # Calculate next portfolio value without transaction cost
    next_value = np.dot(current_shares, next_price)

    # Calculate transaction costs
    new_shares = current_cash / next_price
    share_differences = np.abs(new_shares - current_shares)
    transaction_cost = np.sum(share_differences * next_price * transaction_cost_rate)

    # Deduct transaction cost from portfolio value
    portfolio.iloc[0, i + 1] = next_value - transaction_cost

portfolio=portfolio.T
portfolio.columns = ['account_value']

In [None]:
def calculate_daily_return(current_value, previous_value):
    return (current_value - previous_value) / previous_value

# Calculate daily return and add it as a new column
daily_returns = [0]  # Daily return for the first day is assumed to be 0
for i in range(1, len(portfolio)):
    current_value = portfolio['account_value'][i]
    previous_value = portfolio['account_value'][i - 1]
    daily_returns.append(calculate_daily_return(current_value, previous_value))

portfolio['daily_return'] = daily_returns

print(portfolio)

In [None]:
portfolio.head()

In [None]:
Agent =(df_daily_return_T.daily_return+1).cumprod()-1

In [None]:
min_var_cumpod =(portfolio.account_value.pct_change()+1).cumprod()-1

In [None]:
portfolio.drop(columns=['account_value'], inplace=True)
portfolio.to_csv('Markowitz_Portfolio_Return_'+ Market +'.csv')
files.download('Markowitz_Portfolio_Return_'+ Market +'.csv')

In [None]:
Baseline =(baseline_returns+1).cumprod()-1

## Plotly: DRL, Min-Variance, DJIA

In [None]:
%pip install plotly

In [None]:
from datetime import datetime as dt

import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

In [None]:
time_ind = pd.Series(df_daily_return_T.date)

In [None]:
trace0_portfolio = go.Scatter(x = time_ind, y = Agent, mode = 'lines', name = 'Agent (Portfolio Allocation)')

trace1_portfolio = go.Scatter(x = time_ind, y = Baseline, mode = 'lines', name = 'Baseline')
trace2_portfolio = go.Scatter(x = time_ind, y = min_var_cumpod, mode = 'lines', name = 'Min-Variance')
#trace3_portfolio = go.Scatter(x = time_ind, y = a2c_cumpod_esg, mode = 'lines', name = 'ESG-A2C (Portfolio Allocation)')
#trace3_portfolio = go.Scatter(x = time_ind, y = ddpg_cumpod, mode = 'lines', name = 'DDPG')
#trace4_portfolio = go.Scatter(x = time_ind, y = addpg_cumpod, mode = 'lines', name = 'Adaptive-DDPG')
#trace5_portfolio = go.Scatter(x = time_ind, y = min_cumpod, mode = 'lines', name = 'Min-Variance')

#trace4 = go.Scatter(x = time_ind, y = addpg_cumpod, mode = 'lines', name = 'Adaptive-DDPG')

#trace2 = go.Scatter(x = time_ind, y = portfolio_cost_minv, mode = 'lines', name = 'Min-Variance')
#trace3 = go.Scatter(x = time_ind, y = spx_value, mode = 'lines', name = 'SPX')

In [None]:
fig = go.Figure()
fig.add_trace(trace0_portfolio)

fig.add_trace(trace1_portfolio)

fig.add_trace(trace2_portfolio)

#fig.add_trace(trace3_portfolio)

fig.update_layout(
    legend=dict(
        x=0,
        y=1,
        traceorder="normal",
        font=dict(
            family="sans-serif",
            size=15,
            color="black"
        ),
        bgcolor="White",
        bordercolor="white",
        borderwidth=2

    ),
)
#fig.update_layout(legend_orientation="h")
fig.update_layout(title={
        #'text': "Cumulative Return using FinRL",
        'y':0.85,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
#with Transaction cost
#fig.update_layout(title =  'Quarterly Trade Date')
fig.update_layout(
#    margin=dict(l=20, r=20, t=20, b=20),

    paper_bgcolor='rgba(1,1,0,0)',
    plot_bgcolor='rgba(1, 1, 0, 0)',
    #xaxis_title="Date",
    yaxis_title="Cumulative Return",
xaxis={'type': 'date',
       'tick0': time_ind[0],
        'tickmode': 'linear',
       'dtick': 86400000.0 *80}

)
fig.update_xaxes(showline=True,linecolor='black',showgrid=True, gridwidth=1, gridcolor='LightSteelBlue',mirror=True)
fig.update_yaxes(showline=True,linecolor='black',showgrid=True, gridwidth=1, gridcolor='LightSteelBlue',mirror=True)
fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='LightSteelBlue')

fig.show()

fig.write_image("portfolio_return_plot.png")

# Download the file
files.download("portfolio_return_plot.png")
