# Deep Reinforcement Learning for Stock Trading from Scratch: Portfolio Allocation

Tutorials to use OpenAI DRL to perform portfolio allocation in one Jupyter Notebook | Presented at NeurIPS 2020: Deep RL Workshop

* This blog is based on our paper: FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance, presented at NeurIPS 2020: Deep RL Workshop.
* Check out the Medium blog for detailed explanations: https://towardsdatascience.com/finrl-for-quantitative-finance-tutorial-for-portfolio-allocation-9b417660c7cd
* Please report any issues on our GitHub: https://github.com/AI4Finance-Foundation/FinRL/issues

ESG-VARIABLES-PENALIZING
* **PyTorch Version**

# Content

* [1. Problem Definition](#0)
* [2. Getting Started - Load Python packages](#1)
    * [2.1. Install Packages](#1.1)    
    * [2.2. Check Additional Packages](#1.2)
    * [2.3. Import Packages](#1.3)
    * [2.4. Create Folders](#1.4)
* [3. Download Data](#2)
* [4. Preprocess Data](#3)        
    * [4.1. Technical Indicators](#3.1)
    * [4.2. Perform Feature Engineering](#3.2)
* [5. Build Environment](#4)  
    * [5.1. Training & Trade Data Split](#4.1)
    * [5.2. User-defined Environment](#4.2)   
    * [5.3. Initialize Environment](#4.3)    
* [6. Implement DRL Algorithms](#5)  
* [7. Backtesting Performance](#6)  
    * [7.1. BackTestStats](#6.1)
    * [7.2. BackTestPlot](#6.2)   
    * [7.3. Baseline Stats](#6.3)   
    * [7.4. Compare to Stock Market Index](#6.4)

<a id='0'></a>
# Part 1. Problem Definition

# **Problem Definition: Risk-Constrained Portfolio Optimization Using Deep Reinforcement Learning**  

This problem aims to design an **automated trading solution** for portfolio allocation while ensuring risk constraints are satisfied.  
We model the stock trading process as a **Constrained Markov Decision Process (CMDP)** and formulate our objective as a **constrained maximization problem**:  
- **Objective:** maximize portfolio returns.  
- **Constraint:** keep risk exposure within predefined limits.  

The algorithm is trained using **Deep Reinforcement Learning (DRL)** techniques, integrating **Deep Deterministic Policy Gradient (DDPG) with an Augmented Lagrangian Multiplier (ALM)** to handle risk constraints dynamically.  
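
A common way to write such a constrained problem is sketched below. It is not necessarily the exact formulation used in this notebook: the discount factor \( \gamma \), the risk budget \( d \), and the shorthand \( J_r \), \( J_c \) are assumed symbols that do not appear in the original text. Here \( r_t \) stands for the per-step portfolio gain and \( c_t \) for the per-step risk cost.

\[
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} \, r_t\right]
\quad \text{subject to} \quad
J_c(\pi) = \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} \, c_t\right] \le d
\]

The Lagrangian relaxation replaces the constraint with a penalty weighted by a multiplier \( \lambda \ge 0 \):

\[
\mathcal{L}(\pi, \lambda) = J_r(\pi) - \lambda \left( J_c(\pi) - d \right)
\]

An augmented Lagrangian method additionally penalizes the squared constraint violation, which tends to stabilize the multiplier updates.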

---

## **Reinforcement Learning Environment Components**  

### **1️⃣ Action Space**  
- The **agent selects portfolio weights** for each asset at each time step.  
- Action vector:  
  \[
  \mathbf{a} = [a_1, a_2, \dots, a_N], \quad \text{where } a_i \in (-1,1) \text{ and } \sum_{i=1}^{N} a_i = 1
  \]
- **Intuition:**  
  - \( a_i > 0 \) → **Long** position in stock \( i \).  
  - \( a_i < 0 \) → **Short** position in stock \( i \).  
  - \( a_i = 0 \) → No investment in stock \( i \).  
- Example: *"Allocate 10% of capital to AAPL"* → the AAPL component of the action is 0.1, and the remaining components must sum to 0.9 so that the weights total 1.  
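
The constraints on the action vector can be checked with a few lines of NumPy. This is a minimal sketch; the helper name `describe_action` and the tolerance are illustrative assumptions, not part of FinRL.

```python
import numpy as np


def describe_action(action, tol=1e-8):
    """Validate a weight vector and summarize its long/short exposure.

    Hypothetical helper: checks a_i in (-1, 1) and sum(a) == 1.
    """
    a = np.asarray(action, dtype=float)
    if not np.all((a > -1.0) & (a < 1.0)):
        raise ValueError("each weight must lie strictly between -1 and 1")
    if abs(a.sum() - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    return {
        "long": a[a > 0].sum(),     # total weight of long positions
        "short": -a[a < 0].sum(),   # total weight of short positions (positive number)
        "gross": np.abs(a).sum(),   # gross exposure = long + short
    }


# Two long positions partly financed by one short position.
exposure = describe_action([0.6, 0.7, -0.3])
print(exposure)
```

Because short positions are allowed, the gross exposure can exceed 1 even though the net weights sum to 1.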

---

### **2️⃣ State Space**  
The agent **observes market conditions** before making trading decisions. The state vector contains:  
- **Price-based features**: Open, High, Low, Close, Volume (OHLCV).  
- **Technical indicators**: Moving Averages, RSI, MACD, Bollinger Bands.  
- **Risk metrics**: Portfolio variance, Value-at-Risk (VaR).  
- **Portfolio state**: Previous allocations, returns.  

\[
\mathbf{s} = [\text{OHLCV}, \text{indicators}, \text{risk metrics}, \text{portfolio state}]
\]

*Example:*  
At time \( t \), the state could be:  
\[
s_t = [\text{AAPL close}, \text{GOOGL RSI}, \text{Portfolio return}, \dots]
\]  
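
As a rough illustration of how the four feature groups listed above could be flattened into a single observation, here is a minimal sketch. The function name `build_state`, the array shapes, and the numbers are made up for the example.

```python
import numpy as np


def build_state(ohlcv, indicators, risk_metrics, portfolio_state):
    """Concatenate the feature blocks into one flat observation vector."""
    blocks = [ohlcv, indicators, risk_metrics, portfolio_state]
    return np.concatenate([np.ravel(b) for b in blocks]).astype(np.float32)


# Two assets, one row per asset (illustrative values only).
ohlcv = np.array([[100.0, 102.0, 99.0, 101.0, 1.2e6],
                  [50.0, 51.0, 49.5, 50.5, 8.0e5]])
indicators = np.array([[55.0, 0.3],       # e.g. RSI and MACD for asset 1
                       [47.0, -0.1]])     # e.g. RSI and MACD for asset 2
risk_metrics = np.array([0.0004, 0.021])  # e.g. portfolio variance and VaR
portfolio_state = np.array([0.6, 0.4, 0.002])  # previous weights and last return

state = build_state(ohlcv, indicators, risk_metrics, portfolio_state)
print(state.shape)
```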

---

### **3️⃣ Reward Function**  
The agent receives a reward based on portfolio performance:  
\[
r(s, a, s') = v' - v - \lambda \cdot \text{Risk}
\]
Where:  
- \( v \) and \( v' \) → Portfolio values before and after the action, respectively.  
- \( \lambda \) → Lagrangian multiplier for risk constraint.  
- **Risk term**: VaR, variance, or drawdown penalty.  

This encourages the agent to **maximize return while controlling risk**.  
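
The reward formula translates directly into code. A minimal sketch follows; the function name and the sample numbers are hypothetical, and the risk term is passed in as a plain number because the text leaves the choice of risk measure open.

```python
def risk_adjusted_reward(value_before, value_after, risk, lam):
    """r = v' - v - lambda * Risk, with v = value_before and v' = value_after."""
    return (value_after - value_before) - lam * risk


# Portfolio grows by 4,000 while the risk term is 1,500 and lambda is 2.
reward = risk_adjusted_reward(1_000_000.0, 1_004_000.0, risk=1_500.0, lam=2.0)
print(reward)
```

A larger multiplier makes the same gain less attractive whenever the risk term is positive, which is how the penalty steers the agent toward lower-risk allocations.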

---

### **4️⃣ Cost Function (Risk Constraint)**
To enforce safety, we introduce a **risk-based cost function**:  
\[
c(s, a) = \max( \text{VaR} - \text{Risk Threshold}, 0)
\]
- If **risk exceeds** the threshold, a penalty is applied.  
- Otherwise, the cost is **zero**.  
- The **Augmented Lagrangian Multiplier (ALM)** updates dynamically to enforce this constraint.  
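
The cost function and a multiplier update can be sketched as follows. The text does not say how VaR is estimated or how the multiplier is updated, so both choices here are assumptions: a historical (empirical-quantile) VaR and a projected dual-ascent step. All function names are hypothetical.

```python
import numpy as np


def historical_var(returns, alpha=0.95):
    """Historical VaR at confidence level alpha, reported as a positive loss."""
    return -np.quantile(np.asarray(returns, dtype=float), 1.0 - alpha)


def risk_cost(var, risk_threshold):
    """c(s, a) = max(VaR - Risk Threshold, 0)."""
    return max(var - risk_threshold, 0.0)


def update_multiplier(lam, cost, step_size):
    """Assumed update rule: raise lambda when the constraint is violated,
    and keep it non-negative."""
    return max(0.0, lam + step_size * cost)


recent_returns = [-0.05, -0.02, 0.0, 0.01, 0.03]
var_95 = historical_var(recent_returns)
cost = risk_cost(var_95, risk_threshold=0.03)
lam = update_multiplier(lam=0.0, cost=cost, step_size=10.0)
print(var_95, cost, lam)
```

When the estimated VaR stays below the threshold, the cost is zero and the multiplier is left unchanged by this update.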

---

### **5️⃣ Environment**
The trading environment consists of **S&P 500 stocks** (or another index).  
- **Data Source**: Yahoo Finance API, Alpha Vantage, or Quandl.  
- **Time Frame**: Daily, hourly, or minute-level data.  
- **Stock Pool**: Top 50 stocks based on market cap.  

Example:  
- **Dow 30 Constituents** (AAPL, MSFT, JPM, etc.).  
- Data includes **OHLCV + technical indicators**.  

---

### **Key Contributions of this Model**  
✅ **Risk-aware DRL framework** for portfolio allocation.  
✅ **Handles risk dynamically** using CMDP and ALM.  
✅ **Optimized for long-short portfolio strategies**.  
✅ **Scalable to multiple assets and real-world trading.**  



<a id='1'></a>
# Part 2. Getting Started - Load Python Packages

In [None]:
pip install setuptools==66


In [None]:
!pip install stockstats
!pip install hyperopt
# !pip install pyfolio
import stockstats
from hyperopt import fmin, tpe, hp, Trials, space_eval
# import pyfolio
from collections import deque

In [None]:
"""Contains methods and classes to collect data from
Yahoo Finance API
"""

from __future__ import annotations

import pandas as pd
import yfinance as yf


class YahooDownloader:
    """Provides methods for retrieving daily stock data from
    Yahoo Finance API

    Attributes
    ----------
        start_date : str
            start date of the data (modified from neofinrl_config.py)
        end_date : str
            end date of the data (modified from neofinrl_config.py)
        ticker_list : list
            a list of stock tickers (modified from neofinrl_config.py)

    Methods
    -------
    fetch_data()
        Fetches data from yahoo API

    """

    def __init__(self, start_date: str, end_date: str, ticker_list: list):
        self.start_date = start_date
        self.end_date = end_date
        self.ticker_list = ticker_list

    def fetch_data(self, proxy=None, auto_adjust=False) -> pd.DataFrame:
        """Fetches data from Yahoo API
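        Parameters
        ----------
        proxy : optional
            proxy setting passed through to `yf.download`
        auto_adjust : bool
            passed through to `yf.download`; when False, the separate
            "Adj Close" column is used as the close price

        Returns
        -------
        `pd.DataFrame`
            8 columns: date, open, high, low, close, volume, tic (ticker
            symbol) and day (day of week), for every ticker in `ticker_list`
        """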
        # Download and save the data in a pandas DataFrame:
        data_df = pd.DataFrame()
        num_failures = 0
        for tic in self.ticker_list:
            temp_df = yf.download(
                tic,
                start=self.start_date,
                end=self.end_date,
                proxy=proxy,
                auto_adjust=auto_adjust,
            )
            if temp_df.columns.nlevels != 1:
                temp_df.columns = temp_df.columns.droplevel(1)
            temp_df["tic"] = tic
            if len(temp_df) > 0:
                # data_df = data_df.append(temp_df)
                data_df = pd.concat([data_df, temp_df], axis=0)
            else:
                num_failures += 1
        if num_failures == len(self.ticker_list):
            raise ValueError("no data is fetched.")
        # reset the index, we want to use numbers as index instead of dates
        data_df = data_df.reset_index()
        # convert the column names to standardized names
        data_df = data_df.rename(
            columns={
                "Date": "date",
                "Adj Close": "adjcp",
                "Close": "close",
                "High": "high",
                "Low": "low",
                "Volume": "volume",
                "Open": "open",
            }
        )
        # "Adj Close" is only returned when auto_adjust=False; with
        # auto_adjust=True the "close" column is already adjusted
        if "adjcp" in data_df.columns:
            # use adjusted close price instead of close price
            data_df["close"] = data_df["adjcp"]
            # drop the adjusted close price column
            data_df = data_df.drop(labels="adjcp", axis=1)
        # create day of the week column (monday = 0)
        data_df["day"] = data_df["date"].dt.dayofweek
        # convert date to standard string format, easy to filter
        data_df["date"] = data_df.date.apply(lambda x: x.strftime("%Y-%m-%d"))
        # drop missing data
        data_df = data_df.dropna()
        data_df = data_df.reset_index(drop=True)
        print("Shape of DataFrame: ", data_df.shape)
        # print("Display DataFrame: ", data_df.head())

        data_df = data_df.sort_values(by=["date", "tic"]).reset_index(drop=True)

        return data_df

    def select_equal_rows_stock(self, df):
        df_check = df.tic.value_counts()
        df_check = pd.DataFrame(df_check).reset_index()
        df_check.columns = ["tic", "counts"]
        mean_df = df_check.counts.mean()
        equal_list = list(df.tic.value_counts() >= mean_df)
        names = df.tic.value_counts().index
        select_stocks_list = list(names[equal_list])
        df = df[df.tic.isin(select_stocks_list)]
        return df
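
The filtering rule in `select_equal_rows_stock` (keep only tickers with at least the mean number of rows) is easier to see on a toy frame. This is a standalone sketch that needs only pandas; `keep_well_covered_tickers` and the toy data are hypothetical and not part of FinRL.

```python
import pandas as pd


def keep_well_covered_tickers(df):
    """Keep tickers whose row count is at least the mean row count."""
    counts = df["tic"].value_counts()
    keep = counts[counts >= counts.mean()].index
    return df[df["tic"].isin(keep)]


# Ticker "C" has only one row, so it falls below the mean count of 7/3.
toy = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03", "2020-01-06",
             "2020-01-02", "2020-01-03", "2020-01-06",
             "2020-01-06"],
    "tic": ["A", "A", "A", "B", "B", "B", "C"],
})
filtered = keep_well_covered_tickers(toy)
print(filtered["tic"].unique())
```

Dropping sparsely covered tickers keeps the number of assets constant across dates, which a fixed-size state vector relies on.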

In [None]:
import datetime

import numpy as np
import pandas as pd
from stockstats import StockDataFrame as Sdf

def load_dataset(*, file_name: str) -> pd.DataFrame:
    """
    load csv dataset from path
    :return: (df) pandas dataframe
    """
    # _data = pd.read_csv(f"{config.DATASET_DIR}/{file_name}")
    _data = pd.read_csv(file_name)
    return _data


def data_split(df, start, end, target_date_col="date"):
    """
    split the dataset into training or testing using date
    :param df: (df) pandas dataframe; start (inclusive) and end (exclusive) date bounds
    :return: (df) pandas dataframe
    """
    data = df[(df[target_date_col] >= start) & (df[target_date_col] < end)]
    data = data.sort_values([target_date_col, "tic"], ignore_index=True)
    data.index = data[target_date_col].factorize()[0]
    return data


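def convert_to_datetime(time):
    """Parse a "%Y-%m-%dT%H:%M:%S" string into a datetime."""
    time_fmt = "%Y-%m-%dT%H:%M:%S"
    if isinstance(time, str):
        return datetime.datetime.strptime(time, time_fmt)
    # non-string input (e.g. an existing datetime) is passed through unchanged
    return time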
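
The indexing trick in `data_split` is worth seeing in isolation: after filtering and sorting, `factorize` gives every row of the same date the same integer index, so `df.loc[t]` returns all tickers for trading day `t`. The sketch below repeats those steps inline on hypothetical toy data.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-02", "2020-01-03",
             "2020-01-03", "2020-01-06", "2020-01-06"],
    "tic": ["B", "A", "A", "B", "A", "B"],
    "close": [10.0, 20.0, 21.0, 11.0, 22.0, 12.0],
})

# Same steps as data_split: start is inclusive, end is exclusive.
window = toy[(toy["date"] >= "2020-01-02") & (toy["date"] < "2020-01-06")]
window = window.sort_values(["date", "tic"], ignore_index=True)
window.index = window["date"].factorize()[0]

day0 = window.loc[0]  # every ticker on the first day of the window
print(window)
```

The comparison works on strings because the dates are stored in `YYYY-MM-DD` format, which sorts lexicographically in date order.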

In [None]:
# from __future__ import annotations

# import copy
# import datetime
# from copy import deepcopy

# !pip install empyrical
# import empyrical as ep

# import matplotlib.dates as mdates
# import matplotlib.pyplot as plt
# import numpy as np
# import pandas as pd
# !pip install pyfolio
# import pyfolio
# from pyfolio import timeseries
# import itertools

# # Replacing from pyfolio import timeseries with original codes ##

# def gross_lev(positions):
#     """
#     Calculates the gross leverage of a strategy.

#     Parameters
#     ----------
#     positions : pd.DataFrame
#         Daily net position values.
#          - See full explanation in tears.create_full_tear_sheet.

#     Returns
#     -------
#     pd.Series
#         Gross leverage.
#     """

#     exposure = positions.drop('cash', axis=1).abs().sum(axis=1)
#     return exposure / positions.sum(axis=1)

# def get_txn_vol(transactions):
#     """
#     Extract daily transaction data from set of transaction objects.

#     Parameters
#     ----------
#     transactions : pd.DataFrame
#         Time series containing one row per symbol (and potentially
#         duplicate datetime indices) and columns for amount and
#         price.

#     Returns
#     -------
#     pd.DataFrame
#         Daily transaction volume and number of shares.
#          - See full explanation in tears.create_full_tear_sheet.
#     """

#     txn_norm = transactions.copy()
#     txn_norm.index = txn_norm.index.normalize()
#     amounts = txn_norm.amount.abs()
#     prices = txn_norm.price
#     values = amounts * prices
#     daily_amounts = amounts.groupby(amounts.index).sum()
#     daily_values = values.groupby(values.index).sum()
#     daily_amounts.name = "txn_shares"
#     daily_values.name = "txn_volume"
#     return pd.concat([daily_values, daily_amounts], axis=1)

# def get_turnover(positions, transactions, denominator='AGB'):
#     """
#      - Value of purchases and sales divided
#     by either the actual gross book or the portfolio value
#     for the time step.

#     Parameters
#     ----------
#     positions : pd.DataFrame
#         Contains daily position values including cash.
#         - See full explanation in tears.create_full_tear_sheet
#     transactions : pd.DataFrame
#         Prices and amounts of executed trades. One row per trade.
#         - See full explanation in tears.create_full_tear_sheet
#     denominator : str, optional
#         Either 'AGB' or 'portfolio_value', default AGB.
#         - AGB (Actual gross book) is the gross market
#         value (GMV) of the specific algo being analyzed.
#         Swapping out an entire portfolio of stocks for
#         another will yield 200% turnover, not 100%, since
#         transactions are being made for both sides.
#         - We use average of the previous and the current end-of-period
#         AGB to avoid singularities when trading only into or
#         out of an entire book in one trading period.
#         - portfolio_value is the total value of the algo's
#         positions end-of-period, including cash.

#     Returns
#     -------
#     turnover_rate : pd.Series
#         timeseries of portfolio turnover rates.
#     """

#     txn_vol = get_txn_vol(transactions)
#     traded_value = txn_vol.txn_volume

#     if denominator == 'AGB':
#         # Actual gross book is the same thing as the algo's GMV
#         # We want our denom to be avg(AGB previous, AGB current)
#         AGB = positions.drop('cash', axis=1).abs().sum(axis=1)
#         denom = AGB.rolling(2).mean()

#         # Since the first value of pd.rolling returns NaN, we
#         # set our "day 0" AGB to 0.
#         denom.iloc[0] = AGB.iloc[0] / 2
#     elif denominator == 'portfolio_value':
#         denom = positions.sum(axis=1)
#     else:
#         raise ValueError(
#             "Unexpected value for denominator '{}'. The "
#             "denominator parameter must be either 'AGB'"
#             " or 'portfolio_value'.".format(denominator)
#         )

#     denom.index = denom.index.normalize()
#     turnover = traded_value.div(denom, axis='index')
#     turnover = turnover.fillna(0)
#     return turnover

# SIMPLE_STAT_FUNCS = [
#     ep.annual_return,
#     ep.cum_returns_final,
#     ep.annual_volatility,
#     ep.sharpe_ratio,
#     ep.calmar_ratio,
#     ep.stability_of_timeseries,
#     # ep.max_drawdown,
#     ep.omega_ratio,
#     # ep.sortino_ratio,
#     # stats.skew,
#     # stats.kurtosis,
#     # ep.tail_ratio,
#     # value_at_risk
# ]

# FACTOR_STAT_FUNCS = [
#     # ep.alpha,
#     # ep.beta,
# ]

# STAT_FUNC_NAMES = {
#     'annual_return': 'Annual return',
#     'cum_returns_final': 'Cumulative returns',
#     'annual_volatility': 'Annual volatility',
#     'sharpe_ratio': 'Sharpe ratio',
#     'calmar_ratio': 'Calmar ratio',
#     'stability_of_timeseries': 'Stability',
#     # 'max_drawdown': 'Max drawdown',
#     'omega_ratio': 'Omega ratio',
#     # 'sortino_ratio': 'Sortino ratio',
#     # 'skew': 'Skew',
#     # 'kurtosis': 'Kurtosis',
#     # 'tail_ratio': 'Tail ratio',
#     # 'common_sense_ratio': 'Common sense ratio',
#     # 'value_at_risk': 'Daily value at risk',
#     # 'alpha': 'Alpha',
#     # 'beta': 'Beta',
# }


# def perf_stats(returns, factor_returns=None, positions=None,
#                transactions=None, turnover_denom='AGB'):
#     """
#     Calculates various performance metrics of a strategy, for use in
#     plotting.show_perf_stats.

#     Parameters
#     ----------
#     returns : pd.Series
#         Daily returns of the strategy, noncumulative.
#          - See full explanation in tears.create_full_tear_sheet.
#     factor_returns : pd.Series, optional
#         Daily noncumulative returns of the benchmark factor to which betas are
#         computed. Usually a benchmark such as market returns.
#          - This is in the same style as returns.
#          - If None, do not compute alpha, beta, and information ratio.
#     positions : pd.DataFrame
#         Daily net position values.
#          - See full explanation in tears.create_full_tear_sheet.
#     transactions : pd.DataFrame
#         Prices and amounts of executed trades. One row per trade.
#         - See full explanation in tears.create_full_tear_sheet.
#     turnover_denom : str
#         Either AGB or portfolio_value, default AGB.
#         - See full explanation in txn.get_turnover.

#     Returns
#     -------
#     pd.Series
#         Performance metrics.
#     """

#     stats = pd.Series()
#     for stat_func in SIMPLE_STAT_FUNCS:
#         stats[STAT_FUNC_NAMES[stat_func.__name__]] = stat_func(returns)

#     if positions is not None:
#         stats['Gross leverage'] = gross_lev(positions).mean()
#         if transactions is not None:
#             stats['Daily turnover'] = get_turnover(positions,
#                                                    transactions,
#                                                    turnover_denom).mean()
#     if factor_returns is not None:
#         for stat_func in FACTOR_STAT_FUNCS:
#             res = stat_func(returns, factor_returns)
#             stats[STAT_FUNC_NAMES[stat_func.__name__]] = res

#     return stats
# #######################
# def date2str(dat: datetime.date) -> str:
#     return datetime.date.strftime(dat, "%Y-%m-%d")

# def str2date(dat: str) -> datetime.date:
#     return datetime.datetime.strptime(dat, "%Y-%m-%d").date()

# def get_daily_return(df, value_col_name="account_value"):
#     df = deepcopy(df)
#     df["daily_return"] = df[value_col_name].pct_change(1)
#     df["date"] = pd.to_datetime(df["date"])
#     df.set_index("date", inplace=True, drop=True)
#     df.index = df.index.tz_localize("UTC")
#     return pd.Series(df["daily_return"], index=df.index)


# def convert_daily_return_to_pyfolio_ts(df):
#     strategy_ret = df.copy()
#     strategy_ret["date"] = pd.to_datetime(strategy_ret["date"])
#     strategy_ret.set_index("date", drop=False, inplace=True)
#     strategy_ret.index = strategy_ret.index.tz_localize("UTC")
#     del strategy_ret["date"]
#     return pd.Series(strategy_ret["daily_return"].values, index=strategy_ret.index)


# # def backtest_stats(account_value, value_col_name="account_value"):
# #     dr_test = get_daily_return(account_value, value_col_name=value_col_name)
# #     perf_stats_all = timeseries.perf_stats(
# #         returns=dr_test,
# #         positions=None,
# #         transactions=None,
# #         turnover_denom="AGB",
# #     )
# #     print(perf_stats_all)
# #     return perf_stats_all

# def backtest_stats(account_value, value_col_name="account_value"):
#     dr_test = get_daily_return(account_value, value_col_name=value_col_name)
#     perf_stats_all = perf_stats(
#         returns=dr_test,
#         positions=None,
#         transactions=None,
#         turnover_denom="AGB",
#     )
#     print(perf_stats_all)
#     return perf_stats_all


# # def backtest_plot(
# #     account_value,
# #     baseline_start=TRADE_START_DATE,
# #     baseline_end=TRADE_END_DATE,
# #     baseline_ticker="^DJI",
# #     value_col_name="account_value",
# # ):
# #     df = deepcopy(account_value)
# #     df["date"] = pd.to_datetime(df["date"])
# #     test_returns = get_daily_return(df, value_col_name=value_col_name)

# #     baseline_df = get_baseline(
# #         ticker=baseline_ticker, start=baseline_start, end=baseline_end
# #     )

# #     baseline_df["date"] = pd.to_datetime(baseline_df["date"], format="%Y-%m-%d")
# #     baseline_df = pd.merge(df[["date"]], baseline_df, how="left", on="date")
# #     baseline_df = baseline_df.fillna(method="ffill").fillna(method="bfill")
# #     baseline_returns = get_daily_return(baseline_df, value_col_name="close")

# #     with pyfolio.plotting.plotting_context(font_scale=1.1):
# #         pyfolio.create_full_tear_sheet(
# #             returns=test_returns, benchmark_rets=baseline_returns, set_context=False
# #         )


# def get_baseline(ticker, start, end):
#     return YahooDownloader(
#         start_date=start, end_date=end, ticker_list=[ticker]
#     ).fetch_data()


# def trx_plot(df_trade, df_actions, ticker_list):
#     df_trx = pd.DataFrame(np.array(df_actions["transactions"].to_list()))
#     df_trx.columns = ticker_list
#     df_trx.index = df_actions["date"]
#     df_trx.index.name = ""

#     for i in range(df_trx.shape[1]):
#         df_trx_temp = df_trx.iloc[:, i]
#         df_trx_temp_sign = np.sign(df_trx_temp)
#         buying_signal = df_trx_temp_sign.apply(lambda x: x > 0)
#         selling_signal = df_trx_temp_sign.apply(lambda x: x < 0)

#         tic_plot = df_trade[
#             (df_trade["tic"] == df_trx_temp.name)
#             & (df_trade["date"].isin(df_trx.index))
#         ]["close"]
#         tic_plot.index = df_trx_temp.index

#         plt.figure(figsize=(10, 8))
#         plt.plot(tic_plot, color="g", lw=2.0)
#         plt.plot(
#             tic_plot,
#             "^",
#             markersize=10,
#             color="m",
#             label="buying signal",
#             markevery=buying_signal,
#         )
#         plt.plot(
#             tic_plot,
#             "v",
#             markersize=10,
#             color="k",
#             label="selling signal",
#             markevery=selling_signal,
#         )
#         plt.title(
#             f"{df_trx_temp.name} Num Transactions: {len(buying_signal[buying_signal == True]) + len(selling_signal[selling_signal == True])}"
#         )
#         plt.legend()
#         plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=25))
#         plt.xticks(rotation=45, ha="right")
#         plt.show()


# # 2022-01-15 -> 01/15/2022
# def transfer_date(str_dat):
#     return datetime.datetime.strptime(str_dat, "%Y-%m-%d").date().strftime("%m/%d/%Y")


# def plot_result_from_csv(
#     csv_file: str,
#     column_as_x: str,
#     savefig_filename: str = "fig/result.png",
#     xlabel: str = "Date",
#     ylabel: str = "Result",
#     num_days_xticks: int = 20,
#     xrotation: int = 0,
# ):
#     result = pd.read_csv(csv_file)
#     plot_result(
#         result,
#         column_as_x,
#         savefig_filename,
#         xlabel,
#         ylabel,
#         num_days_xticks,
#         xrotation,
#     )


# # select_start_date: included
# # select_end_date: included
# # if if_need_calc_return is True, the input is account_value and is then converted to return
# # it is better that column_as_x is the first column, and the other columns are strategies
# # xrotation: the rotation of xlabel, may be used in dates. Default=0 (adaptive adjustment)
# def plot_result(
#     result: pd.DataFrame(),
#     column_as_x: str,
#     savefig_filename: str = "fig/result.png",
#     xlabel: str = "Date",
#     ylabel: str = "Result",
#     num_days_xticks: int = 20,
#     xrotation: int = 0,
# ):
#     columns = result.columns
#     columns_strtegy = []
#     for i in range(len(columns)):
#         col = columns[i]
#         if "Unnamed" not in col and col != column_as_x:
#             columns_strtegy.append(col)

#     result.reindex()

#     x = result[column_as_x].values.tolist()
#     plt.rcParams["figure.figsize"] = (15, 6)
#     # plt.figure()

#     fig, ax = plt.subplots()
#     colors = [
#         "black",
#         "red",
#         "green",
#         "blue",
#         "cyan",
#         "magenta",
#         "yellow",
#         "aliceblue",
#         "coral",
#         "darksalmon",
#         "firebrick",
#         "honeydew",
#     ]
#     for i in range(len(columns_strtegy)):
#         col = columns_strtegy[i]
#         ax.plot(
#             x,
#             result[col],
#             color=colors[i],
#             linewidth=1,
#             linestyle="-",
#         )

#     plt.title("", fontsize=20)
#     plt.xlabel(xlabel, fontsize=20)
#     plt.ylabel(ylabel, fontsize=20)

#     plt.legend(labels=columns_strtegy, loc="best", fontsize=16)

#     # set grid
#     plt.grid()

#     plt.xticks(size=22)  # set tick label size
#     plt.yticks(size=22)  # set tick label size

#     # # set the spacing between ticks
#     # plt.xticks(x[::60])

#     # # set a monthly locator
#     # if if_set_x_monthlocator:
#     #     ax.xaxis.set_major_locator(mdates.MonthLocator())  # interval = 1

#     # set the spacing between ticks
#     plt.xticks(x[::num_days_xticks])

#     plt.setp(ax.get_xticklabels(), rotation=xrotation, horizontalalignment="center")

#     # automatically rotate the labels to prevent x-axis label overlap
#     if xrotation == 0:
#         if_overlap = get_if_overlap(fig, ax)

#         if if_overlap == True:
#             plt.gcf().autofmt_xdate(ha="right")  # auto-rotate the date labels

#     plt.tight_layout()  # automatically adjust subplot spacing

#     plt.savefig(savefig_filename)

#     plt.show()


# def get_if_overlap(fig, ax):
#     fig.canvas.draw()
#     # get the bounding boxes of the date labels
#     bboxes = [label.get_window_extent() for label in ax.get_xticklabels()]
#     # compute the distances between adjacent date labels
#     distances = [bboxes[i + 1].x0 - bboxes[i].x1 for i in range(len(bboxes) - 1)]
#     # if any distance is below 0, the labels overlap
#     if any(distance < 0 for distance in distances):
#         if_overlap = True
#     else:
#         if_overlap = False

#     return if_overlap


# def plot_return(
#     result: pd.DataFrame(),
#     column_as_x: str,
#     if_need_calc_return: bool,
#     savefig_filename: str = "fig/result.png",
#     xlabel: str = "Date",
#     ylabel: str = "Return",
#     if_transfer_date: bool = True,
#     select_start_date: str = None,
#     select_end_date: str = None,
#     num_days_xticks: int = 20,
#     xrotation: int = 0,
# ):
#     if select_start_date is None:
#         select_start_date: str = result[column_as_x].iloc[0]
#         select_end_date: str = result[column_as_x].iloc[-1]
#     # calc returns if if_need_calc_return is True, so that result stores returns
#     select_start_date_index = result[column_as_x].tolist().index(select_start_date)
#     columns = result.columns
#     columns_strtegy = []
#     column_as_x_index = None
#     for i in range(len(columns)):
#         col = columns[i]
#         if col == column_as_x:
#             column_as_x_index = i
#         elif "Unnamed" not in col:
#             columns_strtegy.append(col)
#             if if_need_calc_return:
#                 result[col] = result[col] / result[col][select_start_date_index] - 1

#     # select the result between select_start_date and select_end_date
#     # if date is 2020-01-15, transfer it to 01/15/2020
#     num_rows, num_cols = result.shape
#     tmp_result = copy.deepcopy(result)
#     result = pd.DataFrame()
#     if_first_row = True
#     columns = []
#     for i in range(num_rows):
#         if (
#             str2date(select_start_date)
#             <= str2date(tmp_result[column_as_x][i])
#             <= str2date(select_end_date)
#         ):
#             if "-" in tmp_result.iloc[i][column_as_x] and if_transfer_date:
#                 new_date = transfer_date(tmp_result.iloc[i][column_as_x])
#             else:
#                 new_date = tmp_result.iloc[i][column_as_x]
#             tmp_result.iloc[i, column_as_x_index] = new_date
#             # print("tmp_result.iloc[i]: ", tmp_result.iloc[i])
#             # result = result.append(tmp_result.iloc[i])
#             if if_first_row:
#                 columns = tmp_result.iloc[i].index.tolist()
#                 result = pd.DataFrame(columns=columns)
#                 # result = pd.concat([result, tmp_result.iloc[i]], axis=1)
#                 # result = pd.DataFrame(tmp_result.iloc[i])
#                 # result.columns = tmp_result.iloc[i].index.tolist()
#                 if_first_row = False
#             row = pd.DataFrame([tmp_result.iloc[i].tolist()], columns=columns)
#             result = pd.concat([result, row], axis=0)

#     # print final return of each strategy
#     final_return = {}
#     for col in columns_strtegy:
#         final_return[col] = result.iloc[-1][col]
#     print("final return: ", final_return)

#     result.reindex()

#     plot_result(
#         result=result,
#         column_as_x=column_as_x,
#         savefig_filename=savefig_filename,
#         xlabel=xlabel,
#         ylabel=ylabel,
#         num_days_xticks=num_days_xticks,
#         xrotation=xrotation,
#     )


# def plot_return_from_csv(
#     csv_file: str,
#     column_as_x: str,
#     if_need_calc_return: bool,
#     savefig_filename: str = "fig/result.png",
#     xlabel: str = "Date",
#     ylabel: str = "Return",
#     if_transfer_date: bool = True,
#     select_start_date: str = None,
#     select_end_date: str = None,
#     num_days_xticks: int = 20,
#     xrotation: int = 0,
# ):
#     result = pd.read_csv(csv_file)
#     plot_return(
#         result,
#         column_as_x,
#         if_need_calc_return,
#         savefig_filename,
#         xlabel,
#         ylabel,
#         if_transfer_date,
#         select_start_date,
#         select_end_date,
#         num_days_xticks,
#         xrotation,
#     )

In [None]:
import copy
import datetime
import os
from datetime import date
from datetime import timedelta
from typing import List
from typing import Tuple

import numpy as np
import pandas as pd

In [None]:
# ## install finrl library
# !pip install wrds
# !pip install swig
# !pip install 'shimmy>=2.0'
# !pip install git+https://github.com/AI4Finance-Foundation/FinRL.git

<a id='1.1'></a>
## 2.1. Install all the packages through FinRL library




<a id='1.2'></a>
## 2.2. Check whether the additional packages are present; if not, install them.
* Yahoo Finance API
* pandas
* numpy
* matplotlib
* stockstats
* OpenAI gym
* stable-baselines3
* PyTorch
* pyfolio

In [None]:
# #Importing the libraries
# !pip install pandas_market_calendars
# import pandas as pd
# import numpy as np
# import matplotlib
# import matplotlib.pyplot as plt
# matplotlib.use('Agg')
# import datetime
# %matplotlib inline
# from finrl import config
# from finrl import config_tickers
# from finrl.meta.preprocessor.yahoodownloader import YahooDownloader
# from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split
# from finrl.meta.env_stock_trading.env_stocktrading import StockTradingEnv
# from finrl.meta.env_stock_trading.env_stocktrading_np import StockTradingEnv as StockTradingEnv_numpy
# from finrl.agents.stablebaselines3.models import DRLAgent
# # from finrl.agents.rllib.models import DRLAgent as DRLAgent_rllib
# from finrl.meta.data_processor import DataProcessor
# import joblib
# from stable_baselines3.common.logger import configure
# from finrl.plot import backtest_stats, backtest_plot, get_daily_return, get_baseline,convert_daily_return_to_pyfolio_ts
# from finrl.meta.data_processors.processor_yahoofinance import YahooFinanceProcessor
# import ray
# from pprint import pprint

# import sys
# sys.path.append("../FinRL-Library")

# import itertools

In [None]:
# import pandas as pd
# import numpy as np
# import matplotlib
# import matplotlib.pyplot as plt
# matplotlib.use('Agg')
# %matplotlib inline
# import datetime

# from finrl import config
# from finrl import config_tickers
# from finrl.meta.preprocessor.yahoodownloader import YahooDownloader
# from finrl.meta.preprocessor.preprocessors import FeatureEngineer, data_split
# from finrl.meta.env_portfolio_allocation.env_portfolio import StockPortfolioEnv
# from finrl.agents.stablebaselines3.models import DRLAgent
# from finrl.plot import backtest_stats, backtest_plot, get_daily_return, get_baseline,convert_daily_return_to_pyfolio_ts
# from finrl.meta.data_processor import DataProcessor
# from finrl.meta.data_processors.processor_yahoofinance import YahooFinanceProcessor
# import sys
# sys.path.append("../FinRL-Library")

<a id='1.4'></a>
## 2.4. Create Folders


In [None]:
# import os
# if not os.path.exists("./" + config.DATA_SAVE_DIR):
#     os.makedirs("./" + config.DATA_SAVE_DIR)
# if not os.path.exists("./" + config.TRAINED_MODEL_DIR):
#     os.makedirs("./" + config.TRAINED_MODEL_DIR)
# if not os.path.exists("./" + config.TENSORBOARD_LOG_DIR):
#     os.makedirs("./" + config.TENSORBOARD_LOG_DIR)
# if not os.path.exists("./" + config.RESULTS_DIR):
#     os.makedirs("./" + config.RESULTS_DIR)

<a id='2'></a>
# Part 3. Download Data
Yahoo Finance provides stock data, financial news, and financial reports. The data used here is available free of charge.
* FinRL provides the **YahooDownloader** class to fetch data from the Yahoo Finance API.
* Call limit: the public API (without authentication) is limited to 2,000 requests per hour per IP (up to 48,000 requests a day).


In [None]:
Nifty_ticker = ['RELIANCE.NS', 'ASIANPAINT.NS', 'BAJFINANCE.NS', 'HDFCBANK.NS', 'SBIN.NS']
sensex_ticker = ["ASIANPAINT.NS", "AXISBANK.NS", "BAJFINANCE.NS", "BAJAJFINSV.NS", "BHARTIARTL.NS", "HCLTECH.NS", "HDFCBANK.NS",
                 "HINDUNILVR.NS", "ICICIBANK.NS", "INDUSINDBK.NS", "INFY.NS", "ITC.NS", "JSWSTEEL.NS", "KOTAKBANK.NS", "LT.NS",
                 "M&M.NS", "MARUTI.NS", "NESTLEIND.NS", "NTPC.NS", "POWERGRID.NS", "RELIANCE.NS", "SBIN.NS", "SUNPHARMA.NS",
                 "TATAMOTORS.NS", "TATASTEEL.NS", "TCS.NS", "TECHM.NS", "TITAN.NS", "ULTRACEMCO.NS", "WIPRO.NS"]


# BIST Turkey
bist100_top30_tickers = ['AEFES.IS', 'AKBNK.IS', 'ARCLK.IS', 'ASELS.IS', 'BIMAS.IS', 'CCOLA.IS',
       'DOHOL.IS', 'EKGYO.IS', 'ENKAI.IS', 'EREGL.IS', 'FROTO.IS', 'GARAN.IS',
       'GOLTS.IS', 'HALKB.IS', 'ISCTR.IS', 'KCHOL.IS', 'KOZAL.IS', 'KRDMD.IS',
       'PETKM.IS', 'SAHOL.IS', 'SISE.IS', 'TAVHL.IS', 'TCELL.IS', 'THYAO.IS',
       'TKFEN.IS', 'TOASO.IS', 'TTKOM.IS', 'TUPRS.IS', 'ULKER.IS', 'VAKBN.IS',
       'VESTL.IS', 'YKBNK.IS']

# Spain IBEX top 30
ibex35_tickers = ['ACS.MC', 'ACX.MC', 'AMS.MC', 'ANA.MC', 'BBVA.MC', 'BKT.MC', 'CABK.MC',
       'COL.MC', 'ELE.MC', 'ENG.MC', 'FDR.MC', 'FER.MC', 'GRF.MC', 'IBE.MC',
       'IDR.MC', 'ITX.MC', 'MAP.MC', 'MEL.MC', 'MTS.MC', 'NTGY.MC', 'RED.MC',
       'REP.MC', 'ROVI.MC', 'SAB.MC', 'SAN.MC', 'SCYR.MC', 'SLR.MC', 'TEF.MC']

# Tickers for the top 30 stocks on B3 (Brasil Bolsa Balcão)

brazil_tickers = ['ABEV3.SA', 'BBAS3.SA', 'BPAN4.SA', 'BRFS3.SA', 'BRKM5.SA', 'CSNA3.SA',
       'CYRE3.SA', 'ECOR3.SA', 'EGIE3.SA', 'ELET3.SA', 'ELET6.SA', 'EMBR3.SA',
       'EQTL3.SA', 'GGBR4.SA', 'ITUB4.SA', 'JBSS3.SA', 'LREN3.SA',
       'MRFG3.SA', 'PETR3.SA', 'PETR4.SA', 'RADL3.SA', 'RENT3.SA', 'SBSP3.SA',
       'SUZB3.SA', 'UGPA3.SA', 'USIM5.SA', 'VALE3.SA', 'WEGE3.SA', 'YDUQ3.SA']


# Final Tickers Hang Seng (Hong Kong)
hang_seng_symbols = ['0002.HK', '0003.HK', '0012.HK', '0017.HK', '0027.HK', '0101.HK',
       '0241.HK', '0267.HK', '0669.HK', '0762.HK', '0836.HK', '0883.HK',
       '0906.HK', '0939.HK', '0992.HK', '1038.HK', '1044.HK', '1093.HK',
       '1109.HK', '1398.HK', '2020.HK', '2319.HK', '2331.HK', '2382.HK',
       '2628.HK', '2688.HK', '3323.HK', '3328.HK', '3983.HK', '3988.HK']

# Taiwan TWSE market
twse_top30 = ['1216.TW', '1301.TW', '1303.TW', '1519.TW', '1537.TW', '2308.TW',
       '2317.TW', '2330.TW', '2363.TW', '2368.TW', '2382.TW', '2412.TW',
       '2454.TW', '2474.TW', '2504.TW', '2603.TW', '2838.TW', '2880.TW',
       '2881.TW', '2882.TW', '2884.TW', '2886.TW', '2891.TW', '2892.TW',
       '3008.TW', '3045.TW', '3653.TW', '4904.TW', '5880.TW', '6505.TW']
# UK FTSE top 30 stocks with working tickers
FTSE_top30 = ['ABF.L', 'ADM.L', 'AHT.L', 'AV.L', 'BA.L', 'BEZ.L', 'CCL.L', 'CNA.L',
       'DPLM.L', 'ENT.L', 'FRAS.L', 'HSBA.L', 'HWDN.L', 'III.L',
       'IMI.L', 'INF.L', 'MKS.L', 'MRO.L', 'NXT.L', 'PSON.L', 'REL.L', 'RR.L',
       'SBRY.L', 'SKG.L', 'SMDS.L', 'SMIN.L', 'SMT.L', 'SPX.L', 'SSE.L']
# Japanese Nikkei top 30
nikkei_top30_symbols = ['2914.T', '3382.T', '3407.T', '3861.T', '4063.T', '4502.T', '4689.T',
       '4755.T', '5802.T', '6301.T', '6471.T', '6501.T', '6594.T', '6701.T',
       '6758.T', '6920.T', '7011.T', '7203.T', '7267.T', '7735.T', '7974.T',
       '8031.T', '8035.T', '8058.T', '8306.T', '8316.T', '9020.T', '9022.T',
       '9983.T', '9984.T']
# German DAX top 30
dax_30 = ['ADS.DE', 'AIR.DE', 'ALV.DE', 'BAS.DE', 'BEI.DE', 'BMW.DE', 'BNR.DE',
       'BOSS.DE', 'CBK.DE', 'CON.DE', 'DB1.DE', 'DBK.DE', 'DTE.DE', 'DWNI.DE',
       'EOAN.DE', 'EVT.DE', 'FME.DE', 'FNTN.DE', 'FRE.DE', 'HEI.DE', 'HNR1.DE',
       'LIN.DE', 'MRK.DE', 'MTX.DE', 'MUV2.DE', 'SAP.DE', 'SIE.DE', 'SY1.DE',
       'TL0.DE', 'VOW3.DE']
# USA Dow 30
Dow_30 = ['AAPL', 'AMGN', 'AXP', 'BA', 'CAT', 'CRM', 'CSCO', 'CVX', 'DIS', 'GS',
       'HD', 'HON', 'IBM', 'INTC', 'JNJ', 'JPM', 'KO', 'MCD', 'MMM', 'MRK',
       'MSFT', 'NKE', 'PG', 'TRV', 'UNH', 'V', 'VZ', 'WBA', 'WMT']

indices= [sensex_ticker, Dow_30, dax_30, nikkei_top30_symbols, FTSE_top30, twse_top30, hang_seng_symbols, brazil_tickers, ibex35_tickers, bist100_top30_tickers ]


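# Import the downloader (the FinRL import cells above are commented out)
from finrl.meta.preprocessor.yahoodownloader import YahooDownloader

# Download the data into a pandas DataFrame: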
df = YahooDownloader(start_date = '2011-01-01',
                     end_date = '2025-02-28',
                     ticker_list = sensex_ticker).fetch_data()
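
Several of the lists above are labelled "top 30", so before downloading it can help to confirm how many symbols a list holds and whether any symbol is repeated. The helper below is a minimal sketch in plain Python; `summarize_tickers` is a hypothetical name and is not part of FinRL.

```python
from collections import Counter

def summarize_tickers(ticker_list):
    """Return (number of symbols, sorted list of symbols that appear more than once)."""
    counts = Counter(ticker_list)
    duplicates = sorted(sym for sym, n in counts.items() if n > 1)
    return len(ticker_list), duplicates

# Toy input; in the notebook you would pass sensex_ticker, Dow_30, and so on.
example = ["RELIANCE.NS", "SBIN.NS", "TCS.NS", "SBIN.NS"]
n_symbols, dupes = summarize_tickers(example)
print(n_symbols, dupes)
```

A duplicated symbol would otherwise produce repeated `(date, tic)` rows, which matters later when the data is pivoted by ticker.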

In [None]:
df.shape

In [None]:
import pandas as pd
from stockstats import StockDataFrame as Sdf

def add_tech(data, INDICATORS):
  df = data.copy()
  df = df.sort_values(by=["tic", "date"])
  stock = Sdf.retype(df.copy())
  unique_ticker = stock.tic.unique()

  for indicator in INDICATORS:
      indicator_df = pd.DataFrame()
      for i in range(len(unique_ticker)):
          try:
              temp_indicator = stock[stock.tic == unique_ticker[i]][indicator]
              temp_indicator = pd.DataFrame(temp_indicator)
              temp_indicator["tic"] = unique_ticker[i]
              temp_indicator["date"] = df[df.tic == unique_ticker[i]][
                  "date"
              ].to_list()
              # indicator_df = indicator_df.append(
              #     temp_indicator, ignore_index=True
              # )
              indicator_df = pd.concat(
                  [indicator_df, temp_indicator], axis=0, ignore_index=True
              )
          except Exception as e:
              print(e)
      df = df.merge(
          indicator_df[["tic", "date", indicator]], on=["tic", "date"], how="left"
      )

  df = df.sort_values(by=["date", "tic"])

  return df

In [None]:
INDICATORS = ['macd', 'boll_ub', 'boll_lb', 'rsi_30', 'cci_30', 'dx_30', 'close_30_sma', 'close_60_sma']
df = add_tech(df, INDICATORS)
df = df.ffill().bfill()
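
For orientation, `close_30_sma` is a 30-day simple moving average of the close, and `boll_ub` / `boll_lb` are Bollinger bands around a moving average. The sketch below recomputes both on a toy price series with NumPy. It assumes the common Bollinger definition (20-day window, 2 standard deviations) and a population standard deviation; `stockstats` may differ in details such as degrees of freedom, so treat the numbers as illustrative only.

```python
import numpy as np

def rolling_mean(x, window):
    """Simple moving average; the first window-1 entries are NaN."""
    out = np.full(len(x), np.nan)
    csum = np.cumsum(np.insert(x, 0, 0.0))
    out[window - 1:] = (csum[window:] - csum[:-window]) / window
    return out

def bollinger(x, window=20, k=2.0):
    """Assumed definition: SMA +/- k * rolling std (population std here)."""
    mid = rolling_mean(x, window)
    std = np.full(len(x), np.nan)
    for t in range(window - 1, len(x)):
        std[t] = np.std(x[t - window + 1 : t + 1])
    return mid + k * std, mid - k * std

close = np.linspace(100.0, 129.0, 30)   # toy prices rising by 1 per day
sma_30 = rolling_mean(close, 30)
upper, lower = bollinger(close)
```

The leading NaN values are why the notebook calls `ffill().bfill()` after adding the indicators.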

In [None]:
# add covariance matrix as states
df=df.sort_values(['date','tic'],ignore_index=True)
df.index = df.date.factorize()[0]

cov_list = []
return_list = []

# look back is one year
lookback=252
for i in range(lookback,len(df.index.unique())):
  data_lookback = df.loc[i-lookback:i,:]
  price_lookback=data_lookback.pivot_table(index = 'date',columns = 'tic', values = 'close')
  return_lookback = price_lookback.pct_change().dropna()
  return_list.append(return_lookback)

  covs = return_lookback.cov().values
  cov_list.append(covs)


df_cov = pd.DataFrame({'date':df.date.unique()[lookback:],'cov_list':cov_list,'return_list':return_list})
df = df.merge(df_cov, on='date')
df = df.sort_values(['date','tic']).reset_index(drop=True)
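
The loop above can be hard to picture on the full data set. The following sketch reproduces the same idea on a small synthetic price matrix using only NumPy: for each day, take the trailing window of prices, convert it to daily returns, and compute the asset-by-asset covariance. The function name and the toy data are illustrative assumptions. Because `df.loc[i-lookback:i]` is label-based and inclusive, each window in the notebook holds `lookback + 1` price rows and therefore `lookback` returns; the sketch mirrors that.

```python
import numpy as np

def rolling_covariances(prices, lookback):
    """prices: array of shape (n_days, n_assets).
    Returns one covariance matrix per day, starting at index `lookback`."""
    covs = []
    for i in range(lookback, prices.shape[0]):
        window = prices[i - lookback : i + 1]        # lookback + 1 rows, day i included
        returns = window[1:] / window[:-1] - 1.0     # pct_change without the leading NaN
        covs.append(np.cov(returns, rowvar=False))   # sample covariance (ddof=1)
    return covs

rng = np.random.default_rng(0)
toy_prices = 100.0 * np.cumprod(1.0 + 0.01 * rng.standard_normal((40, 3)), axis=0)
covs = rolling_covariances(toy_prices, lookback=10)
print(len(covs), covs[0].shape)
```

The first `lookback` days have no covariance matrix, which is why the merged `df` in the notebook starts one year after the download start date.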



In [None]:
df['cov_list'].head(3)


In [None]:
df.head(50)


In [None]:
df['return_list'].values[0]


In [None]:
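print(df.shape)

# Per-stock historical volatility: the standard deviation of each stock's
# returns over the lookback window. Every row of a given date carries the same
# 'return_list', so compute it once per date; the environment indexes hist_vol
# by day, which requires exactly one row per trading day.
returns_by_date = df.drop_duplicates(subset='date').set_index('date')['return_list']
hist_vol = [window_returns.std() for window_returns in returns_by_date]
print(len(hist_vol))



In [None]:
# One row per date, one column per stock
hist_vol = np.array(hist_vol)
# print(hist_vol.shape)
# print(hist_vol)
hist_vol = pd.DataFrame(hist_vol, index=returns_by_date.index)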
# print(hist_vol.shape)
# print(df)
# df.to_csv('sensex_data.csv')
# hist_vol.to_csv('sensex_hist_vol.csv')
# from google.colab import files
# files.download('sensex_data.csv')
# files.download('sensex_hist_vol.csv')

In [None]:
# df= pd.read_csv("sensex_data.csv")
# hist_vol= pd.read_csv("sensex_hist_vol.csv")

<a id='4'></a>  
# **Part 5. Design Environment**  

We model portfolio optimization as a **Constrained Markov Decision Process (CMDP)** that aims to **maximize returns while keeping risk within a limit**. The **state** combines the return covariance matrix, the technical indicators, and historical volatility as the risk metric. The **action** is the portfolio allocation, a vector of stock weights subject to the constraints. The **reward** measures the gain in portfolio value, while a **cost function** penalizes excessive risk. The agent is trained with **Deep Deterministic Policy Gradient (DDPG)** combined with an **Augmented Lagrangian Multiplier (ALM)**:  
✅ **Actor** optimizes allocations.  
✅ **Critic** evaluates returns.  
✅ **Cost network** estimates risk.  
✅ **Lagrangian multiplier** enforces constraints.  
This ensures **risk-aware, reinforcement learning-based trading**, scalable to real markets.  
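
To make the role of the multiplier concrete, here is a minimal numeric sketch of an augmented Lagrangian for a single inequality constraint `cost <= limit`. It uses one common simplified form: a penalty `lam * g + 0.5 * rho * g**2` on the violation `g = max(0, cost - limit)`, and the dual update `lam <- max(0, lam + rho * (cost - limit))`. The names `rho`, `limit`, and both functions are assumptions for illustration; the agent defined later in the notebook may parameterize this differently.

```python
def alm_objective(expected_return, cost, limit, lam, rho):
    """Penalized objective to maximize: return minus multiplier and quadratic penalty terms."""
    violation = max(0.0, cost - limit)
    return expected_return - lam * violation - 0.5 * rho * violation ** 2

def update_multiplier(lam, cost, limit, rho):
    """Dual ascent step: the multiplier grows while the constraint is violated
    and is never allowed to go negative."""
    return max(0.0, lam + rho * (cost - limit))

lam, rho, limit = 0.0, 10.0, 0.02
for cost in [0.05, 0.04, 0.03, 0.01]:   # toy sequence of estimated risk costs
    lam = update_multiplier(lam, cost, limit, rho)
```

While the cost exceeds the limit the multiplier keeps rising, so the actor is pushed harder toward low-risk allocations; once the cost drops below the limit the multiplier starts to relax.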


In [None]:
# %%capture
!pip install shimmy
!pip install stable_baselines3
!pip install gym

In [None]:
import numpy as np
import pandas as pd
from gym.utils import seeding
import gym
from gym import spaces
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from stable_baselines3.common.vec_env import DummyVecEnv

## Training data split: 2011-01-01 to 2021-12-31

In [None]:
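# data_split is needed below (the FinRL import cells above are commented out)
from finrl.meta.preprocessor.preprocessors import data_split

TRAIN_START_DATE = '2011-01-01'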
TRAIN_END_DATE = '2021-12-31'

# TRAIN_END_DATE = '2012-12-01'

Val_START_DATE = '2022-01-01'
VAL_END_DATE =  '2022-12-31'
TRADE_START_DATE = '2023-01-01'
TRADE_END_DATE = '2025-02-28'
# print(df[30:])
# hist_vol = hist_vol.reset_index(drop=True)

train = data_split(df, TRAIN_START_DATE,TRAIN_END_DATE)
hist_vol_train = hist_vol[TRAIN_START_DATE : TRAIN_END_DATE]

val = data_split(df, Val_START_DATE, VAL_END_DATE)
hist_vol_val=hist_vol[Val_START_DATE :VAL_END_DATE]

full_train = data_split(df, TRAIN_START_DATE, VAL_END_DATE)
hist_vol_full_train= hist_vol[TRAIN_START_DATE :VAL_END_DATE]


# full_train = data_split(df, TRAIN_START_DATE,TRAIN_END_DATE)
# hist_vol_full_train= hist_vol[TRAIN_START_DATE :TRAIN_END_DATE]

trade = data_split(df, TRADE_START_DATE,TRADE_END_DATE)
hist_vol_trade= hist_vol[TRADE_START_DATE  : TRADE_END_DATE]

print(full_train.shape)

Here is the definition of the environment.

In [None]:
class StockPortfolioEnv(gym.Env):
    """A portfolio allocation environment for OpenAI gym

    Attributes
    ----------
        df: DataFrame
            input data, indexed by day number, with 'cov_list' and indicator columns
        stock_dim : int
            number of unique stocks
        hmax : int
            maximum number of shares to trade (stored, not used by this environment)
        initial_amount : int
            start money
        transaction_cost_pct: float
            transaction cost percentage per trade (stored; step() uses a fixed rate)
        reward_scaling: float
            scaling factor for reward (stored, not applied in step())
        state_space: int
            number of stocks; sets the width of the observation matrix
        action_space: int
            equals stock dimension
        tech_indicator_list: list
            a list of technical indicator names
        turbulence_threshold: float
            a threshold to control risk aversion (stored, not used in step())
        lookback: int
            number of trading days in the covariance window
        day: int
            an increment number to control date
        hist_vol: DataFrame
            historical volatility, one row per trading day and one column per stock

    Methods
    -------
    step()
        at each step the agent returns actions (raw portfolio weights); the
        environment normalizes them, computes the reward, and returns the
        next observation.
    reset()
        reset the environment
    render()
        return the current state
    softmax_normalization()
        map raw actions to weights that sum to 1
    apply_dirichlet_noise()
        blend actions with Dirichlet noise for exploration
    save_asset_memory()
        return the daily portfolio return at each time step
    save_action_memory()
        return actions/positions at each time step
    get_sb_env()
        wrap the environment in a DummyVecEnv
    calculate_DSR()
        differential Sharpe ratio of a new return


    """
    metadata = {'render.modes': ['human']}

    def __init__(self,
                df,
                stock_dim,
                hmax,
                initial_amount,
                transaction_cost_pct,
                reward_scaling,
                state_space,
                action_space,
                tech_indicator_list,
                turbulence_threshold=None,
                lookback=252,
                day = 0, hist_vol= None):

        self.day = day
        self.lookback=lookback
        self.df = df
        self.stock_dim = stock_dim
        self.hmax = hmax
        self.initial_amount = initial_amount
        self.transaction_cost_pct =transaction_cost_pct
        self.reward_scaling = reward_scaling
        self.state_space = state_space
        self.action_space = action_space
        self.tech_indicator_list = tech_indicator_list
        self.hist_vol=hist_vol
        self.DSR_A = 0.0
        self.DSR_B = 0.0

         # action_space normalization and shape is self.stock_dim
        self.action_space = spaces.Box(low = 0, high = 1,shape = (self.action_space,))

        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape = (self.state_space+1 + len(self.tech_indicator_list), self.state_space))



        self.data = self.df.loc[self.day,:]
        self.covs = self.data['cov_list'].values[0]




        self.state = np.append(np.array(self.covs), [self.data[tech].values.tolist() for tech in self.tech_indicator_list ], axis=0)
        # print(" state  :: " , self.day ,self.state.shape, self.state)
        # print(" hist_ vol  :: " , self.day , type(self.hist_vol), self.hist_vol)

        hist_volll = self.hist_vol.values[self.day,:]
        # Concatenate along axis=0

        self.state = np.concatenate([self.state, hist_volll.reshape(1,-1) ], axis=0)



        # print("states - " , self.state.shape)

        self.terminal = False
        self.turbulence_threshold = turbulence_threshold
        # initalize state: inital portfolio return + individual stock return + individual weights
        self.portfolio_value = self.initial_amount

        # memorize portfolio value each step
        self.asset_memory = [self.initial_amount]
        # memorize portfolio return each step
        self.portfolio_return_memory = [0]
        self.actions_memory=[[1/self.stock_dim]*self.stock_dim]
        self.date_memory=[self.data.date.unique()[0]]



    def step(self, actions):
      self.terminal = self.day >= len(self.df.index.unique()) - 1

      if self.terminal:
          # print("=================================")
          # print("begin_total_asset:{}".format(self.asset_memory[0]))
          # print("end_total_asset:{}".format(self.portfolio_value))
          # return self.state, self.reward, self.terminal, {}


          df = pd.DataFrame(self.portfolio_return_memory)
          df.columns = ['daily_return']
          # plt.plot(df.daily_return.cumsum(),'r')
          # plt.savefig('results/cumulative_reward.png')
          # plt.close()

          # plt.plot(self.portfolio_return_memory,'r')
          # plt.savefig('results/rewards.png')
          # plt.close()

          print("=================================")
          print("begin_total_asset:{}".format(self.asset_memory[0]))
          print("end_total_asset:{}".format(self.portfolio_value))

          df_daily_return = pd.DataFrame(self.portfolio_return_memory)
          df_daily_return.columns = ['daily_return']
          if df_daily_return['daily_return'].std() !=0:
            sharpe = (252**0.5)*df_daily_return['daily_return'].mean()/ \
                    df_daily_return['daily_return'].std()
            print("Sharpe: ",sharpe)
          print("=================================")


          return self.state, self.reward, self.terminal,{}
      else:
          last_day_memory = self.data
          weights = self.softmax_normalization(actions)  # Ensure valid portfolio weights
          self.actions_memory.append(weights)

          # Load next state
          self.day = self.day+ 1
          self.data = self.df.loc[self.day, :]
          self.covs = self.data['cov_list'].values[0]
          self.state = np.append(np.array(self.covs), [self.data[tech].values.tolist() for tech in self.tech_indicator_list ], axis=0)
          hist_voll= self.hist_vol.values[self.day,:]
          self.state = np.concatenate([self.state, hist_voll.reshape(1,-1) ], axis=0)

          # Portfolio Value Update
          portfolio_return = sum(((self.data.close.values / last_day_memory.close.values) - 1) * weights)
          new_portfolio_value = self.portfolio_value * (1 + portfolio_return)

          # Calculate Transaction Fee
          phi = 0.0025  # 0.25% transaction cost (hard-coded; self.transaction_cost_pct is not used here)
          # Reshape portfolio_value to match dimensions of other arrays
          portfolio_value_reshaped = np.repeat(self.portfolio_value, len(weights))
          transaction_fee = phi * sum(
              abs(weights * new_portfolio_value * last_day_memory.close.values / self.data.close.values
                  - self.actions_memory[-2] * portfolio_value_reshaped)  # Use portfolio_value_reshaped
          )

          # Reward Calculation
          self.reward = (new_portfolio_value - self.portfolio_value) - transaction_fee  # r_t = u_t - u_t-1 - fee_t

          # Update portfolio value
          self.portfolio_value = new_portfolio_value

          # Save to memory
          self.portfolio_return_memory.append(portfolio_return)
          self.asset_memory.append(new_portfolio_value)
          self.date_memory.append(self.data.date.unique()[0])

          return self.state, self.reward, self.terminal, {}
    ##############################################




    def reset(self):
        self.asset_memory = [self.initial_amount]
        self.day = 0

        # returns = self.df['return_list'].values[0]
        # hist_vol = returns.rolling(window=30).std()
        # hist_vol.fillna(0, inplace=True)
        # hist_vol = hist_vol.iloc[self.day,:]


        self.data = self.df.loc[self.day,:]
        # load states
        self.covs = self.data['cov_list'].values[0]
        self.state =  np.append(np.array(self.covs), [self.data[tech].values.tolist() for tech in self.tech_indicator_list ], axis=0)
        # print(self.hist_vol)
        # self.hist_vol= self.hist_vol[self.day,]
        # Concatenate along axis=0

        hist_voll= self.hist_vol.values[self.day,:]
        self.state = np.concatenate([self.state, hist_voll.reshape(1,-1)], axis=0)
        # Concatenate along axis=0





        # print(" reset -- ev  --state -", self.state.shape)
        # print(" reset -- ev -- state - ", self.state)
        # print(" reset -- ev-- cov - ", self.state[:30, :].shape)
        # print(" reset -- ev-- his vol- ", self.state[:-1, :].shape)
        # print(" reset -- ev-- his vol- ", self.state[-1:, :])
        self.portfolio_value = self.initial_amount
        #self.cost = 0
        #self.trades = 0
        self.DSR_A = 0.0
        self.DSR_B = 0.0
        self.terminal = False
        self.portfolio_return_memory = [0]
        self.actions_memory=[[1/self.stock_dim]*self.stock_dim]
        self.date_memory=[self.data.date.unique()[0]]
        return self.state

    def render(self, mode='human'):
        return self.state

    def softmax_normalization(self, actions):
        numerator = np.exp(actions)
        denominator = np.sum(np.exp(actions))
        softmax_output = numerator/denominator
        return softmax_output


    def apply_dirichlet_noise(self, actions, alpha=0.1):
      """
      Apply Dirichlet noise to actions to encourage exploration.

      Args:
      - actions (np.array): Original action values from the RL model.
      - alpha (float): Dirichlet concentration parameter. Lower values = more noise.

      Returns:
      - np.array: Modified action values with noise, ensuring sum = 1.
      """
      noise = np.random.dirichlet([alpha] * len(actions))  # Sample from Dirichlet distribution
      noisy_actions = 0.75 * actions + 0.25 * noise  # Blend original actions with noise
      return noisy_actions / noisy_actions.sum()  # Normalize to ensure sum = 1




    def save_asset_memory(self):
        date_list = self.date_memory
        portfolio_return = self.portfolio_return_memory
        #print(len(date_list))
        #print(len(asset_list))
        df_account_value = pd.DataFrame({'date':date_list,'daily_return':portfolio_return})
        return df_account_value

    def save_action_memory(self):
        # date and close price length must match actions length
        date_list = self.date_memory
        df_date = pd.DataFrame(date_list)
        df_date.columns = ['date']

        action_list = self.actions_memory
        df_actions = pd.DataFrame(action_list)
        df_actions.columns = self.data.tic.values
        df_actions.index = df_date.date
        #df_actions = pd.DataFrame({'date':date_list,'actions':action_list})
        return df_actions

    def _seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def get_sb_env(self):
        e = DummyVecEnv([lambda: self])
        obs = e.reset()
        return e, obs

    def calculate_DSR(self, R):
      eta = 0.004
      delta_A = R - self.DSR_A
      delta_B = R**2 - self.DSR_B
      Dt = (self.DSR_B*delta_A - 0.5*self.DSR_A*delta_B) / ((self.DSR_B-self.DSR_A**2)**(3/2) + 1e-6)
      self.DSR_A = self.DSR_A + eta*delta_A
      self.DSR_B = self.DSR_B + eta*delta_B
      return(Dt)
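
`calculate_DSR` follows the differential Sharpe ratio recursion: `DSR_A` and `DSR_B` are exponential moving averages of the return and the squared return with rate `eta`, and `Dt` measures how the latest return changes the running Sharpe ratio. The `step()` method shown above does not call it. The standalone sketch below restates the recursion outside the class so it can be run on a toy return series. The class name `DifferentialSharpe` and the clamp on the variance term are additions for illustration (the clamp guards against a tiny negative value from floating-point error); neither appears in the environment.

```python
class DifferentialSharpe:
    def __init__(self, eta=0.004):
        self.eta = eta
        self.A = 0.0   # EMA of returns
        self.B = 0.0   # EMA of squared returns

    def update(self, R):
        delta_A = R - self.A
        delta_B = R ** 2 - self.B
        variance = max(self.B - self.A ** 2, 0.0)   # clamp: illustrative safeguard
        Dt = (self.B * delta_A - 0.5 * self.A * delta_B) / (variance ** 1.5 + 1e-6)
        self.A += self.eta * delta_A
        self.B += self.eta * delta_B
        return Dt

dsr = DifferentialSharpe()
values = [dsr.update(r) for r in [0.01, -0.005, 0.02]]
```

The first value is always zero because both averages start at zero, so the signal only becomes informative after a warm-up period.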

In [None]:
stock_dimension = len(train.tic.unique())

state_space = stock_dimension
print(f"Stock Dimension: {stock_dimension}, State Space: {state_space}")
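
Here `state_space` is the number of stocks N, not the full observation size. The environment stacks an N x N covariance matrix, one row per technical indicator, and one row of historical volatility, so each observation has shape `(N + len(INDICATORS) + 1, N)`. A toy NumPy sketch of that stacking, with made-up sizes:

```python
import numpy as np

n_stocks, n_indicators = 4, 3                     # toy sizes, not the notebook's
cov = np.eye(n_stocks)                            # stands in for cov_list
indicators = np.zeros((n_indicators, n_stocks))   # one row per indicator
hist_vol_row = np.full((1, n_stocks), 0.02)       # one row of volatilities

state = np.concatenate([cov, indicators, hist_vol_row], axis=0)
print(state.shape)        # (8, 4)
flat_dim = state.size     # number of features after flattening
```

The flattened size is what a fully connected network sees as its input dimension once the observation passes through a flatten step.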

In [None]:
# print(INDICATORS)
TURBULENCE_THRESHOLD= 0.0020

# Settings shared by all four environments; only hist_vol differs per split
base_env_kwargs = {
    "hmax": 100,
    "initial_amount": 1000000,
    "transaction_cost_pct": 0.001,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": stock_dimension,
    "reward_scaling": 1e-4,
    "turbulence_threshold": TURBULENCE_THRESHOLD,
}

env_kwargs_train = {**base_env_kwargs, "hist_vol": hist_vol_train}
env_kwargs_val = {**base_env_kwargs, "hist_vol": hist_vol_val}
env_kwargs_full = {**base_env_kwargs, "hist_vol": hist_vol_full_train}
env_kwargs_trade = {**base_env_kwargs, "hist_vol": hist_vol_trade}



In [None]:
e_train_gym = StockPortfolioEnv(df = train, **env_kwargs_train)
env_train, _ = e_train_gym.get_sb_env()

e_val_gym = StockPortfolioEnv(df = val, **env_kwargs_val)
env_val, _ = e_val_gym.get_sb_env()

e_train_full_gym = StockPortfolioEnv(df = full_train, **env_kwargs_full)
env_full_train, _ = e_train_full_gym.get_sb_env()

e_trade_gym = StockPortfolioEnv(df = trade, **env_kwargs_trade)
env_trade, _ = e_trade_gym.get_sb_env()
print("done")
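
To see what one call to `step()` does numerically, the sketch below traces the weight normalization and the portfolio-value update on toy numbers. It mirrors the softmax and return lines of the environment but leaves out the transaction fee; all values are made up.

```python
import numpy as np

def softmax(x):
    """Softmax with max-subtraction for numerical stability
    (mathematically the same as exp(x) / sum(exp(x)))."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

actions = np.array([0.2, 0.2, 0.2])           # equal raw actions
weights = softmax(actions)                    # equal weights of 1/3 each
prev_close = np.array([100.0, 50.0, 20.0])
next_close = np.array([110.0, 50.0, 19.0])    # +10%, 0%, -5%

portfolio_return = np.sum((next_close / prev_close - 1.0) * weights)
new_value = 1_000_000 * (1.0 + portfolio_return)
```

One consequence of the softmax step: if raw actions stay inside the declared `[0, 1]` box, the largest weight can exceed the smallest by at most a factor of e, so the agent cannot concentrate the portfolio in a single stock.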

<a id='5'></a>
# Part 6: Implement DRL Algorithms
* DDPG with ALM (Augmented Lagrangian Multiplier)


In [None]:
import random
from collections import deque

class Memory:
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def push(self, state, action, reward, next_state, done):
        experience = (state, action, np.array([reward]), next_state, done)

        self.buffer.append(experience)

    def sample(self, batch_size):
        state_batch = []
        action_batch = []
        reward_batch = []
        next_state_batch = []
        done_batch = []

        batch = random.sample(self.buffer, batch_size)

        for experience in batch:
            state, action, reward, next_state, done = experience
            state_batch.append(state)
            action_batch.append(action)
            reward_batch.append(reward)
            next_state_batch.append(next_state)
            done_batch.append(done)

        state_batch = np.array(state_batch)
        action_batch = np.array(action_batch)
        reward_batch = np.array(reward_batch)
        next_state_batch = np.array(next_state_batch)

        return state_batch, action_batch, reward_batch, next_state_batch, done_batch

    def __len__(self):
        return len(self.buffer)
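
A brief usage sketch of the same idea, written as a self-contained toy: a bounded `deque` drops the oldest transitions once `max_size` is reached, and `random.sample` draws a batch without replacement. `TinyBuffer` is a hypothetical stand-in so the snippet runs on its own; it is not the notebook's `Memory` class.

```python
import random
from collections import deque

class TinyBuffer:
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # random.sample raises ValueError if batch_size > len(buffer)
        return random.sample(list(self.buffer), batch_size)

buf = TinyBuffer(max_size=3)
for t in range(5):
    buf.push((f"s{t}", t))          # (state, reward) toy transitions

oldest_kept = buf.buffer[0]         # ("s2", 2): s0 and s1 were evicted
batch = buf.sample(2)
```

Because sampling fails when the buffer holds fewer items than the batch size, a training loop would normally check `len(memory) >= batch_size` before each update.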

* Exploration uses Dirichlet noise instead of the Gaussian or Ornstein-Uhlenbeck noise typically used with DDPG.

In [None]:

import numpy as np
import torch  # Noise() below calls torch.clamp


def Noise(action, action_space, kappa=10):
    """
    Apply Dirichlet noise for exploration in DDPG according to the paper.

    Args:
    - action (torch.Tensor): Original action values from the actor network.
    - action_space (gym.spaces.Box): Action space defining valid ranges.
    - kappa (float): Controls exploration variance. Higher kappa = less noise.

    Returns:
    - np.array: Modified action values with Dirichlet noise, ensuring sum = 1.
    """

    try:
        # Ensure actions are non-negative before applying Dirichlet noise
        action = torch.clamp(action, min=0.0)

        # Convert actions to numpy array for Dirichlet sampling
        action_np = action.detach().cpu().numpy()

        # Compute shape parameter: υ = κ * a
        upsilon = kappa * action_np

        # Ensure upsilon is positive and correctly shaped
        upsilon = np.maximum(upsilon, 1e-6)  # Prevent zero or negative values
        upsilon = upsilon.flatten()  # Ensure it's a 1D array

        # Debugging: Check upsilon values
        if np.any(upsilon <= 0):
            raise ValueError(f"Dirichlet parameters must be positive. Found: {upsilon}")

        # Sample ϵ from Dirichlet distribution
        epsilon = np.random.dirichlet(upsilon)

        # Final action: a' = a + sg(ϵ - a). Numerically this equals ϵ; the
        # stop-gradient only matters inside autograd, and this function returns
        # a NumPy array, so the sample is used directly in the action's shape.
        noisy_action = epsilon.reshape(action_np.shape)

        # Clip extreme values to prevent instability
        noisy_action = np.clip(noisy_action, 0.0, 1.0)

        # Ensure sum = 1 for valid portfolio allocation
        noisy_action = noisy_action / noisy_action.sum()

        return noisy_action

    except ValueError as ve:
        print(f"ValueError in Dirichlet noise function: {ve}")
    except Exception as e:
        print(f"Unexpected error in Dirichlet noise function: {e}")

    # Return the original action if an error occurs
    return action.detach().cpu().numpy()
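
The effect of `kappa` is easiest to see empirically. With concentration `kappa * a` and weights `a` that sum to 1, Dirichlet samples have mean `a` and per-component variance `a_i * (1 - a_i) / (kappa + 1)`, so a larger `kappa` keeps samples closer to the actor's output. The sketch uses NumPy's `Generator.dirichlet` with toy weights.

```python
import numpy as np

rng = np.random.default_rng(42)
a = np.array([0.5, 0.3, 0.2])   # toy actor output, sums to 1

def sample_spread(kappa, n=5000):
    """Draw n Dirichlet samples with concentration kappa * a;
    return their per-component mean and variance."""
    samples = rng.dirichlet(kappa * a, size=n)
    return samples.mean(axis=0), samples.var(axis=0)

mean_lo, var_lo = sample_spread(kappa=1.0)
mean_hi, var_hi = sample_spread(kappa=100.0)
```

Every sample already lies on the simplex (non-negative, summing to 1), which is why Dirichlet noise suits portfolio weights better than additive noise that needs clipping afterwards.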




In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim, num_layers, act_fn, dr):
        super(Actor, self).__init__()

        layers = []

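        # Map the activation name to its module; fail early on an unknown name
        activations = {'relu': nn.ReLU, 'tanh': nn.Tanh, 'sigmoid': nn.Sigmoid}
        if act_fn not in activations:
            raise ValueError(f"Unsupported act_fn: {act_fn!r}")
        activation_fn = activations[act_fn]()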
        # print("Params Dictionary:", self.params)
        hidden_dim = int(hidden_dim)
        num_layers = int(num_layers)
        action_dim = int(action_dim)
        state_dim = int(state_dim)

        # print("state_dim:", state_dim)
        # print("action_dim:", action_dim)
        # print("hidden_dim:", hidden_dim)
        # print("num_layers:", num_layers)
        # print("act_fn:", act_fn)
        # print("dr:", dr)
        # print(f"state_dim: {state_dim}, type: {type(state_dim)}")
        # print(f"action_dim: {action_dim}, type: {type(action_dim)}")
        # print(f"hidden_dim: {hidden_dim}, type: {type(hidden_dim)}")

        # Add input layer

        layers.append(nn.Flatten())
        layers.append(nn.Linear(state_dim, hidden_dim))
        layers.append(activation_fn)
        layers.append(nn.Dropout(p=dr))

        # Add hidden layers
        for _ in range(num_layers - 2):  # -2 accounts for the input layer above and the output layer below
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(activation_fn)
            layers.append(nn.Dropout(p=dr))

        # Add output layer
        layers.append(nn.Linear(hidden_dim, action_dim))
        # layers.append(nn.Dropout(p=dr))

        # Create the sequential model
        self.model = nn.Sequential(*layers)

    def forward(self, state):

        x = self.model(state)
        x = torch.tanh(x)
        # print(" actor  Network forward (((((((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))))))")
        return x


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim, num_layers, act_fn, dr):
        super(Critic, self).__init__()

        layers = []

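        # Map the activation name to its module; fail early on an unknown name
        activations = {'relu': nn.ReLU, 'tanh': nn.Tanh, 'sigmoid': nn.Sigmoid}
        if act_fn not in activations:
            raise ValueError(f"Unsupported act_fn: {act_fn!r}")
        activation_fn = activations[act_fn]()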
        hidden_dim = int(hidden_dim)
        num_layers = int(num_layers)
        action_dim = int(action_dim)
        state_dim = int(state_dim)

        # print("state_dim:", state_dim)
        # print("action_dim:", action_dim)
        # print("hidden_dim:", hidden_dim)
        # print("num_layers:", num_layers)
        # print("act_fn:", act_fn)
        # print("dr:", dr)
        # print(f"state_dim: {state_dim}, type: {type(state_dim)}")
        # print(f"action_dim: {action_dim}, type: {type(action_dim)}")
        # print(f"hidden_dim: {hidden_dim}, type: {type(hidden_dim)}")


        # Add input layer
        layers.append(nn.Linear(state_dim + action_dim, hidden_dim))
        layers.append(activation_fn)
        layers.append(nn.Dropout(p=dr))

        # Add hidden layers
        for _ in range(num_layers - 2):  # -2 accounts for the input layer above and the output layer below
            layers.append(nn.Linear(hidden_dim, hidden_dim))
            layers.append(activation_fn)
            layers.append(nn.Dropout(p=dr))

        # Add output layer
        # layers.append(nn.Dropout(p=dr))
        layers.append(nn.Linear(hidden_dim, 1))

        # Create the sequential model
        self.model = nn.Sequential(*layers)

    def forward(self, state, action):
        """
        Forward pass of the Critic network.

        Args:
        - state (torch.Tensor): State tensor.
        - action (torch.Tensor): Action tensor.

        Returns:
        - Q-value estimation.
        """

        # 🔍 Print debug info
        # print("Critic Network forward (((((((((((((((((((((((((((((((((((((())))))))))))))))))))))))))))))))))))))")
        # print(f"State shape before reshape: {state.shape}, Action shape before reshape: {action.shape}")

        # 🔄 Flatten state if it has more than 2 dimensions (CNN case)
        if state.dim() > 2:
            state = state.view(state.shape[0], -1)  # Convert to (batch_size, features)

        # 🔄 Ensure action is 2D
        if action.dim() > 2:
            action = action.view(action.shape[0], -1)  # Convert to (batch_size, action_dim)

        # 🔍 Print final shapes
        # print(f"State shape after reshape: {state.shape}, Action shape after reshape: {action.shape}")

        # ✅ Now both state and action are 2D → Safe to concatenate
        x = torch.cat([state, action], dim=1)

        # Forward pass through Critic layers
        x = self.model(x)

        return x





class CostNetwork(nn.Module):
    """
    Neural network for estimating portfolio risk (cost).
    """
    def __init__(self, state_dim, action_dim, hidden_dim):
        super().__init__()

        # Hyperparameter search may pass floats, so cast the dimensions to int
        state_dim = int(state_dim)
        action_dim = int(action_dim)
        hidden_dim = int(hidden_dim)

        self.model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)  # Outputs cost estimate
        )

    def forward(self, state, action):
        """
        Forward pass for the cost network.

        Computes:
        c_wv(s, a) = E[VaR(s, a)]  (Eq. 19 in the paper)

        Args:
        - state (torch.Tensor): State tensor with shape [batch_size, *]
        - action (torch.Tensor): Action tensor with shape [batch_size, action_dim]

        Returns:
        - Cost estimation (torch.Tensor)
        """
        # 🔄 Flatten state if it has more than 2 dimensions
        if state.dim() > 2:
            state = state.view(state.shape[0], -1)  # Reshape to [batch_size, flattened_features]

        # 🔄 Ensure action is 2D
        if action.dim() > 2:
            action = action.view(action.shape[0], -1)  # Reshape to [batch_size, action_dim]

        # ✅ Now both state and action are 2D → Safe to concatenate
        x = torch.cat([state, action], dim=1)
        # Forward pass through the Cost network
        return self.model(x)





In [None]:
#device = 'cpu'
# Set the device (CPU or GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from scipy.stats import norm  # For z-score

class DDPGagent:
    def __init__(self, env, params, max_memory_size=50000):
        """
        Initialize the DDPG agent with:
        - Actor-Critic Networks
        - Cost Network for risk constraints
        - Target Networks for stability
        - Lagrange multiplier for enforcing constraints
        """

        # 1️⃣ Define State & Action Space Dimensions
        self.data = env.envs[0].df

        self.num_states = env.observation_space.shape[0] * env.observation_space.shape[1]
        self.num_actions = env.action_space.shape[0]
        self.gamma = params['gamma']  # Discount factor (γ)
        self.tau = params['tau']  # Soft update factor (τ)
        self.batch_size = int(params['batch_size'])
        self.env = env

        # 2️⃣ Initialize Networks
        self.actor = Actor(self.num_states, self.num_actions, params['Ahidden_dim'],
                           params['Anum_layers'], params['Aact_fn'], params['Adr']).to(device)
        self.actor_target = Actor(self.num_states, self.num_actions, params['Ahidden_dim'],
                                  params['Anum_layers'], params['Aact_fn'], params['Adr']).to(device)

        self.critic = Critic(self.num_states, self.num_actions, params['Chidden_dim'],
                             params['Cnum_layers'], params['Cact_fn'], params['Cdr']).to(device)
        self.critic_target = Critic(self.num_states, self.num_actions, params['Chidden_dim'],
                                    params['Cnum_layers'], params['Cact_fn'], params['Cdr']).to(device)

        # 3️⃣ Initialize Cost Network for Constrained Reinforcement Learning
        self.cost_network = CostNetwork(self.num_states, self.num_actions, params['Chidden_dim']).to(device)
        self.cost_target = CostNetwork(self.num_states, self.num_actions, params['Chidden_dim']).to(device)

        # Copy weights to target networks
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.cost_target.load_state_dict(self.cost_network.state_dict())

        # 4️⃣ Training Setup
        self.memory = Memory(max_memory_size)
        self.critic_criterion = nn.MSELoss()
        self.cost_criterion = nn.MSELoss()
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=params['alr'])
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=params['clr'])
        self.cost_optimizer = optim.Adam(self.cost_network.parameters(), lr=params['clr'])

        # 5️⃣ Initialize Lagrange Multiplier for Constraint Enforcement
        self.lambda_ = torch.tensor(0.01, device=device)
        # Step size for updating lambda; falls back to 0.01 when `rho` is not in `params`
        self.rho = params.get('rho', 0.01)
        self.violations = 0
        self.zeta = env.envs[0].turbulence_threshold  # Risk threshold ζ used in update()


    def get_action(self, state):
        """Return the actor's action for `state` as a detached CPU tensor."""
        state_tensor = torch.FloatTensor(state).to(device)
        with torch.no_grad():
            action = self.actor(state_tensor).cpu()
        return action



    def VaR(self, states, actions, confidence_level=0.95):
        """
        Parametric risk measure for a batch of (state, action) pairs.

        Assumes the first `num_assets` rows of each state hold the covariance
        matrix and the last row holds the historical volatility of each asset.
        """
        states = states.to(device)
        actions = actions.to(device)

        num_assets = 30

        states = states.squeeze(1)  # [batch_size, 38, 30]
        actions = actions.reshape(actions.shape[0], -1)  # [batch_size, 30]

        cov_matrix = states[:, :num_assets, :]  # [batch_size, 30, 30]
        hist_volatility = states[:, -1, :]  # [batch_size, 30]

        # One-sided z-score for the confidence level (about 1.645 at 95%)
        z_score = float(norm.ppf(confidence_level))
        individual_VaR = z_score * hist_volatility  # [batch_size, 30]

        # Quadratic form sum_i sum_j (a_i VaR_i) cov_ij (a_j VaR_j), vectorized over the batch
        weighted = actions[:, :num_assets] * individual_VaR
        VaR_portfolio = torch.einsum('bi,bij,bj->b', weighted, cov_matrix, weighted)

        return VaR_portfolio







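    def compute_cost_target(self, states, actions, next_states, dones):
        r"""
        Compute the target cost using the Bellman equation.

        Equation (20):
        c_{w_v}(s, a) = VaR(s, a) + \eta (1 - d) c'_{w_v'}(s', a')

        The discount \eta is implemented with `self.gamma`.
        """
        next_actions = self.actor_target.forward(next_states)  # π'(s')
        next_cost = self.cost_target.forward(next_states, next_actions.detach())  # c'_{w_v'}(s', a')

        # Immediate cost uses the current (s, a), as in Eq. (20). Reshape to
        # [batch_size, 1] so it combines element-wise with the cost network output.
        immediate_cost = self.VaR(states, actions).reshape(-1, 1)
        cost_target = immediate_cost + self.gamma * (1 - dones.reshape(-1, 1)) * next_cost
        return cost_target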

    def update(self):
        """
        Perform one update step for the Actor, Critic, and Cost networks.
        """

        # 1️⃣ Sample a batch from the Replay Buffer
        states, actions, rewards, next_states, dones = self.memory.sample(self.batch_size)


        # 2️⃣ Convert the sampled batch to float tensors on the training device
        states = torch.FloatTensor(states).to(device)
        actions = torch.FloatTensor(actions).to(device)
        rewards = torch.FloatTensor(rewards).to(device)
        next_states = torch.FloatTensor(next_states).to(device)
        dones = torch.FloatTensor(dones).to(device)

        # 4️⃣ Compute Target Q-Value using Bellman Equation (Eq. 5)
        # Q(s, a) = r + γ(1 - d)Q'(s', π'(s'))
        with torch.no_grad():
            next_actions = self.actor_target.forward(next_states)
            next_q = self.critic_target.forward(next_states, next_actions)
            # Reshape to [batch_size, 1] so the terms combine element-wise
            Q_target = rewards.reshape(-1, 1) + self.gamma * (1 - dones.reshape(-1, 1)) * next_q

        # 6️⃣ Compute Critic Loss (Eq. 6) and update the critic
        # L = 1/N \sum (Q(s, a) - Q_target)^2
        critic_loss = self.critic_criterion(self.critic.forward(states, actions), Q_target)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # 8️⃣ Compute Cost Network Loss (Eq. 21) and update the cost network
        # L_C = 1/N \sum (c_{w_v}(s, a) - VaR(s, a) - η (1 - d) c'_{w_v'}(s', a'))^2
        cost_pred = self.cost_network.forward(states, actions)
        cost_target = self.compute_cost_target(states, actions, next_states, dones).detach()
        cost_loss = self.cost_criterion(cost_pred, cost_target)
        self.cost_optimizer.zero_grad()
        cost_loss.backward()
        self.cost_optimizer.step()

        # 🔟 Compute Actor Loss using Lagrangian method (Eq. 13)
        # L(w_π, λ) = -J_{w_π} + \sum \lambda_j C_{w_π, j} + \frac{\rho}{2} \sum (C_{w_π, j})^2
        policy_loss = -self.critic.forward(states, self.actor.forward(states)).mean()
        constraint_penalty = cost_target

        # Count how many sampled transitions exceed the risk threshold ζ
        violations_count = (constraint_penalty > self.zeta).sum().item()
        self.violations = self.violations + violations_count

        # Hinge penalty: max(0, cost - ζ)
        constraint_penalty = torch.clamp(constraint_penalty - self.zeta, min=0.0)

        quadratic_penalty = (self.rho / 2) * (self.cost_network.forward(states, actions) ** 2).mean()

        # As written, both penalty terms come from detached targets or replayed actions,
        # so only policy_loss carries gradient to the actor.
        actor_loss = (policy_loss + constraint_penalty + quadratic_penalty).mean()

        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()



        # 1️⃣3️⃣ Soft Update of Target Networks (Eq. 14): θ' ← τθ + (1 - τ)θ'
        with torch.no_grad():
            for target_net, net in ((self.critic_target, self.critic),
                                    (self.actor_target, self.actor),
                                    (self.cost_target, self.cost_network)):
                for target_param, param in zip(target_net.parameters(), net.parameters()):
                    target_param.copy_(self.tau * param + (1.0 - self.tau) * target_param)

    def buffer_fill(self, buffer_size):
        """Pre-fill the replay buffer with `buffer_size` noisy transitions."""
        state = self.env.reset()

        for _ in range(buffer_size):
            action = self.get_action(state)
            action = Noise(action, self.env.action_space)
            new_state, reward, done, _ = self.env.step(action)
            self.memory.push(state, action, reward, new_state, done)
            # Advance to the next observation; the vectorized env resets itself when an episode ends
            state = new_state

    def trade(self, val_env, e_val_gym):
        """Run the current policy on `val_env`; return account values, actions, and total reward."""
        rewards = []
        state = val_env.reset()
        n_steps = len(e_val_gym.df.index.unique())

        for i in range(n_steps):
            action = self.get_action(state)
            next_obs, reward, done, _ = val_env.step(action.detach().numpy())
            rewards.append(reward)

            # Save the memories one step before the end, because the vectorized
            # environment resets itself once the episode finishes
            if i == n_steps - 2:
                account_memory = val_env.env_method(method_name="save_asset_memory")
                actions_memory = val_env.env_method(method_name="save_action_memory")

            if done[0]:
                print("hit end!")
                break
            state = next_obs

        return account_memory, actions_memory, sum(rewards)
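
The `VaR` method of the agent above evaluates a quadratic form over asset pairs: each weight is scaled by its per-asset risk, then combined through the covariance matrix. As a minimal sketch (NumPy instead of PyTorch, random toy data, and hypothetical function names that are not part of the notebook), the same quantity can be written once as an explicit double loop and once as a single `einsum`, and the two can be compared numerically:

```python
import numpy as np

def risk_loop(weights, asset_var, cov):
    """Explicit double loop over asset pairs (the quadratic form written out)."""
    batch, n = weights.shape
    out = np.zeros(batch)
    for i in range(n):
        for j in range(n):
            out += (weights[:, i] * asset_var[:, i]
                    * weights[:, j] * asset_var[:, j] * cov[:, i, j])
    return out

def risk_einsum(weights, asset_var, cov):
    """Same quadratic form, vectorized over the batch dimension."""
    scaled = weights * asset_var
    return np.einsum("bi,bij,bj->b", scaled, cov, scaled)

# Toy data (assumed shapes): 4 samples, 5 assets
rng = np.random.default_rng(0)
batch, n = 4, 5
weights = rng.dirichlet(np.ones(n), size=batch)                 # rows sum to 1
asset_var = 1.645 * rng.uniform(0.01, 0.05, size=(batch, n))    # z-score * volatility
a = rng.normal(size=(batch, n, n))
cov = a @ a.transpose(0, 2, 1)                                  # symmetric, positive semi-definite

loop_result = risk_loop(weights, asset_var, cov)
einsum_result = risk_einsum(weights, asset_var, cov)
print(np.allclose(loop_result, einsum_result))
```

The vectorized form avoids the per-pair Python loop, which matters when the function is called on every training batch.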



In [None]:
#Calculate the Sharpe ratio
#This is our objective for tuning
def calculate_sharpe(df):
  #df['daily_return'] = df['account_value'].pct_change(1)
  if df['daily_return'].std() !=0:
    sharpe = (252**0.5)*df['daily_return'].mean()/ \
          df['daily_return'].std()
    return sharpe
  else:
    return 0
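
To make the tuning objective concrete, here is a small worked sketch of the same annualized Sharpe ratio on a hand-made return series. It uses NumPy only, assumes a zero risk-free rate and 252 trading days per year as in `calculate_sharpe`, and the function name `annualized_sharpe` is hypothetical:

```python
import numpy as np

def annualized_sharpe(daily_returns, periods_per_year=252):
    """sqrt(periods) * mean / std of daily returns; 0 when the series has no variation."""
    r = np.asarray(daily_returns, dtype=float)
    std = r.std(ddof=1)  # sample standard deviation, the pandas default
    if std == 0:
        return 0.0
    return float(np.sqrt(periods_per_year) * r.mean() / std)

example_returns = [0.01, -0.005, 0.007, 0.002, -0.001]
sharpe = annualized_sharpe(example_returns)
flat = annualized_sharpe([0.0, 0.0, 0.0])  # constant series falls back to 0

print(sharpe, flat)
```

Because hyperopt minimizes its objective, the search later returns the negative of this value.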

In [None]:
space = {
    'Ahidden_dim': hp.quniform('Ahidden_dim', 2, 512, 1),
    'Anum_layers': hp.quniform('Anum_layers', 1, 8, 1),
    'Chidden_dim': hp.quniform('Chidden_dim', 2, 512, 1),
    'Cnum_layers': hp.quniform('Cnum_layers', 1, 8, 1),

    'alr': hp.loguniform('alr', -8, -1),  # Actor learning rate, sampled in [e^-8, e^-1] ≈ [3.4e-4, 0.37]
    'clr': hp.loguniform('clr', -8, -1),  # Critic learning rate, same range
    'gamma': hp.uniform('gamma', 0.9, 0.99),  # Discount factor
    'tau': hp.uniform('tau', 0.08, 0.2),  # Soft target update rate
    'batch_size': hp.quniform('batch_size', 32, 256, 32),  # Mini-batch size

    'Aact_fn': hp.choice('Aact_fn', ['relu', 'tanh', 'sigmoid']),  # Actor activation
    'Adr': hp.uniform('Adr', 0, 0.5),  # Actor dropout
    'Cact_fn': hp.choice('Cact_fn', ['relu', 'tanh', 'sigmoid']),  # Critic activation
    'Cdr': hp.uniform('Cdr', 0, 0.5),  # Critic dropout

    # 🚀 Additional hyperparameters; some are sampled for completeness and are not yet consumed by `DDPGagent`
    'rho': hp.uniform('rho', 0.001, 0.1),  # Lagrange multiplier update step size
    # 'lambda_init': hp.uniform('lambda_init', 0.01, 1.0),  # Initial value of λ
    'buffer_size': hp.quniform('buffer_size', 10000, 1000000, 10000),  # Replay buffer size
    'noise_std': hp.uniform('noise_std', 0.01, 0.3),  # Exploration noise level
    'grad_clip': hp.uniform('grad_clip', 0.1, 10.0),  # Gradient clipping threshold
    'warmup_steps': hp.quniform('warmup_steps', 1000, 50000, 1000),  # Steps before training starts
    'reward_scaling': hp.uniform('reward_scaling', 0.1, 10.0)  # Reward scaling factor
}


def objective(params):
    print(params)
    # hp.quniform returns floats, so cast the integer-valued hyperparameters
    for key in ('Ahidden_dim', 'Anum_layers', 'Chidden_dim', 'Cnum_layers',
                'batch_size', 'buffer_size', 'warmup_steps'):
        params[key] = int(params[key])

    model = DDPGagent(env_train, params)
    model.buffer_fill(500)
    model.update()

    account_memory, actions_memory, total_reward = model.trade(env_val, e_val_gym)
    print(f"Validation reward: {total_reward}")

    # hyperopt minimizes the objective, so return the negative Sharpe ratio
    sharpe = calculate_sharpe(account_memory[0])
    return -sharpe

In [None]:
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals= 10 , trials=Trials()) #max_evals = 500

In [None]:
best['Aact_fn'] = ['relu', 'tanh', 'sigmoid'][best['Aact_fn']]
best['Cact_fn'] = ['relu', 'tanh', 'sigmoid'][best['Cact_fn']]
best

In [None]:
# best= {'Aact_fn': 'sigmoid',
#  'Adr': np.float64(0.40156739994186963),
#  'Ahidden_dim': np.float64(320.0),
#  'Anum_layers': np.float64(2.0),
#  'Cact_fn': 'tanh',
#  'Cdr': np.float64(0.1254413062439358),
#  'Chidden_dim': np.float64(251.0),
#  'Cnum_layers': np.float64(5.0),
#  'alr': np.float64(0.0009107119585481395),
#  'batch_size': np.float64(128.0),
#  'buffer_size': np.float64(750000.0),
#  'clr': np.float64(0.011909541457606108),
#  'gamma': np.float64(0.9802858202890133),
#  'grad_clip': np.float64(4.985696483611227),
#  'lambda_init': np.float64(0.4925723364653761),
#  'noise_std': np.float64(0.23649933117828195),
#  'reward_scaling': np.float64(4.922508316091725),
#  'rho': np.float64(0.0701825073670665),
#  'tau': np.float64(0.1185136176470432),
#  'warmup_steps': np.float64(9000.0)}

In [None]:
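# fmin returns quniform values as floats; cast the integer-valued ones before building the agent
for key in ('Ahidden_dim', 'Anum_layers', 'Chidden_dim', 'Cnum_layers', 'batch_size'):
    best[key] = int(best[key])

agent = DDPGagent(env_full_train, best)

batch_size = agent.batch_size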

In [None]:
rewards = []
avg_rewards = []
num_episodes = 5 #1000

# Anomaly detection helps trace NaN/inf gradients but slows training; disable it for long runs
torch.autograd.set_detect_anomaly(True)
for episode in range(num_episodes):

    state = env_full_train.reset()
    episode_reward = 0
    done = False

    print(f"Episode: {episode+1}")
    while not done:
        action = agent.get_action(state)
        action = Noise(action, env_full_train.action_space)
        new_state, reward, done, info = env_full_train.step(action)
        agent.memory.push(state, action, reward, new_state, done)

        if len(agent.memory) > batch_size:
            agent.update()

        state = new_state
        episode_reward = episode_reward + reward

    # Update the Lagrange multiplier at the end of each episode:
    # λ ← λ + ρ · c(s, a), evaluated at the final state of the episode
    device = next(agent.cost_network.parameters()).device  # Get the device of the cost network

    state_tensor = torch.tensor(np.expand_dims(state, axis=1), dtype=torch.float32, device=device)
    action_tensor = agent.get_action(state).to(device)  # Ensure action is also on same device

    agent.lambda_ = agent.lambda_ + agent.rho * agent.cost_network.forward(
        state_tensor,
        action_tensor
    ).mean().detach().to(device)

    # Gradually increase the penalty coefficient ρ
    agent.rho = agent.rho * 1.008

    rewards.append(episode_reward)
    avg_rewards.append(np.mean(rewards[-10:]))
    print(f"Episode: {episode+1}, Total Reward: {episode_reward}")
    print("Violations:", agent.violations)
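
The end-of-episode step in the training loop raises λ in proportion to the estimated cost and grows ρ by a fixed factor. A common variant in constrained RL, shown here only as a hedged sketch and not as what the notebook does, measures the cost relative to the threshold ζ and projects λ back onto non-negative values. The function name, the threshold of 0.5, and the cost values are all hypothetical:

```python
def dual_ascent_step(lam, rho, estimated_cost, threshold, rho_growth=1.008):
    """One projected dual-ascent step: lam <- max(0, lam + rho * (cost - threshold))."""
    violation = estimated_cost - threshold
    new_lam = max(0.0, lam + rho * violation)
    new_rho = rho * rho_growth  # penalty coefficient grows geometrically
    return new_lam, new_rho

lam, rho = 0.01, 0.01
history = []
for cost in [0.8, 0.6, 0.3, 0.1]:  # made-up per-episode cost estimates
    lam, rho = dual_ascent_step(lam, rho, cost, threshold=0.5)
    history.append(lam)

print(history)
```

With this form λ rises while the estimated cost exceeds the threshold and decays once the constraint is satisfied, instead of growing whenever the cost estimate is positive.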




In [None]:
import pickle
import matplotlib.pyplot as plt
from google.colab import files
# Save to file
with open("ddpg_agent.pkl", "wb") as f:
    pickle.dump(agent, f)

files.download("ddpg_agent.pkl")

In [None]:


# Create a figure and axis
fig, ax = plt.subplots()

# Plot the rewards
ax.plot(rewards, label='Rewards')
ax.plot(avg_rewards, label='Average Rewards')

# Label the axes
ax.set_xlabel('Episode')
ax.set_ylabel('Reward')

# Add legend
ax.legend()

# Show the plot
plt.show()
fig.savefig('rewards_plot.png')
files.download("rewards_plot.png")

# Now `fig` contains the plot and can be saved or manipulated


## Trading
Assume that we have $1,000,000 of initial capital on 2023-01-01, the start of the trade period. We use the trained DDPG agent to trade the stocks of the selected index.

Import the trading DataFrame and the historical-volatility (`hist_vol`) DataFrame.

In [None]:
indices= [sensex_ticker, Dow_30, dax_30, nikkei_top30_symbols, FTSE_top30, twse_top30, hang_seng_symbols, brazil_tickers, ibex35_tickers, bist100_top30_tickers ]

# indices = {
#     "sensex_ticker": sensex_ticker,
#     "Dow_30": Dow_30,
#     "dax_30": dax_30,
#     "nikkei_top30_symbols": nikkei_top30_symbols,
#     "FTSE_top30": FTSE_top30,
#     "twse_top30": twse_top30,
#     "hang_seng_symbols": hang_seng_symbols,
#     "brazil_tickers": brazil_tickers,
#     "ibex_35_tickers": ibex35_tickers,
#     "bist100_top30_tickers": bist100_top30_tickers
# }

# Placeholder: replace with the file-name suffix of the index you want to trade
file_name = "____index that you want to trade____"

df= pd.read_csv(f'/content/stock_data_{file_name}')
hist_vol_trade = pd.read_csv(f'/content/hist_vol_{file_name}')


In [None]:
trade = data_split(df,'2023-01-01', '2025-02-28')

In [None]:
# TURBULENCE_THRESHOLD is assumed to be defined earlier in the notebook
env_kwargs_trade = {
    "hmax": 100,
    "initial_amount": 1000000,
    "transaction_cost_pct": 0.001,
    "state_space": state_space,
    "stock_dim": stock_dimension,
    "tech_indicator_list": INDICATORS,
    "action_space": stock_dimension,
    "reward_scaling": 1e-4,
    "hist_vol":hist_vol_trade,
    "turbulence_threshold": TURBULENCE_THRESHOLD
}

In [None]:
e_trade_gym = StockPortfolioEnv(df = trade, **env_kwargs_trade)
test_env, test_obs = e_trade_gym.get_sb_env()

In [None]:
account_memory, actions_memory, total_reward = agent.trade(test_env, e_trade_gym)
violations = agent.violations

print("Violations:", violations)
print("Total reward:", total_reward)

In [None]:
calculate_sharpe(account_memory[0])

In [None]:
account_memory[0].head()

In [None]:
account_memory[0].to_csv('/content/df_daily_return.csv')
files.download('df_daily_return.csv')

In [None]:
actions_memory[0].head()

In [None]:
actions_memory[0].to_csv('/content/df_actions.csv')
files.download('df_actions.csv')

In [None]:
df_daily_return = account_memory[0]

<a id='6'></a>
# Part 7: Backtest Our Strategy
Backtesting plays a key role in evaluating the performance of a trading strategy. An automated backtesting tool is preferred because it reduces human error. We usually use the Quantopian pyfolio package to backtest our trading strategies. It is easy to use and provides a set of individual plots that together give a comprehensive picture of a strategy's performance.

In [None]:
#calculate_portfolio_minimum_variance
portfolio = pd.DataFrame(index = range(1), columns = unique_trade_date)
initial_capital = 1000000
portfolio.loc[0,unique_trade_date[0]] = initial_capital

# Define transaction cost rate
transaction_cost_rate = 0.005

for i in range(len( unique_trade_date)-1):
    df_temp = df[df.date==unique_trade_date[i]].reset_index(drop=True)
    df_temp_next = df[df.date==unique_trade_date[i+1]].reset_index(drop=True)
    #Sigma = risk_models.sample_cov(df_temp.return_list[0])
    #calculate covariance matrix
    Sigma = df_temp.return_list[0].cov()
    #portfolio allocation
    ef_min_var = EfficientFrontier(None, Sigma,weight_bounds=(0, 0.1))
    #minimum variance
    raw_weights_min_var = ef_min_var.min_volatility()
    #get weights
    cleaned_weights_min_var = ef_min_var.clean_weights()

    #current capital
    cap = portfolio.iloc[0, i]
    #current cash invested for each stock
    current_cash = [element * cap for element in list(cleaned_weights_min_var.values())]
    # current held shares
    current_shares = list(np.array(current_cash)
                                      / np.array(df_temp.close))
    # next time period price
    next_price = np.array(df_temp_next.close)

    # Calculate next portfolio value without transaction cost
    next_value = np.dot(current_shares, next_price)

    # Calculate transaction costs
    new_shares = current_cash / next_price
    share_differences = np.abs(new_shares - current_shares)
    transaction_cost = np.sum(share_differences * next_price * transaction_cost_rate)

    # Deduct transaction cost from portfolio value
    portfolio.iloc[0, i + 1] = next_value - transaction_cost

portfolio=portfolio.T
portfolio.columns = ['account_value']
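
The loop above repeats one step per trade date: allocate capital by the min-variance weights, value the holdings at the next price, and subtract a cost for re-establishing the same cash allocation. A minimal sketch of that single step on two made-up assets (NumPy only; the function name and all numbers are hypothetical) looks like this:

```python
import numpy as np

def one_step_value(capital, weights, price_now, price_next, cost_rate=0.005):
    """Value after one period, net of the cost of rebalancing to the same cash split."""
    weights = np.asarray(weights, dtype=float)
    price_now = np.asarray(price_now, dtype=float)
    price_next = np.asarray(price_next, dtype=float)

    cash = capital * weights                 # cash allocated to each asset
    shares = cash / price_now                # shares held over the period
    next_value = float(shares @ price_next)  # value before costs

    rebalanced_shares = cash / price_next    # shares for the same cash at the new price
    cost = float(np.sum(np.abs(rebalanced_shares - shares) * price_next * cost_rate))
    return next_value - cost, cost

net_value, cost = one_step_value(
    capital=1000.0, weights=[0.5, 0.5], price_now=[10.0, 20.0], price_next=[11.0, 20.0]
)
print(net_value, cost)
```

Here the first asset gains 10%, so the holdings are worth 1050 before costs, and trading the 50 of drifted value at a 0.5% rate costs 0.25.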

In [None]:
def calculate_daily_return(current_value, previous_value):
    return (current_value - previous_value) / previous_value

# Calculate daily return and add it as a new column
daily_returns = [0]  # Daily return for the first day is assumed to be 0
for i in range(1, len(portfolio)):
    # Use positional access: the index holds trade dates, not integers
    current_value = portfolio['account_value'].iloc[i]
    previous_value = portfolio['account_value'].iloc[i - 1]
    daily_returns.append(calculate_daily_return(current_value, previous_value))

portfolio['daily_return'] = daily_returns

print(portfolio)

In [None]:
portfolio.head()

In [None]:
# Cumulative return of the agent, using the daily returns saved from the trade run
Agent = (df_daily_return.daily_return + 1).cumprod() - 1

In [None]:
min_var_cumpod =(portfolio.account_value.pct_change()+1).cumprod()-1
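
The cumulative-return series in these cells all use the same compounding identity: the cumulative return after day t is the product of (1 + r) over the days so far, minus 1. As a small sketch with made-up account values (NumPy only, hypothetical function name), the result can be checked against the direct ratio to the starting value:

```python
import numpy as np

def cumulative_return(daily_returns):
    """Compound daily returns: prod(1 + r) - 1 at each step."""
    r = np.asarray(daily_returns, dtype=float)
    return np.cumprod(1.0 + r) - 1.0

values = np.array([100.0, 110.0, 99.0, 108.9])  # toy account values
daily = values[1:] / values[:-1] - 1.0          # day-over-day returns
cum = cumulative_return(daily)

print(cum)
print(np.allclose(cum, values[1:] / values[0] - 1.0))
```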

In [None]:
portfolio.drop(columns=['account_value'], inplace=True)
portfolio.to_csv('Markowitz_Portfolio_Return_'+ Market +'.csv')
files.download('Markowitz_Portfolio_Return_'+ Market +'.csv')

In [None]:
Baseline =(baseline_returns+1).cumprod()-1

## Plotly: DRL, Min-Variance, DJIA

In [None]:
%pip install plotly

In [None]:
from datetime import datetime as dt

import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go

In [None]:
time_ind = pd.Series(df_daily_return.date)

In [None]:
trace0_portfolio = go.Scatter(x = time_ind, y = Agent, mode = 'lines', name = 'Agent (Portfolio Allocation)')

trace1_portfolio = go.Scatter(x = time_ind, y = Baseline, mode = 'lines', name = 'Baseline')
trace2_portfolio = go.Scatter(x = time_ind, y = min_var_cumpod, mode = 'lines', name = 'Min-Variance')
#trace3_portfolio = go.Scatter(x = time_ind, y = a2c_cumpod_esg, mode = 'lines', name = 'ESG-A2C (Portfolio Allocation)')
#trace3_portfolio = go.Scatter(x = time_ind, y = ddpg_cumpod, mode = 'lines', name = 'DDPG')
#trace4_portfolio = go.Scatter(x = time_ind, y = addpg_cumpod, mode = 'lines', name = 'Adaptive-DDPG')
#trace5_portfolio = go.Scatter(x = time_ind, y = min_cumpod, mode = 'lines', name = 'Min-Variance')

#trace4 = go.Scatter(x = time_ind, y = addpg_cumpod, mode = 'lines', name = 'Adaptive-DDPG')

#trace2 = go.Scatter(x = time_ind, y = portfolio_cost_minv, mode = 'lines', name = 'Min-Variance')
#trace3 = go.Scatter(x = time_ind, y = spx_value, mode = 'lines', name = 'SPX')

In [None]:
fig = go.Figure()
fig.add_trace(trace0_portfolio)

fig.add_trace(trace1_portfolio)

fig.add_trace(trace2_portfolio)

#fig.add_trace(trace3_portfolio)

fig.update_layout(
    legend=dict(
        x=0,
        y=1,
        traceorder="normal",
        font=dict(
            family="sans-serif",
            size=15,
            color="black"
        ),
        bgcolor="White",
        bordercolor="white",
        borderwidth=2

    ),
)
#fig.update_layout(legend_orientation="h")
fig.update_layout(title={
        #'text': "Cumulative Return using FinRL",
        'y':0.85,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
#with Transaction cost
#fig.update_layout(title =  'Quarterly Trade Date')
fig.update_layout(
    paper_bgcolor='rgba(1,1,0,0)',
    plot_bgcolor='rgba(1, 1, 0, 0)',
    yaxis_title="Cumulative Return",
    xaxis={'type': 'date',
           'tick0': time_ind[0],
           'tickmode': 'linear',
           'dtick': 86400000.0 * 80}  # one tick every 80 days (dtick is in milliseconds)
)
fig.update_xaxes(showline=True,linecolor='black',showgrid=True, gridwidth=1, gridcolor='LightSteelBlue',mirror=True)
fig.update_yaxes(showline=True,linecolor='black',showgrid=True, gridwidth=1, gridcolor='LightSteelBlue',mirror=True)
fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='LightSteelBlue')

fig.show()