
<br>


# <span style="color:#5E6997"> Unleashing the Power of Deep Reinforcement Learning for Trading </span>
## <span style="color:#5E6997"> Exploring the Intersection of Reinforcement Learning and Neural Networks </span>


<br>

### **Table of Contents**
* [<span style="color:#A690A4"> 1. Introduction </span>](#intro)
* [<span style="color:#A690A4"> 2. Data Engineering </span>](#data_eng)
* [<span style="color:#A690A4"> 3. DRL Environment Architecture </span>](#env)
* [<span style="color:#A690A4"> 4. DRL Agent Architecture </span>](#agent)

<br>
    
## <span style="color:#5E6997"> Project Overview </span> <a class="anchor" id="intro"></a>

    
### Introduction
Financial markets are a dynamic, complex ecosystem where traditional trading strategies are continually challenged by evolving market conditions, global events, and intricate patterns that define asset prices. In response to this ever-changing landscape, our team has embarked on a pioneering journey to develop a sophisticated DRL-based algorithmic trading system.

Deep Reinforcement Learning, a subset of artificial intelligence, offers the promise of adaptive, self-improving trading strategies. Inspired by the way humans learn through trial and error, our DRL model navigates financial markets, learning and adapting to new information in real-time, all with the goal of maximizing returns while managing risk.

This project represents a convergence of finance, machine learning, and technology. Our DRL model is designed to make informed trading decisions, considering historical data, market indicators, and other relevant factors, ultimately offering the potential to outperform traditional trading methods. The ability to harness the power of neural networks, paired with reinforcement learning principles, promises a new era in algorithmic trading.

In this project, we will explore the development and implementation of our DRL model, highlighting its capabilities, challenges, and real-world implications in the realm of financial markets. We invite you to journey with us through the intricacies of this innovative approach, where artificial intelligence meets finance, and together, we aim to shape the future of trading.

<br>

### Objective & Scope
- Collect, clean & analyze Bitcoin crypto market data
- Feature engineering
- Build a DRL trading environment
- Train & improve the agent

<br>

### Deep Reinforcement Learning Model

- **Reinforcement Learning Algorithm (RLA):** PPO (Proximal Policy Optimization)

- **Neural Network Architecture for Policy (Agent's action):** MLP Policy (Multi-Layer Perceptron Policy)
    - **Architecture Details:**
        - **Input Layer:** The MLP policy takes observations from the environment as input.
        - **Hidden Layers:** The network typically consists of one or more hidden layers, each containing multiple neurons. The specific number of layers and neurons may vary depending on the complexity of the problem.
        - **Activation Functions:** Each layer in the MLP policy may utilize activation functions like ReLU (Rectified Linear Unit) or Tanh.
        - **Output Layer**: The output layer provides the probabilities or values associated with different actions the agent can take.
    - **Training Process:**
        - The model is trained using PPO, which involves optimizing the policy network to maximize cumulative rewards while ensuring that policy updates are within a safe margin.
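
The "safe margin" in the training process refers to PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to the interval [1 - clip_range, 1 + clip_range] before it is multiplied by the advantage. The following is a minimal plain-Python sketch of that objective, not the Stable-Baselines3 implementation. The function name `ppo_clipped_objective` is hypothetical, and the default `clip_range=0.2` matches the Stable-Baselines3 default.

```python
def ppo_clipped_objective(ratios, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate objective over a batch of samples.

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = advantage estimate for sample i
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        raise ValueError("at least one sample is required")

    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Keep the ratio inside [1 - clip_range, 1 + clip_range]
        clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Take the more pessimistic of the clipped and unclipped terms
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


# A large ratio with a positive advantage is capped at 1 + clip_range
capped = ppo_clipped_objective([1.5], [1.0])
# A ratio of exactly 1 (no policy change) leaves the advantage untouched
unchanged = ppo_clipped_objective([1.0], [2.0])
```

Taking the minimum of the two terms removes any incentive to move the policy further than the clip range in a single update.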

<br>
    
Enjoy, and I look forward to reading your feedback.
    
F.G.J

## <span style="color:#5E6997"> Data Engineering </span> <a class="anchor" id="data_eng"></a>

In [4]:
pip install stable-baselines3

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install "shimmy>=0.2.1"

Note: you may need to restart the kernel to use updated packages.


In [6]:
import pandas as pd
import gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
import warnings
warnings.filterwarnings("ignore")




In [7]:
df = pd.read_csv('./crypto_data.csv')

In [8]:
df.head()

| | date | symbol | open | high | low | close | volume usdt | tradecount | token | hour | day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-12-25 05:00:00 | 1INCHUSDT | 0.2 | 3.0885 | 0.2 | 2.5826 | 35530516 | 48768 | 1INCH | 5 | Friday |
| 1 | 2020-12-25 06:00:00 | 1INCHUSDT | 2.5824 | 2.69 | 2.2249 | 2.5059 | 22440875 | 31099 | 1INCH | 6 | Friday |
| 2 | 2020-12-25 07:00:00 | 1INCHUSDT | 2.5152 | 2.887 | 2.3609 | 2.6237 | 21300426 | 33001 | 1INCH | 7 | Friday |
| 3 | 2020-12-25 08:00:00 | 1INCHUSDT | 2.6318 | 2.8247 | 2.465 | 2.6134 | 17491813 | 30459 | 1INCH | 8 | Friday |
| 4 | 2020-12-25 09:00:00 | 1INCHUSDT | 2.6104 | 2.7498 | 2.5629 | 2.6365 | 9919400 | 21023 | 1INCH | 9 | Friday |


In [9]:
df.columns

Index(['date', 'symbol', 'open', 'high', 'low', 'close', 'volume usdt',
       'tradecount', 'token', 'hour', 'day'],
      dtype='object')

In [10]:
start_date = '2020-08-17 04:00:00'
end_date = '2023-10-19 23:00:00'

In [11]:
data_df = df.copy(deep=True)

In [12]:
data_df = data_df[(data_df['token'] == 'BTC') & (data_df['date'] >= start_date) & (data_df['date'] <= end_date)]

In [13]:
data_df.columns

Index(['date', 'symbol', 'open', 'high', 'low', 'close', 'volume usdt',
       'tradecount', 'token', 'hour', 'day'],
      dtype='object')

### Mapping 'day' names to numerical values

In [14]:
day_mapping = {
    'Monday': 1,
    'Tuesday': 2,
    'Wednesday': 3,
    'Thursday': 4,
    'Friday': 5,
    'Saturday': 6,
    'Sunday': 7
}

data_df['day'] = data_df['day'].apply(lambda x: day_mapping[x])
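
The mapping above numbers the days Monday=1 through Sunday=7, which is the same convention as Python's ISO weekday. As a small cross-check, the sketch below derives the day number directly from a timestamp string in the dataset's format. The helper name `day_number` is hypothetical and not part of the notebook.

```python
from datetime import datetime


def day_number(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the ISO weekday (Monday=1 ... Sunday=7) of a timestamp string."""
    return datetime.strptime(timestamp, fmt).isoweekday()


# First row of the raw data: labelled 'Friday' in the 'day' column
friday = day_number("2020-12-25 05:00:00")
# First BTC row after filtering: start_date is a Monday
monday = day_number("2020-08-17 04:00:00")
```

Both values agree with the dictionary: Friday maps to 5 and Monday maps to 1.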

### Creating EMA columns (spans 13, 25, 32, 100, 200)

In [15]:
data_df['ema_13'] = data_df['close'].ewm(span=13).mean()
data_df['ema_25'] = data_df['close'].ewm(span=25).mean()
data_df['ema_32'] = data_df['close'].ewm(span=32).mean()
data_df['ema_100'] = data_df['close'].ewm(span=100).mean()
data_df['ema_200'] = data_df['close'].ewm(span=200).mean()

### Creating Candle Volatility Against Price Close ((high - low) / close)

In [16]:
data_df['vol_close'] = (data_df['high'] - data_df['low']) / data_df['close']

### Creating EMA for Candle Volatility/Price 

In [17]:
# EMA with a span of 3 candles (hourly data)
data_df['vol_close_ema_3'] = data_df['vol_close'].ewm(span=3, adjust=False).mean()

# EMA with a span of 6 candles (hourly data)
data_df['vol_close_ema_6'] = data_df['vol_close'].ewm(span=6, adjust=False).mean()

# EMA with a span of 12 candles (hourly data)
data_df['vol_close_ema_12'] = data_df['vol_close'].ewm(span=12, adjust=False).mean()
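
For a given span, pandas uses the smoothing factor alpha = 2 / (span + 1). The price EMAs above rely on the default adjust=True, which normalises by the sum of the weights seen so far, while the volatility EMAs pass adjust=False, which gives the plain recursion y_t = alpha * x_t + (1 - alpha) * y_{t-1}. The sketch below reproduces both modes in plain Python so the arithmetic can be followed by hand. The function `ema` is a hypothetical helper, not part of the notebook.

```python
def ema(values, span, adjust=False):
    """Exponential moving average with alpha = 2 / (span + 1).

    adjust=False: y_t = alpha * x_t + (1 - alpha) * y_{t-1}, with y_0 = x_0
    adjust=True:  weighted mean with weights (1 - alpha) ** i over past values
    """
    alpha = 2.0 / (span + 1)
    decay = 1.0 - alpha
    result = []
    if adjust:
        numerator = 0.0
        denominator = 0.0
        for x in values:
            # Running weighted sum and running sum of weights
            numerator = x + decay * numerator
            denominator = 1.0 + decay * denominator
            result.append(numerator / denominator)
    else:
        previous = None
        for x in values:
            previous = x if previous is None else alpha * x + decay * previous
            result.append(previous)
    return result


# First two 'close' values of the BTC data, span 13, pandas default adjust=True
ema_13 = ema([11809.38, 11800.01], span=13, adjust=True)
# First two 'vol_close' values, span 3, adjust=False
vol_ema_3 = ema([0.004789, 0.003975], span=3, adjust=False)
```

The second entries come out at roughly 11804.3346 and 0.004382, which match the ema_13 and vol_close_ema_3 values in the second row of train_df.head().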

In [18]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27812 entries, 343607 to 371418
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   date              27812 non-null  object 
 1   symbol            27812 non-null  object 
 2   open              27812 non-null  float64
 3   high              27812 non-null  float64
 4   low               27812 non-null  float64
 5   close             27812 non-null  float64
 6   volume usdt       27812 non-null  int64  
 7   tradecount        27812 non-null  int64  
 8   token             27812 non-null  object 
 9   hour              27812 non-null  int64  
 10  day               27812 non-null  int64  
 11  ema_13            27812 non-null  float64
 12  ema_25            27812 non-null  float64
 13  ema_32            27812 non-null  float64
 14  ema_100           27812 non-null  float64
 15  ema_200           27812 non-null  float64
 16  vol_close         27812 non-null  float

In [19]:
data_df['date'] = pd.to_datetime(data_df['date'])

In [20]:
columns_to_drop = ['symbol', 'token']
data_df = data_df.drop(columns=columns_to_drop)

In [21]:
#data_df = data_df.astype(float)
#data_df.set_index('date', inplace=True)

In [22]:
data_df.to_csv('data_featured.csv', index=False)

In [23]:
train_df = data_df.copy(deep=True)
train_df.reset_index(drop=True, inplace=True)

In [24]:
train_df.head()

| | date | open | high | low | close | volume usdt | tradecount | hour | day | ema_13 | ema_25 | ema_32 | ema_100 | ema_200 | vol_close | vol_close_ema_3 | vol_close_ema_6 | vol_close_ema_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-17 04:00:00 | 11844.72 | 11858.91 | 11802.35 | 11809.38 | 17196539 | 28800 | 4 | 1 | 11809.38 | 11809.38 | 11809.38 | 11809.38 | 11809.38 | 0.004789 | 0.004789 | 0.004789 | 0.004789 |
| 1 | 2020-08-17 05:00:00 | 11809.39 | 11836.9 | 11790.0 | 11800.01 | 19958274 | 28571 | 5 | 1 | 11804.334615 | 11804.5076 | 11804.548594 | 11804.64815 | 11804.671575 | 0.003975 | 0.004382 | 0.004557 | 0.004664 |
| 2 | 2020-08-17 06:00:00 | 11800.0 | 11846.74 | 11785.23 | 11806.37 | 19327915 | 29762 | 6 | 1 | 11805.119921 | 11805.178699 | 11805.194058 | 11805.233617 | 11805.243387 | 0.00521 | 0.004796 | 0.004743 | 0.004748 |
| 3 | 2020-08-17 07:00:00 | 11806.37 | 11843.01 | 11792.32 | 11807.21 | 22698672 | 33111 | 7 | 1 | 11805.768697 | 11805.749021 | 11805.746245 | 11805.742633 | 11805.74244 | 0.004293 | 0.004545 | 0.004615 | 0.004678 |
| 4 | 2020-08-17 08:00:00 | 11806.94 | 11885.0 | 11806.91 | 11868.77 | 28031116 | 37458 | 8 | 1 | 11822.518351 | 11820.447106 | 11819.974205 | 11818.857316 | 11818.601316 | 0.006579 | 0.005562 | 0.005176 | 0.004971 |


In [25]:
#df_test = data_df[10000:]
#df_test = df_test.reset_index(drop=True)

## <span style="color:#5E6997"> DRL Environment Architecture </span> <a class="anchor" id="env"></a>

**Environment Controls:**
- **data**: A DataFrame containing the financial data, passed when the environment is created, e.g. CryptoTradingEnv(train_df).
- **take_profit_position_range**: A tuple representing the range for take profit levels (default (0.10, 0.80)).
- **stop_loss_position_range**: A tuple representing the range for stop loss levels (default (0.00, 0.15)).
- **max_stop_loss_position**: The maximum allowable stop loss (default 0.30).
- **initial_balance**: The initial capital (10000).
- **balance**: Current balance.
- **position**: The current position (0 for no position, 1 for long position).
- **position_open**: The price at which the position was opened.
- **num_trades**: The number of trades.
- **profit_loss**: The cumulative profit/loss.
- **position_size**: The position size, calculated within the code as 5% of the current balance.
 
 <br>
 
**Observations:**
- open, high, low, close, volume usdt, hour, day, ema_13, ema_25, ema_32, ema_100, ema_200, vol_close_ema_3, vol_close_ema_6, vol_close_ema_12
 
 <br>
 
**Initial Reward Calculation:**
- When an action is taken, the reward variable is initialized to 0.
- The variable trade_outcome is used to calculate the profit or loss associated with the most recent trade. It is initialized to 0.
 
 <br>
 
**Opening and Closing Trades:**
- If the agent takes the action to open a position (action code 1), and it is not already in a trade (self.position == 0), it calculates the position_size as 5% of the current balance and records the opening price.
- If the agent takes the action to close a position (action code 2), and it is currently in a trade (self.position == 1), it calculates the trade outcome as the current market price minus the opening price, so a price rise counts as a profit.
- The profit_loss is updated with the trade outcome, representing the cumulative profit or loss across all trades.
- The balance is adjusted based on the trade outcome.
 
 <br>
 
**Take Profit and Stop Loss:**
- When a trade is closed, the code checks whether the trade outcome is greater than 0 (a winning trade) and whether it reaches the take profit level defined by the lower bound of self.take_profit_position_range.
- Any winning trade (> 0) receives a reward of 0.5.
- If the trade also reaches the take profit level, an extra reward of 1 is assigned, for a total of 1.5.
- If the trade outcome is less than 0 (a losing trade), the code checks two conditions in order:
    - If the loss equals or exceeds the maximum allowable stop loss (self.max_stop_loss_position), a reward of -1.7 is assigned: the stop loss penalty of -1 plus a malus of -0.7.
    - Otherwise, if the loss reaches the stop loss level defined by self.stop_loss_position_range, a reward of -1 is assigned, indicating that the trade reached the stop loss.
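
The reward rules for a closed trade can be condensed into one small function. This is a sketch under one stated assumption: the take profit and stop loss parameters are read as fractions of the position size. The name `closing_reward` and its argument names are hypothetical. The reward values themselves (0.5, 1, -1, -1.7) are the ones used in the environment code.

```python
def closing_reward(trade_outcome, position_size,
                   take_profit_range=(0.10, 0.80),
                   stop_loss_range=(0.00, 0.15),
                   max_stop_loss=0.30):
    """Reward for closing a trade, given its profit or loss.

    Assumption: thresholds are fractions of position_size.
    """
    reward = 0.0
    if trade_outcome > 0:
        # Extra reward of 1 once the lower take profit bound is reached
        if trade_outcome >= take_profit_range[0] * position_size:
            reward = 1.0
        # Every profitable trade earns the 0.5 bonus
        reward += 0.5
    elif trade_outcome < 0:
        loss = abs(trade_outcome)
        if loss >= max_stop_loss * position_size:
            reward = -1.7  # maximum stop loss reached
        elif loss >= stop_loss_range[0] * position_size:
            reward = -1.0  # ordinary stop loss reached
    return reward


# Position of 500 (5% of a 10000 balance): take profit starts at 50,
# maximum stop loss at 150
big_win = closing_reward(60.0, 500.0)
small_win = closing_reward(10.0, 500.0)
big_loss = closing_reward(-200.0, 500.0)
small_loss = closing_reward(-50.0, 500.0)
```

With the default stop loss lower bound of 0.00, every losing trade below the maximum stop loss receives -1, so the "closed with no loss" case only applies when the outcome is exactly 0.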
 
 <br>
 
**Trade Outcome Within Range:**
- trade_within_range marks a trade whose outcome fell within the specified take profit range, and self.winning_trade_within_range is meant to count such trades. In v1 this counter is commented out, so the code only prints a message when a trade lands in the range.
 
 <br>
 
**Cumulative Returns and Episode Tracking:**
- The reward is also added to the episode_returns list to track rewards for the current episode.
- The cumulative_returns is updated with the reward to keep track of cumulative returns.
 
 <br>
 
**Completing an Episode:**
- When the current step reaches the end of the trading data (self.n_steps rows), the done flag is set to True, indicating that the episode has ended.

# DRL Environment v1

In [26]:
class CryptoTradingEnv(gym.Env):
    def __init__(self, data, take_profit_position_range=(0.10, 0.80), stop_loss_position_range=(0.00, 0.15), max_stop_loss_position=0.30):
        super(CryptoTradingEnv, self).__init__()

        self.data = data
        self.n_steps = len(data)
        self.current_step = 0
        self.initial_balance = 10000 
        self.balance = self.initial_balance
        self.position = 0
        self.position_open = 0 # Price at which the position was opened
        self.num_trades = 0  # Number of trades
        self.profit_loss = 0  # PnL Profit/Loss
        self.max_stop_loss_position = max_stop_loss_position 

        self.take_profit_position_range = take_profit_position_range
        self.stop_loss_position_range = stop_loss_position_range

        # 0=Hold, 1=Open Position, 2=Close Position
        self.action_space = gym.spaces.Discrete(3)

        # Observations
        n_features = 15 
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=(n_features,))

        self.episode_returns = []  
        self.cumulative_returns = 0 
        self.winning_trades = 0  
        self.losing_trades = 0 
        self.hourly_returns = []
        
        self.overall_rewards = 0


    def step(self, action):
        self.current_step += 1
        done = False

        reward = 0
        trade_outcome = 0
        trade_within_range = False

        if action == 1:  # Open Position
            if self.position == 0:  # Only open a position if not already in a trade
                # Position size is 5% of the current balance; it is stored so the
                # thresholds can use it when the trade is closed
                self.position_size = 0.05 * self.balance

                self.position_open = self.data.loc[self.current_step, 'open']
                self.position = 1
                self.num_trades += 1
                print(f"Opened trade at step {self.current_step} with position size: {self.position_size:.2f}")
        elif action == 2:  # Close Position
            if self.position == 1:  # Only close a position if currently in a trade
                position_close = self.data.loc[self.current_step, 'open']
                trade_outcome = position_close - self.position_open
                self.profit_loss += trade_outcome
                self.position = 0
                self.position_open = 0
                print(f"Closed trade at step {self.current_step}")
                print(f"------------------------------> Trade outcome: {trade_outcome}")

                self.balance += trade_outcome

                # Assumption: the range parameters are fractions of the position size
                # (self.position is only a 0/1 flag and has already been reset above)
                take_profit_min = self.take_profit_position_range[0] * self.position_size
                take_profit_max = self.take_profit_position_range[1] * self.position_size
                stop_loss_min = self.stop_loss_position_range[0] * self.position_size
                max_stop_loss = self.max_stop_loss_position * self.position_size

                # Check for take profit and stop loss
                if trade_outcome > 0:
                    if trade_outcome >= take_profit_min:
                        reward = 1  # Trade reached take profit
                        self.winning_trades += 1  # +1 winning trades count
                        if trade_outcome <= take_profit_max:
                           # self.trade_within_range = True
                            print("Trade reached take profit")
                elif trade_outcome < 0:
                    if abs(trade_outcome) >= max_stop_loss:
                        reward = -1.7  # Trade reached the maximum stop loss
                        self.losing_trades += 1  # +1 losing trades count
                        print("Trade reached the maximum stop loss")
                    elif abs(trade_outcome) >= stop_loss_min:
                        reward = -1  # Trade reached the stop loss
                        self.losing_trades += 1  # +1 losing trades count
                        print("Trade reached stop loss")
                    else:
                        # No reward (positive or negative) if the loss is below the stop loss level
                        reward = 0
                        print("Trade closed with no loss")

#                if trade_within_range:
#                    self.winning_trade_within_range += 1

        # Bonus of 0.5 for any profitable closed trade
        # (disabled alternative: reward += trade_outcome)
        if trade_outcome > 0:
            reward += 0.5
        print(f"Reward: {reward}")

        # reward for this episode
        self.episode_returns.append(reward)

        # cumulative returns
        self.cumulative_returns += reward

        # hourly returns for daily return calculation
#        self.hourly_returns.append(reward)

        self.overall_rewards += reward  

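        # The last valid row index is n_steps - 1; ending the episode there
        # keeps get_observation() inside the data
        if self.current_step >= self.n_steps - 1:
            done = True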

        next_state = self.get_observation()
        return next_state, reward, done, {}


    def reset(self):
        # Start a new episode from the first row with no open position
        self.current_step = 0
        self.balance = self.initial_balance
        self.position = 0
        self.position_open = 0
        # Per-step rewards are already added to cumulative_returns in step(),
        # so only the per-episode list is cleared here
        self.episode_returns = []
        return self.get_observation()

    def render(self, mode='human'):
        if mode == 'human':
            print(f"Step: {self.current_step}")
            print(f"Open Position: {self.position}")
            print(f"Trades: {self.num_trades} | Profit/Loss: {self.profit_loss:.2f}")
            print(f"Balance: {self.balance:.2f}")
            print(f"Winning Trades: {self.winning_trades} | Losing Trades: {self.losing_trades}")
            print(f"Overall Rewards: {self.overall_rewards:.2f}")
            print(self.data.loc[self.current_step])

    def close(self):
        pass

    def get_observation(self):
        obs = self.data.loc[self.current_step, [
            'open', 'high', 'low', 'close', 'volume usdt', 'hour', 'day',
            'ema_13', 'ema_25', 'ema_32', 'ema_100', 'ema_200', 'vol_close_ema_3', 'vol_close_ema_6', 'vol_close_ema_12'
        ]].values.astype(np.float32)
        return obs / obs.max()
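
get_observation divides each row by its largest entry. Because volume usdt is in the tens of millions while prices are around 10^4, every other feature ends up close to zero after that division. One common alternative, shown here as an assumption and not as the notebook's method, is to scale each feature separately with minimum and maximum values fitted on the training data. The helpers `fit_min_max` and `scale_row` are hypothetical and written in plain Python so they run without the notebook's dependencies. The sample rows hold the open, volume usdt and hour values of rows 1 to 3 of train_df.

```python
def fit_min_max(rows):
    """Return per-feature minimums and maximums for a list of equal-length rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def scale_row(row, mins, maxs):
    """Scale each feature of one row into [0, 1] using the fitted bounds."""
    scaled = []
    for value, low, high in zip(row, mins, maxs):
        span = high - low
        # A constant feature carries no information; map it to 0.0
        scaled.append((value - low) / span if span > 0 else 0.0)
    return scaled


rows = [
    [11809.39, 19958274.0, 5.0],
    [11800.00, 19327915.0, 6.0],
    [11806.37, 22698672.0, 7.0],
]
mins, maxs = fit_min_max(rows)
scaled_rows = [scale_row(row, mins, maxs) for row in rows]
```

After this scaling each feature spans the full [0, 1] interval on the fitted data, which matches the bounds declared in observation_space. Values outside the fitted range, for example in later data, would fall outside [0, 1] and would need clipping.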


## <span style="color:#5E6997"> DRL Agent Architecture </span> <a class="anchor" id="agent"></a>

# DRL Agent v1

In [28]:
num_episodes = 5000  # Number of environment steps taken by the loop below
print_interval = 100

env = CryptoTradingEnv(train_df)

obs = env.reset()

agent = PPO("MlpPolicy", DummyVecEnv([lambda: env]), verbose=1, tensorboard_log="./ppo_logs/")

max_steps = 50000  # Cap on steps, added to avoid an out-of-index error at the end of the data

# Each iteration takes a single environment step. agent.predict samples an action
# from the current policy; no agent.learn() call is made in this cell.
for episode in range(num_episodes):
    obs = np.array(obs, dtype=np.float32)
    action, _ = agent.predict(obs, deterministic=False)

    remaining_steps = min(max_steps - env.current_step, env.n_steps)

    if remaining_steps > 0:
        obs, reward, done, _ = env.step(action)
        env.render()

#    if done:
#        obs = env.reset()

    if (episode + 1) % print_interval == 0:
        print(f"Episode {episode + 1}: Final Rewards: {env.cumulative_returns:.2f} | Total Trades: {env.num_trades}")

Using cpu device
Reward: 0
Step: 1
Open Position: 0
Trades: 0 | Profit/Loss: 0.00
Balance: 10000.00
Winning Trades: 0 | Losing Trades: 0
Overall Rewards: 0.00
date                2020-08-17 05:00:00
open                           11809.39
high                            11836.9
low                             11790.0
close                          11800.01
volume usdt                    19958274
tradecount                        28571
hour                                  5
day                                   1
ema_13                     11804.334615
ema_25                       11804.5076
ema_32                     11804.548594
ema_100                     11804.64815
ema_200                    11804.671575
vol_close                      0.003975
vol_close_ema_3                0.004382
vol_close_ema_6                0.004557
vol_close_ema_12               0.004664
Name: 1, dtype: object
Reward: 0
Step: 2
Open Position: 0
Trades: 0 | Profit/Loss: 0.00
Balance: 10000.00
Winning Trades:

In [29]:
# %load_ext tensorboard

In [30]:
# %tensorboard --logdir ./ppo_logs/

### Save configuration

In [31]:
agent.save('v1_ppodrl')

# Test Validation Phase

### Import / load validation data

In [32]:
val_data = train_df

In [33]:
val_data = val_data.reset_index(drop=True)
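
The cells above reuse train_df as validation data, so an agent would be evaluated on the same candles it was fitted on. A common alternative for time series, sketched here as an assumption and not as the notebook's method, is a chronological hold-out: the earliest rows are used for training and the most recent rows for validation. The helper `chronological_split` and the 80% fraction are hypothetical choices.

```python
def chronological_split(n_rows, train_fraction=0.8):
    """Return (train_slice, val_slice) that split n_rows in time order."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(n_rows * train_fraction)
    # Earlier rows go to training, later rows to validation, with no shuffling
    return slice(0, cut), slice(cut, n_rows)


# 27812 is the number of hourly BTC rows reported by data_df.info()
train_slice, val_slice = chronological_split(27812)
```

The slices can be applied positionally, for example data_df.iloc[train_slice] and data_df.iloc[val_slice], followed by reset_index(drop=True) so that the environment's row lookups start at 0.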

In [34]:
val_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27812 entries, 0 to 27811
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              27812 non-null  datetime64[ns]
 1   open              27812 non-null  float64       
 2   high              27812 non-null  float64       
 3   low               27812 non-null  float64       
 4   close             27812 non-null  float64       
 5   volume usdt       27812 non-null  int64         
 6   tradecount        27812 non-null  int64         
 7   hour              27812 non-null  int64         
 8   day               27812 non-null  int64         
 9   ema_13            27812 non-null  float64       
 10  ema_25            27812 non-null  float64       
 11  ema_32            27812 non-null  float64       
 12  ema_100           27812 non-null  float64       
 13  ema_200           27812 non-null  float64       
 14  vol_close         2781

In [35]:
val_data.head()

| | date | open | high | low | close | volume usdt | tradecount | hour | day | ema_13 | ema_25 | ema_32 | ema_100 | ema_200 | vol_close | vol_close_ema_3 | vol_close_ema_6 | vol_close_ema_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-17 04:00:00 | 11844.72 | 11858.91 | 11802.35 | 11809.38 | 17196539 | 28800 | 4 | 1 | 11809.38 | 11809.38 | 11809.38 | 11809.38 | 11809.38 | 0.004789 | 0.004789 | 0.004789 | 0.004789 |
| 1 | 2020-08-17 05:00:00 | 11809.39 | 11836.9 | 11790.0 | 11800.01 | 19958274 | 28571 | 5 | 1 | 11804.334615 | 11804.5076 | 11804.548594 | 11804.64815 | 11804.671575 | 0.003975 | 0.004382 | 0.004557 | 0.004664 |
| 2 | 2020-08-17 06:00:00 | 11800.0 | 11846.74 | 11785.23 | 11806.37 | 19327915 | 29762 | 6 | 1 | 11805.119921 | 11805.178699 | 11805.194058 | 11805.233617 | 11805.243387 | 0.00521 | 0.004796 | 0.004743 | 0.004748 |
| 3 | 2020-08-17 07:00:00 | 11806.37 | 11843.01 | 11792.32 | 11807.21 | 22698672 | 33111 | 7 | 1 | 11805.768697 | 11805.749021 | 11805.746245 | 11805.742633 | 11805.74244 | 0.004293 | 0.004545 | 0.004615 | 0.004678 |
| 4 | 2020-08-17 08:00:00 | 11806.94 | 11885.0 | 11806.91 | 11868.77 | 28031116 | 37458 | 8 | 1 | 11822.518351 | 11820.447106 | 11819.974205 | 11818.857316 | 11818.601316 | 0.006579 | 0.005562 | 0.005176 | 0.004971 |


In [None]:
# pre_trained_agent = PPO.load('v1_ppodrl')

In [None]:
# env_test = CryptoTradingEnv(val_data)
# obs = env_test.reset()

# num_episodes = 1000

# for episode in range(num_episodes):
#     obs = np.array(obs, dtype=np.float32)
#     action, _ = agent.predict(obs, deterministic=True) 

#     remaining_steps = min(max_steps - env_test.current_step, env_test.n_steps)

#     if remaining_steps > 0:
#         obs, reward, done, _ = env_test.step(action)
#         env_test.render()

#     if (episode + 1) % print_interval == 0:
#         print(f"Episode {episode + 1}: Final Rewards: {env_test.cumulative_returns:.2f} | Total Trades: {env_test.num_trades}")

### Thank you for taking the time to go through my notebook