# Double Q-Learning Agent for Stock Trading

This repository contains an implementation of a Q-learning agent designed to make stock trading decisions based on historical price data. The agent learns an optimal trading strategy through interaction with the market environment using the Q-learning algorithm.

## Agent Architecture

The agent consists of a deep neural network that takes in a state representation and outputs Q-values for each possible action (buy, sell, hold). The state is represented by a sliding window of price differences over a specified window size. The neural network architecture includes the following components:

- Input layer: Accepts the state representation.
- Hidden layer: Applies the ReLU activation function to the input.
- Output layer: Produces the Q-values for each action using a linear activation function.

The agent also maintains a target network, which is a copy of the main network used for stable Q-value estimation during training.

## Q-Learning Process

The Q-learning process follows these steps:

1. **Initialization**: The agent's neural networks (main and target) are initialized with random weights, and the replay memory is emptied.

2. **State Representation**: At each time step, the agent observes the current state of the market, which is represented by a sliding window of price differences.

3. **Action Selection**: The agent selects an action based on an epsilon-greedy policy. With probability epsilon, the agent explores by selecting a random action, and with probability 1-epsilon, the agent exploits by selecting the action with the highest Q-value.

4. **State Transition**: The agent executes the selected action and observes the next state and the reward received from the market.

5. **Replay Memory**: The agent stores the transition (state, action, reward, next state, done) in the replay memory.

6. **Q-Value Update**: The agent samples a batch of transitions from the replay memory and updates the Q-values using the Q-learning update rule. The target Q-value for each transition is calculated based on the reward and the maximum Q-value of the next state obtained from the target network.

7. **Neural Network Update**: The agent's main neural network is updated using the sampled batch of transitions and the Q-learning loss function. The optimizer adjusts the network's weights to minimize the loss.

8. **Target Network Update**: After a specified number of steps, the target network is updated by copying the weights from the main network.

9. **Iteration**: Steps 3-8 are repeated for a specified number of episodes or until convergence.

## Trading Simulation

The trading simulation is performed using the trained Q-learning agent. The agent makes buy and sell decisions based on the current state of the market. The state is represented by a sliding window of price differences.

The agent's decisions are as follows:

- **Buy**: If the agent selects the buy action and there is sufficient funds, a unit of stock is purchased, and the inventory and balance are updated accordingly.

- **Sell**: If the agent selects the sell action and there is stock in the inventory, a unit of stock is sold, and the balance is updated based on the selling price.

- **Hold**: If the agent selects the hold action, no action is taken.

The simulation keeps track of the buying and selling states, total gains, investment percentage, and the final portfolio value.

## Usage

To use the Q-learning agent for stock trading:

1. Prepare the historical price data as a list of closing prices.

2. Set the initial parameters, such as the initial money, window size, and skip size.

3. Create an instance of the `Agent` class with the desired state size, window size, trend, skip size, batch size, learning rate, and device.

4. Call the `train` method to train the agent for a specified number of iterations, providing the necessary parameters such as the number of iterations, checkpoint frequency, and initial money.

5. Evaluate the agent's performance using the `buy` method, which simulates the trading process and returns the buy and sell states, total gains, investment percentage, and the final portfolio value.

6. Analyze the results and adjust the hyperparameters as needed to optimize the trading strategy.

## Requirements

- Python 3.x
- PyTorch
- NumPy

In [3]:
!pip3 install torch torchvision

Collecting torch
  Downloading torch-2.3.0-cp311-none-macosx_11_0_arm64.whl.metadata (26 kB)
Collecting torchvision
  Downloading torchvision-0.18.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting filelock (from torch)
  Downloading filelock-3.14.0-py3-none-any.whl.metadata (2.8 kB)
Collecting sympy (from torch)
  Downloading sympy-1.12-py3-none-any.whl.metadata (12 kB)
Collecting networkx (from torch)
  Downloading networkx-3.3-py3-none-any.whl.metadata (5.1 kB)
Collecting fsspec (from torch)
  Downloading fsspec-2024.3.1-py3-none-any.whl.metadata (6.8 kB)
Collecting mpmath>=0.19 (from sympy->torch)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading torch-2.3.0-cp311-none-macosx_11_0_arm64.whl (61.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 MB[0m [31m39.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hDownloading torchvision-0.18.0-cp311-cp311-macosx_11_0_arm64.whl (1.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━

In [12]:
from utils import *
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Configure Modeling Parameters and Fetch Data

Enter a ticker and date range you would like to build the model on.  This model takes a a single ticker's data.  Also enter a training size for the proportion of the data you want to include in your training set vs. your test set.

In [3]:
# stock configs
ticker = ['GOOG']
start_date = '2022-04-01'
end_date = '2024-04-05'

# model configs
train_size = 0.8

In [4]:
# Data Fetching
data = fetch_stock_data(ticker, start_date, end_date)[ticker[0]]
data.reset_index(drop=False, inplace=True)
data['Date'] = pd.to_datetime(data['Date']).dt.tz_localize(None)

print(data.shape)
included_days = len(data)
data.head()

(504, 8)


Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2022-04-01,140.009995,140.949997,138.796997,140.699997,23480000,0.0,0.0
1,2022-04-04,140.824493,144.043747,140.824493,143.642502,19076000,0.0,0.0
2,2022-04-05,143.399506,143.589996,140.943497,141.063004,19256000,0.0,0.0
3,2022-04-06,139.161499,139.848495,136.418106,137.175995,23574000,0.0,0.0
4,2022-04-07,136.617996,137.701508,134.857254,136.464996,19448000,0.0,0.0


In [63]:
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class Agent(nn.Module):
    LEARNING_RATE = 0.003
    BATCH_SIZE = 32
    LAYER_SIZE = 256
    OUTPUT_SIZE = 3
    EPSILON = 0.5
    DECAY_RATE = 0.005
    MIN_EPSILON = 0.1
    GAMMA = 0.99
    MEMORIES = deque()
    MEMORY_SIZE = 300

    def __init__(self, state_size, window_size, trend, skip):
        super().__init__()
        self.state_size = state_size
        self.window_size = window_size
        self.half_window = window_size // 2
        self.trend = trend
        self.skip = skip
        self.INITIAL_FEATURES = torch.zeros((4, self.state_size))
        self.lstm = nn.LSTM(input_size=self.state_size, hidden_size=self.LAYER_SIZE, num_layers=1, batch_first=True)
        self.dense = nn.Linear(self.LAYER_SIZE, self.OUTPUT_SIZE)
        self.optimizer = optim.Adam(self.parameters(), lr=self.LEARNING_RATE)
        self.criterion = nn.MSELoss()

    def _memorize(self, state, action, reward, new_state, dead, rnn_state):
        self.MEMORIES.append((state, action, reward, new_state, dead, rnn_state))
        if len(self.MEMORIES) > self.MEMORY_SIZE:
            self.MEMORIES.popleft()

    def _construct_memories(self, replay):
        states = np.array([a[0] for a in replay])
        new_states = np.array([a[3] for a in replay])
        init_values = ([a[-1] for a in replay]).detach().numpy()
        states_tensor = torch.tensor(states, dtype=torch.float32)
        new_states_tensor = torch.tensor(new_states, dtype=torch.float32)
        init_values_tensor = torch.tensor(init_values, dtype=torch.float32).view(1, -1, self.LAYER_SIZE)
        Q = self.forward(states_tensor, init_values_tensor)
        Q_new = self.forward(new_states_tensor, init_values_tensor)
        replay_size = len(replay)
        X = np.empty((replay_size, 4, self.state_size))
        Y = np.empty((replay_size, self.OUTPUT_SIZE))
        INIT_VAL = init_values
        for i in range(replay_size):
            state_r, action_r, reward_r, new_state_r, dead_r, rnn_memory = replay[i]
            target = Q[i].detach().numpy()
            target[action_r] = reward_r
            if not dead_r:
                target[action_r] += self.GAMMA * torch.max(Q_new[i]).item()
            X[i] = state_r
            Y[i] = target
        return torch.tensor(X, dtype=torch.float32), torch.tensor(Y, dtype=torch.float32), INIT_VAL

    def get_state(self, t):
        window_size = self.window_size + 1
        d = t - window_size + 1
        block = self.trend[d: t + 1] if d >= 0 else -d * [self.trend[0]] + self.trend[0: t + 1]
        res = []
        for i in range(window_size - 1):
            res.append(block[i + 1] - block[i])
        return np.array(res)

    def forward(self, x, hidden):
        out, hidden = self.lstm(x, (hidden.view(1, -1, self.LAYER_SIZE).contiguous(),
                             hidden.view(1, -1, self.LAYER_SIZE).contiguous()))
        out = self.dense(out[:, -1, :])
        return out, hidden

    def buy(self, initial_money):
        starting_money = initial_money
        states_sell = []
        states_buy = []
        inventory = []
        state = self.get_state(0)
        init_value = torch.zeros((1, 2 * self.LAYER_SIZE))
        for k in range(self.INITIAL_FEATURES.shape[0]):
            self.INITIAL_FEATURES[k, :] = torch.tensor(state)
        for t in range(0, len(self.trend) - 1, self.skip):
            action, last_state = self.forward(self.INITIAL_FEATURES.unsqueeze(0), init_value)
            action = torch.argmax(action, dim=1).item()
            init_value = last_state
            next_state = self.get_state(t + 1)

            if action == 1 and initial_money >= self.trend[t]:
                inventory.append(self.trend[t])
                initial_money -= self.trend[t]
                states_buy.append(t)
                print('day %d: buy 1 unit at price %f, total balance %f' % (t, self.trend[t], initial_money))

            elif action == 2 and len(inventory):
                bought_price = inventory.pop(0)
                initial_money += self.trend[t]
                states_sell.append(t)
                try:
                    invest = ((self.trend[t] - bought_price) / bought_price) * 100
                except:
                    invest = 0
                print(
                    'day %d, sell 1 unit at price %f, investment %f %%, total balance %f,'
                    % (t, self.trend[t], invest, initial_money)
                )

            new_state = torch.cat((torch.tensor(self.get_state(t + 1)).unsqueeze(0), self.INITIAL_FEATURES[:3, :]), dim=0)
            self.INITIAL_FEATURES = new_state
        invest = ((initial_money - starting_money) / starting_money) * 100
        total_gains = initial_money - starting_money
        return states_buy, states_sell, total_gains, invest

    def train(self, iterations, checkpoint, initial_money):
        for i in range(iterations):
            total_profit = 0
            inventory = []
            state = self.get_state(0)
            starting_money = initial_money
            init_value = torch.zeros((1, 1, self.LAYER_SIZE))
            for k in range(self.INITIAL_FEATURES.shape[0]):
                self.INITIAL_FEATURES[k, :] = torch.tensor(state)
            for t in range(0, len(self.trend) - 1, self.skip):
                if np.random.rand() < self.EPSILON:
                    action = np.random.randint(self.OUTPUT_SIZE)
                else:
                    action, (last_state, _) = self.forward(self.INITIAL_FEATURES.unsqueeze(0), init_value.view(1, -1, self.LAYER_SIZE))
                    action = torch.argmax(action, dim=1).item()
                    init_value = last_state

                next_state = self.get_state(t + 1)

                if action == 1 and starting_money >= self.trend[t]:
                    inventory.append(self.trend[t])
                    starting_money -= self.trend[t]

                elif action == 2 and len(inventory) > 0:
                    bought_price = inventory.pop(0)
                    total_profit += self.trend[t] - bought_price
                    starting_money += self.trend[t]

                invest = ((starting_money - initial_money) / initial_money)
                new_state = torch.cat((torch.tensor(self.get_state(t + 1)).unsqueeze(0), self.INITIAL_FEATURES[:3, :]), dim=0)
                self._memorize(self.INITIAL_FEATURES, action, invest, new_state,
               starting_money < initial_money, last_state.squeeze(0))
                self.INITIAL_FEATURES = new_state
                batch_size = min(len(self.MEMORIES), self.BATCH_SIZE)
                replay = random.sample(self.MEMORIES, batch_size)
                X, Y, INIT_VAL = self._construct_memories(replay)

                self.optimizer.zero_grad()
                pred, _ = self.forward(X, torch.tensor(INIT_VAL, dtype=torch.float32))
                cost = self.criterion(pred, Y)
                cost.backward()
                self.optimizer.step()

                self.EPSILON = self.MIN_EPSILON + (1.0 - self.MIN_EPSILON) * np.exp(-self.DECAY_RATE * i)

            if (i + 1) % checkpoint == 0:
                print('epoch: %d, total rewards: %f.3, cost: %f, total money: %f' % (i + 1, total_profit, cost.item(), starting_money))

In [65]:
close = data.Close.values.tolist()
trend = close

initial_money = 10000
learning_rate = 0.001
state_size = 30
window_size = 30
skip = 1
batch_size = 32

# Determine the device to use (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create an instance of the Agent
agent = Agent(state_size=state_size,
              window_size=window_size,
              trend=close,
              skip=skip)

In [66]:
iterations = 100
checkpoint = 10
initial_money = 10000
agent.train(iterations, checkpoint, initial_money)

UnboundLocalError: cannot access local variable 'last_state' where it is not associated with a value

In [25]:
# Evaluate the agent's performance
states_buy, states_sell, total_gains, invest = agent.buy(initial_money)

KeyError: 0

In [11]:
import plotly.graph_objects as go

starting_money = 10000

close = data['Close']
final_share_price = close[len(close) - 1]  # Final share price
total_portfolio_value = starting_money + total_gains  # Total portfolio value
total_gains = total_portfolio_value - starting_money

fig = go.Figure()

# Candlestick trace
fig.add_trace(go.Candlestick(x=data.index,
                             open=data['Open'],
                             high=data['High'],
                             low=data['Low'],
                             close=data['Close']))

# Buy signals trace
fig.add_trace(go.Scatter(x=[data.index[i] for i in states_buy],
                         y=[close[i] for i in states_buy],
                         mode='markers',
                         name='Buy Signals',
                         marker=dict(symbol='triangle-up', size=10, color='green')))

# Sell signals trace
fig.add_trace(go.Scatter(x=[data.index[i] for i in states_sell],
                         y=[close[i] for i in states_sell],
                         mode='markers',
                         name='Sell Signals',
                         marker=dict(symbol='triangle-down', size=10, color='red')))

# Set layout
fig.update_layout(
    title=f'Total Gains: {total_gains:.2f}, Total Portfolio Value: {total_portfolio_value:.2f}',
    xaxis_title='Date',
    yaxis_title='Price',
    template='plotly_dark',
    legend=dict(x=0, y=1, orientation='h')
)

fig.show()