# COGS 118B - Final Project

# Insert title here

## Group members

- Ryan Chen
- Nicholas Gao
- Matthew Miyagishima

# Abstract 

The goal of our project is to design a stock trading agent that interacts with historical stock data and learns optimal trading strategies using Markov Decision Processes (MDPs) and Reinforcement Learning (RL). We will use historical stock data from Yahoo Finance, accessed through the yfinance Python package. The dataset stores key features such as Opening Price, Highest Price, Lowest Price, Closing Price, Trading Volume, and Date, all measured daily. First, we will prepare the data by cleaning missing values and normalizing key features to ensure consistency. Then we will train an agent to make buy, sell, or hold decisions based on past market trends using reinforcement learning algorithms such as Q-Learning and Monte Carlo simulations. The performance of the agent will be evaluated using ...

# Background

The stock market is a highly dynamic environment influenced by various factors, making it challenging to develop reliable trading strategies. Traditional rule-based approaches are often too rigid and fail to adapt to changing market conditions. Recently, machine learning techniques, particularly **Reinforcement Learning (RL)**, have become a popular tool for financial applications due to their ability to learn optimal strategies through direct interaction with the environment.

In this project, we aim to build a **stock market trading agent** that leverages **Markov Decision Processes (MDPs)** as the underlying framework and applies **Q-Learning** and **Monte Carlo methods** to learn an optimal trading policy. The agent will use historical stock price data to simulate trading decisions and learn when to buy, sell, or hold a stock to maximize long-term profitability. We will focus on backtesting the agent’s strategy on historical data to assess its performance in a simulated environment.


# Problem Statement

The objective of this project is to design a stock market trading agent that interacts with historical stock data and learns to optimize its trading strategy using **Markov Decision Processes (MDPs)**. The problem can be modeled as an MDP with the following components:

- **State Space:** The state represents market conditions, derived from technical indicators such as recent price movements, moving averages, and volatility measures.
- **Action Space:** The agent can choose one of three actions at each time step:
  - **Buy:** Purchase a fixed quantity of the stock.
  - **Sell:** Sell the currently held stock.
  - **Hold:** Take no action and maintain the current position.
- **Reward Function:** The reward at each step is the change in the portfolio value after taking an action, incentivizing profitable trades while penalizing losses or excessive trading.
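
The three MDP components above can be sketched as a single transition function. This is a minimal illustration, not our final environment: the action encoding, the `Portfolio` class, the `step` function, and the fixed trade size of one share are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical action encoding and trade size, chosen only for this sketch.
BUY, SELL, HOLD = 0, 1, 2
TRADE_SIZE = 1  # assumed "fixed quantity" purchased on each buy


@dataclass
class Portfolio:
    cash: float
    shares: int = 0

    def value(self, price: float) -> float:
        """Mark-to-market portfolio value at the given price."""
        return self.cash + self.shares * price


def step(portfolio: Portfolio, action: int, price: float, next_price: float) -> float:
    """Apply one action at `price` and return the reward, defined as the
    change in portfolio value once the price moves to `next_price`."""
    value_before = portfolio.value(price)
    if action == BUY and portfolio.cash >= TRADE_SIZE * price:
        portfolio.cash -= TRADE_SIZE * price
        portfolio.shares += TRADE_SIZE
    elif action == SELL and portfolio.shares > 0:
        # Sell the currently held stock: liquidate the whole position.
        portfolio.cash += portfolio.shares * price
        portfolio.shares = 0
    # HOLD, or a trade that is not feasible, leaves the portfolio unchanged.
    return portfolio.value(next_price) - value_before
```

For example, buying one share at 10 with 100 in cash and then seeing the price rise to 12 yields a reward of 2, while holding an empty position always yields 0.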

We will train the agent using two reinforcement learning approaches:
1. **Monte Carlo Methods** for episodic policy evaluation and learning from full episodes of simulated trading.
2. **Q-Learning**, a model-free method, to improve the agent’s strategy by updating Q-values for each state-action pair through iterative exploration.
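
For reference, the textbook update rules behind these two approaches can be written as follows. The step size $\alpha$ and discount factor $\gamma$ are assumed hyperparameters that this proposal does not fix, $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$, and $R_t$ denotes the discounted return from step $t$ of an episode with $N$ steps.

Monte Carlo (update after the full episode):

$$R_t = \sum_{k=0}^{N-t-1} \gamma^{k}\, r_{t+k+1}, \qquad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_t - Q(s_t, a_t) \right]$$

Q-Learning (update after every step):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

The practical difference is that Monte Carlo waits for the observed return of a whole simulated trading period, whereas Q-Learning bootstraps from its current estimate of the next state's value.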

Performance will be evaluated using key metrics, including cumulative return, Sharpe ratio, and maximum drawdown.
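
A minimal sketch of how the Sharpe ratio and maximum drawdown could be computed from a series of daily portfolio values. The function names, the 252-day annualization constant, and the default risk-free rate of zero are assumptions made for this example, not settled design choices.

```python
import math
from statistics import mean, stdev

# Common annualization convention; an assumption for this sketch.
TRADING_DAYS_PER_YEAR = 252


def daily_returns(values):
    """Simple day-over-day returns of a portfolio value series."""
    return [values[i] / values[i - 1] - 1.0 for i in range(1, len(values))]


def sharpe_ratio(values, risk_free_daily=0.0):
    """Annualized Sharpe ratio: mean excess daily return over its sample
    standard deviation, scaled by sqrt(trading days per year).
    Needs at least three values (two returns)."""
    excess = [r - risk_free_daily for r in daily_returns(values)]
    sd = stdev(excess)
    if sd == 0:
        return 0.0  # no variation in returns, so the ratio is undefined; report 0
    return math.sqrt(TRADING_DAYS_PER_YEAR) * mean(excess) / sd


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst
```

For the value series `[100, 120, 90, 108]`, the running peak is 120 and the trough after it is 90, so the maximum drawdown is 0.25.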


# Data

We will use historical stock price data from the following sources:

1. **Yahoo Finance API** ([https://finance.yahoo.com](https://finance.yahoo.com))
   - Provides daily and intraday stock data.
   - Variables: `Date`, `Open`, `High`, `Low`, `Close`, `Adjusted Close`, `Volume`.

2. **S&P 500 Historical Data**
   - Used as a benchmark for evaluating the trading agent’s performance.

### Example Variables (Feature Set):
- Price data (`Open`, `High`, `Low`, `Close`)
- **Technical Indicators**: Moving averages (5-day, 20-day, 50-day), Relative Strength Index (RSI), Bollinger Bands, Momentum, Volatility, and MACD (Moving Average Convergence Divergence).
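
As an illustration of the feature set, the sketch below computes three of the listed indicators from a plain list of closing prices using only the standard library. The function names and default windows are assumptions, and the RSI shown uses simple averages of gains and losses rather than Wilder's smoothing, so its values can differ from those reported by charting tools.

```python
from statistics import mean, pstdev


def sma(prices, window):
    """Simple moving average; None until `window` prices are available."""
    return [
        mean(prices[i - window + 1 : i + 1]) if i >= window - 1 else None
        for i in range(len(prices))
    ]


def bollinger_bands(prices, window=20, num_std=2.0):
    """(lower, middle, upper) per day; None until `window` prices exist."""
    bands = []
    for i in range(len(prices)):
        if i < window - 1:
            bands.append(None)
            continue
        chunk = prices[i - window + 1 : i + 1]
        mid, sd = mean(chunk), pstdev(chunk)
        bands.append((mid - num_std * sd, mid, mid + num_std * sd))
    return bands


def rsi(prices, window=14):
    """RSI from simple averages of the last `window` price changes
    (a simplification; not Wilder's smoothed version)."""
    out = [None] * len(prices)
    changes = [prices[i] - prices[i - 1] for i in range(1, len(prices))]
    for i in range(window, len(prices)):
        recent = changes[i - window : i]
        avg_gain = sum(c for c in recent if c > 0) / window
        avg_loss = sum(-c for c in recent if c < 0) / window
        # By convention, RSI is 100 when there were no losses in the window.
        out[i] = 100.0 if avg_loss == 0 else 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
    return out
```

The leading `None` entries mark days where the window is not yet full; those rows would need to be dropped or masked before the features are fed to the agent.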

### Data Preprocessing:
- Handle missing values and normalize the features to ensure model stability.
- Generate state representations by calculating technical indicators.
- Define the reward function as the percentage change in portfolio value after each action.
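
The preprocessing steps above might look like the following sketch. The helper names are hypothetical, forward-filling is only one possible way to handle missing values, and fitting the normalization statistics on the training portion alone is an assumption we make here to avoid leaking information from the test period.

```python
from statistics import mean, pstdev


def forward_fill(values):
    """Replace None with the most recent observed value.
    Leading None entries stay None because nothing precedes them."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def zscore(values, train_size):
    """Standardize a feature using the mean and standard deviation of
    the first `train_size` points only (assumed to be the training split)."""
    train = values[:train_size]
    mu, sigma = mean(train), pstdev(train)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(v - mu) / sigma for v in values]


def pct_change_reward(value_before, value_after):
    """Reward as the percentage change in portfolio value after an action."""
    return (value_after - value_before) / value_before
```

For instance, a portfolio that moves from 100 to 105 after an action gives a reward of 0.05.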


# Proposed Solution

The solution to the problem statement above will be agents trained to trade stocks. Each agent will be trained to buy, hold, or sell stocks in its portfolio to maximize its returns. Using two different reinforcement learning approaches, we will evaluate how the trained agents behave differently. The agents will be trained on the data described above (price data and technical indicators) to make optimal stock trading decisions. While we are not considering another model as a benchmark, we will benchmark our agents against historical averages of the S&P 500.

**Monte Carlo Methods**

The agent will simulate the entire trading period using historical data of stocks in the training set to calculate reward values for actions taken at different states, as well as generate an optimal policy to take advantage of bullish or bearish markets.
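
One way to turn a simulated trading period into value estimates is first-visit Monte Carlo, sketched below. The function name, the `(state, action, reward)` episode format, and the default discount factor are assumptions for illustration; the states and actions are arbitrary hashable labels here.

```python
from collections import defaultdict


def first_visit_mc_update(episode, q_table, visit_counts, gamma=0.99):
    """Update `q_table` in place from one finished episode.

    `episode` is a list of (state, action, reward) tuples in time order.
    Each (state, action) pair is updated once per episode, using the
    discounted return that follows its first occurrence."""
    # Record the time step at which each pair first appears.
    first_seen = {}
    for t, (state, action, _) in enumerate(episode):
        first_seen.setdefault((state, action), t)

    ret = 0.0
    # Walk backwards so the discounted return accumulates in one pass.
    for t in range(len(episode) - 1, -1, -1):
        state, action, reward = episode[t]
        ret = reward + gamma * ret
        key = (state, action)
        if first_seen[key] == t:
            visit_counts[key] += 1
            # Incremental mean of all first-visit returns seen so far.
            q_table[key] += (ret - q_table[key]) / visit_counts[key]


# Tables default to 0 for pairs that have not been visited yet.
q_table = defaultdict(float)
visit_counts = defaultdict(int)
```

A greedy policy can then be read off the table by picking, in each state, the action with the highest estimated value.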

**Q-Learning**

Q-Learning is an algorithm that learns the optimal action at each state, and the model simply needs to follow the selected actions. We will implement this using a hashtable where keys are each trading day and the values are the actions to take.
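
A minimal sketch of a hashtable-backed Q-Learning update with epsilon-greedy exploration. Note that this sketch keys the table by `(state, action)` pairs, which is the textbook tabular form, rather than by trading day; the function names, action labels, and default values of `alpha`, `gamma`, and `epsilon` are all assumptions for illustration.

```python
import random
from collections import defaultdict

ACTIONS = ("buy", "sell", "hold")  # hypothetical action labels


def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability `epsilon`, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-Learning step, applied in place to `q_table`."""
    # A terminal transition has no future value to bootstrap from.
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])


# Unvisited (state, action) pairs start with a value of 0.
q_table = defaultdict(float)
```

Keying by state rather than by date lets values learned on one day carry over to other days with similar market conditions.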

# Evaluation Metrics

The main evaluation metric will be the percentage by which the agent grows or shrinks its portfolio over the test period. We will give the agent a starting portfolio at the beginning of the test period and evaluate the portfolio's worth daily throughout testing to measure how well the agent is doing. We believe this is a good evaluation metric because the main goal of the agent is to maximize gains through buying, holding, and selling stocks.

A mathematical representation of this metric would be

$G_T = \frac{V_T - V_0}{V_0}$

Where
- $G_T$ is the gain/loss on day $T$
- $V_T$ is the value of the portfolio on day $T$
- $V_0$ is the value of the portfolio at the beginning of the test period
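
The metric can be computed for every day of the test period in one pass. This is a small sketch; the function name `cumulative_gain` is an assumption for illustration.

```python
def cumulative_gain(portfolio_values):
    """Return G_T for each day T, given daily portfolio values V_0, V_1, ..."""
    v0 = portfolio_values[0]
    return [(v - v0) / v0 for v in portfolio_values]
```

For example, a portfolio worth 100 at the start, 110 on day 1, and 95 on day 2 has gains of 0, 0.10, and -0.05 respectively.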

# Results

In [1]:
%load_ext autoreload
%autoreload 2

### Monte Carlo Simulations

In [2]:
from training_env import StockTrainingEnv
from monte_carlo import monte_carlo_train, monte_carlo_eval

In [None]:
# Set Up Stock Trading Environment
env = StockTrainingEnv(tickers=['AAPL', 'TSLA', 'META', 'NVDA', 'GME'])

In [None]:
# Train the Monte Carlo Agent
monte_carlo_q_table = monte_carlo_train(env)

In [None]:
# Evaluating the Agent Performance
monte_carlo_evaluation = monte_carlo_eval(env, monte_carlo_q_table)

### Policy Gradient

In [6]:
from policy_gradient import PolicyNetwork, PolicyGradientAgent, train_policy_gradient

In [7]:
# Define dimensions for the agent
state_dim = env.observation_space.shape[1]
num_tickers = len(env.tickers)
possible_trades = env.possible_trades

In [8]:
# Create agent
agent = PolicyGradientAgent(
    state_dim=state_dim,
    num_tickers=num_tickers,
    possible_trades=possible_trades
)

In [None]:
# Train agent
train_policy_gradient(env, agent, episodes=1000, gamma=0.99, lr=0.001)

### A2C

In [None]:
from a2c_model import ActorCriticNetwork, A2CAgent, a2c_train, a2c_eval

In [None]:
# Create Agent and train
a2c_agent = A2CAgent(env)
a2c_agent_train = a2c_train(env, a2c_agent)

In [None]:
# Evaluate A2C agent
a2c_agent_eval = a2c_eval(env, a2c_agent)

# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.


### Limitations

Are there any problems with the work?  For instance would more data change the nature of the problem? Would it be good to explore more hyperparams than you had time for?   


### Future work
Looking at the limitations and/or the toughest parts of the problem and/or the situations where the algorithm(s) did the worst... is there something you'd like to try to make these better.

### Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination.

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

