# COGS 188 - Project Proposal

# Project Description

You have the choice of doing either (1) an "AI solves a problem" style project or (2) running a Special Topics class on a topic of your choice.  This repo assumes you want to do (1).  If you want to do (2), you should fill out the Gradescope proposal for that instead of using this repo.

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like 8-Queens or a small Traveling Salesman Problem or similar
- If it's the kind of problem (e.g., RL) that interacts with a simulator or live task, then the problem will have a reasonably complex action space. For instance, a Wumpus World kind of thing with a 9x9 grid is definitely too small.  A simulated mountain car with a less complex 2-d road and simplified dynamics seems like a fairly low achievement level.  A more complex 3-d mountain car simulation with large extent and realistic dynamics, sure sounds great!
- If it's the kind of problem that uses a dataset, then the dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.
- The project must include some elements we talked about in the course
- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

# Names
- Lucas Fasting
- Yash Sharma
- Bryan Nguyen


# Abstract 
The goal of this project is to predict future stock prices from historical price data and financial indicators. The dataset will consist of more than 10,000 observations and over 10 variables, including stock prices, trading volumes, and technical indicators such as moving averages and the Relative Strength Index (RSI). We will clean and format the data so that it captures detailed trends and market behaviors. Our project will involve feature selection to identify the most relevant predictors, followed by the application of time series forecasting models. We will also use Monte Carlo simulations to model the uncertainty and variability in stock price predictions. The performance of the predictive models will be evaluated using Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). Success will be measured by how accurately our models predict stock prices on data they were not trained on.

# Background

For years, hedge funds have dominated the stock market, dedicating large amounts of capital to due diligence analysis in search of potential investment opportunities. Not only do they have the manpower, but they also have the mathematical algorithms to automate trades that execute buy and sell positions with extreme accuracy <a name="fach"></a>[<sup>[1]</sup>](#fachnote). The use of these powerful algorithms is called quant trading. Millions of dollars have been invested into these algorithms, giving hedge funds an edge on the market: they often identify moves before they happen and capitalize on the resulting gains. Over time, however, historical stock market data has accumulated, giving retail investors the opportunity to capture and analyze this data for their own gain. Retail investors will always be dealt the losing hand here, but as computing power has become more accessible to the general public, investors at home are experimenting with their own algorithms using different machine learning techniques.

The application of machine learning to stock market prediction began gaining traction in the late 20th and early 21st centuries. Early models, such as linear regression and decision trees, provided helpful initial insights but often struggled with the market's high volatility and with identifying patterns within the market <a name="ballings"></a>[<sup>[2]</sup>](#ballingsnote). To address these challenges, researchers began employing more advanced techniques, such as Support Vector Machines (SVM) and ensemble methods like Random Forests and Gradient Boosting Machines (GBM), which could better capture complex relationships between features.

The introduction of time series forecasting models marked a significant advancement in stock market prediction. Models like ARIMA (AutoRegressive Integrated Moving Average) became popular for their ability to model temporal patterns such as trends and autocorrelation. ARIMA models, however, often required extensive domain knowledge to identify appropriate parameters and could struggle with capturing long-term dependencies <a name="zhang"></a>[<sup>[3]</sup>](#zhangnote). Stock market prediction has changed further with recent developments in deep learning. Long Short-Term Memory (LSTM) networks, a type of recurrent neural network (RNN), have shown considerable improvement because of their capacity to retain long-term trends and efficiently handle sequential data.

Accurate predictions about the stock market have many uses. They can assist financial institutions and investors in making well-informed decisions, optimizing trading tactics, and most importantly managing risks. Our goal as a group is not to make a program that will trade for you or take over the due diligence process. We want to create a tool that we ourselves would use and provide to anyone who would like help building conviction before entering a position in the market, using the tools we have learned in this class as well as other machine learning techniques.

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

The problem we are solving is the lack of accessible AI and ML tools for predicting stock prices, which remain very difficult to forecast. Our problem involves numerical stock data that is quantifiable and can be analyzed with different machine learning techniques. It is measurable through two clear metrics, RMSE and MAPE. The process can be repeated for different stocks, time periods, and market conditions, making it replicable across multiple scenarios. Our objective is to accurately predict prices from historical data, which we can test by training on earlier time periods and comparing the predictions against more recent periods.

# Data

We are looking for a dataset of historical stock prices on which predicted and actual values can be compared using metrics like RMSE (Root Mean Square Error) and MAPE (Mean Absolute Percentage Error). Additionally, the dataset should include a diverse range of companies and industries to test the robustness and generalizability of our predictive models. We aim to analyze the temporal patterns in stock prices using advanced machine learning techniques such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) to provide accurate stock price predictions.

- **Link/reference to obtain it**: 
  - The dataset can be obtained from [Kaggle: Intro to Recurrent Neural Networks LSTM GRU](https://www.kaggle.com/code/thebrownviking20/intro-to-recurrent-neural-networks-lstm-gru).

- **Description of the size of the dataset (# of variables, # of observations)**: 
  - The dataset includes historical stock prices for multiple companies, with variables such as date, open, high, low, close, volume, and adjusted close. It encompasses thousands of observations over several years.

- **What an observation consists of**: 
  - Each observation represents the stock price data for a single trading day of a specific company.

- **What some critical variables are, how they are represented**: 
  - `Date`: The trading date (e.g., 2021-05-17).
  - `Open`: The opening price of the stock (e.g., 135.67).
  - `High`: The highest price of the stock during the trading day (e.g., 137.45).
  - `Low`: The lowest price of the stock during the trading day (e.g., 133.21).
  - `Close`: The closing price of the stock (e.g., 136.50).
  - `Volume`: The number of shares traded (e.g., 1250000).
  - `Adj Close`: The adjusted closing price accounting for dividends and stock splits (e.g., 136.50).

- **Any special handling, transformations, cleaning, etc. will be needed**: 
  - Data cleaning to handle missing values and outliers.
  - Normalization or scaling of stock prices to ensure consistent range for model training.
  - Splitting the dataset into training, validation, and test sets to evaluate model performance.
  - Feature engineering to create additional variables such as moving averages, technical indicators, and lagged features.
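The handling steps listed above could look roughly like the following sketch. It assumes a pandas DataFrame with `Date` and `Close` columns as described earlier; the function names, window lengths, split fractions, and the synthetic random-walk data are illustrative assumptions, not part of the actual dataset. The RSI here uses simple rolling averages of gains and losses, which is one common variant.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame, ma_window: int = 5, rsi_window: int = 14) -> pd.DataFrame:
    """Add a moving average, an RSI, and a one-day lag of Close (illustrative choices)."""
    out = df.sort_values("Date").copy()
    out["Close"] = out["Close"].ffill()  # simple missing-value handling
    out["MA"] = out["Close"].rolling(ma_window).mean()
    delta = out["Close"].diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    out["RSI"] = 100 - 100 / (1 + gain / loss)
    out["Close_lag1"] = out["Close"].shift(1)
    # Rows without enough history for the rolling windows are dropped.
    return out.dropna().reset_index(drop=True)


def chronological_split(df: pd.DataFrame, train_frac: float = 0.7, val_frac: float = 0.15):
    """Split by time order (no shuffling) so later data never leaks into training."""
    n = len(df)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]


def minmax_scale(df: pd.DataFrame, cols, lo: pd.Series, hi: pd.Series) -> pd.DataFrame:
    """Scale columns using bounds computed on the training set only."""
    out = df.copy()
    out[cols] = (out[cols] - lo) / (hi - lo)
    return out


# Synthetic random-walk prices standing in for one company's history (assumption).
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "Date": pd.bdate_range("2021-01-04", periods=300),
    "Close": 100 + np.cumsum(rng.normal(0, 1, size=300)),
})

features = add_features(raw)
train, val, test = chronological_split(features)

cols = ["Close", "MA", "RSI", "Close_lag1"]
lo, hi = train[cols].min(), train[cols].max()
train_scaled = minmax_scale(train, cols, lo, hi)
val_scaled = minmax_scale(val, cols, lo, hi)
test_scaled = minmax_scale(test, cols, lo, hi)
```

Fitting the scaling bounds on the training split alone is a deliberate choice here: validation and test values may fall slightly outside [0, 1], but no information from later periods is used during training.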

# Proposed Solution

ARIMA:

The ARIMA model will be used on the stock dataset. It models each observation as a function of its own lagged values (after differencing) and of lagged residual errors. The model is designed for univariate time series, which makes it a natural baseline for predicting a single stock's closing price.

Steps to ARIMA:
1. Load data
2. Preprocess data
3. Split into training/testing
4. Model fitting
5. Use fitted model to make predictions
6. Evaluate performance
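As a rough illustration of the fitting and prediction steps, the sketch below implements a simplified ARIMA(p, 1, 0), that is, an autoregressive model on first differences with no moving-average term, estimated by least squares in NumPy. This is a stand-in chosen to keep the example self-contained; in practice a library such as statsmodels (its `ARIMA` class) would likely be used instead. The function names, the order `p = 2`, and the synthetic price series are assumptions.

```python
import numpy as np


def fit_ar_on_differences(series: np.ndarray, p: int = 2) -> np.ndarray:
    """Fit ARIMA(p, 1, 0) by least squares: an AR(p) model on first differences.

    Returns coefficients [intercept, phi_1, ..., phi_p].
    """
    d = np.diff(series)
    # Row for time t holds [1, d[t-1], ..., d[t-p]] and is used to predict d[t].
    lag_columns = [d[p - k : len(d) - k] for k in range(1, p + 1)]
    X = np.column_stack([np.ones(len(d) - p)] + lag_columns)
    y = d[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def forecast(series: np.ndarray, coef: np.ndarray, steps: int) -> np.ndarray:
    """Roll the fitted model forward, feeding predicted differences back in."""
    p = len(coef) - 1
    diffs = list(np.diff(series)[-p:])
    level = float(series[-1])
    preds = []
    for _ in range(steps):
        lags = diffs[::-1][:p]  # most recent difference first
        next_diff = float(coef[0] + np.dot(coef[1:], lags))
        level += next_diff  # undo the differencing
        preds.append(level)
        diffs.append(next_diff)
    return np.array(preds)


# Synthetic prices with a small upward drift (assumption, not real market data).
rng = np.random.default_rng(1)
prices = 100 + np.cumsum(0.1 + rng.normal(0, 1, size=250))

split = int(len(prices) * 0.8)
train_prices, test_prices = prices[:split], prices[split:]

coef = fit_ar_on_differences(train_prices, p=2)
predictions = forecast(train_prices, coef, steps=len(test_prices))
```

The forecast here is a multi-step one made from the end of the training period; re-fitting or updating the model as each new day arrives (walk-forward evaluation) is another option the project could consider.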

LSTM:

The LSTM model will allow us to capture temporal dependencies within the data. It can learn patterns over long periods of time, which suits stock prices since many years of price history are available.

Steps to LSTM:
1. Load data
2. Preprocess data
3. Split into training/testing
4. Create LSTM network
5. Train and make predictions
6. Evaluate performance
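Before an LSTM can be trained, the preprocessing step has to turn the price series into fixed-length input windows with a next-day target. The sketch below shows only that windowing step in NumPy; it does not build or train a network. The function name `make_windows` and the 30-day lookback are assumptions for illustration.

```python
import numpy as np


def make_windows(values, lookback: int = 30):
    """Turn a 1-D series into LSTM-style inputs and next-step targets.

    Returns X with shape (samples, lookback, 1) and y with shape (samples,),
    where y[i] is the value immediately following window X[i].
    """
    values = np.asarray(values, dtype=float)
    X = np.stack([values[i : i + lookback] for i in range(len(values) - lookback)])
    y = values[lookback:]
    return X[..., np.newaxis], y


# Synthetic scaled prices standing in for a real, already-normalized series (assumption).
rng = np.random.default_rng(2)
scaled_close = rng.random(200)

X, y = make_windows(scaled_close, lookback=30)
print(X.shape, y.shape)
```

The trailing axis of size 1 is the feature dimension. Recurrent layers in common frameworks, such as Keras `LSTM` or PyTorch `nn.LSTM` with `batch_first=True`, take input in this (samples, timesteps, features) layout, and more features (for example volume or RSI) would widen that last axis.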

# Evaluation Metrics

RMSE: 

This method will allow us to see the differences between predicted and actual values. It's the square root of the average squared differences between the two values, and the lower it is the better the performance of the model.

$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

MAPE:

This method will allow us to view the accuracy of the predictions as a percentage value. This value is calculated by taking the average absolute percentage difference between the predicted and actual values. The lower the score is, the better the performance of the model.

$$ \text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$ 

By combining these two metrics, we will be able to measure our performance accurately. This will help ensure that our model provides reliable predictions to the best of its ability and let us assess whether it can be used as a financial tool in the market.
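The two formulas above translate directly into a few lines of NumPy. This is a minimal sketch; the function names and the example values are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error, in the same units as the prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred) -> float:
    """Mean Absolute Percentage Error, in percent. Undefined if any y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100 * np.mean(np.abs((y_true - y_pred) / y_true)))


# Illustrative values: errors of 10 and 20 on prices of 100 and 200.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]

print(rmse(actual, predicted))
print(mape(actual, predicted))
```

In this example both predictions are off by 10% of the actual price, so MAPE treats them equally, while RMSE is pulled toward the larger absolute error on the higher-priced observation. That difference is one reason to report both metrics when comparing stocks with very different price levels.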

# Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination. Get creative!

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

There are some ethical concerns around the use of this data. Although historical stock prices are generally public, there may still be terms of use or regulations attached to the sources we take this information from. Within the finance world, there exist various standards to govern ethical conduct. This tool's predictions should not be used to influence the price of stocks, for example by publicly posting the results, and users of the model should be aware of its limitations, because it is not a fully comprehensive or accurate predictor of stocks.

# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...
* We will divide work up evenly among teammates and communicate via Discord in order to get all work completed in time.
* Everyone will do their part equally.
* We will try to stick to the timeline proposed.
* Before every deadline we will meet as a team to clean up anything that needs to be turned in.

# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/02  |  3 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 5/09  |  2 PM |  Do background research on topic (Everyone) | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 5/14  | 3 PM  | Edit, finalize, and submit proposal (Everyone); Search for datasets | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 5/21  | 6 PM  | Import & Wrangle Data, do some EDA (Everyone) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 5/28  | 12 PM  | Finalize wrangling/EDA; Begin programming for project (Everyone) | Discuss/edit project code; Complete project |
| 6/04  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Everyone)| Discuss/edit full project |
| 6/12  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes

<a name="fachnote"></a>1.[^](#fach): Fach, F. (n.d.). "Advantages of Algorithms in Decision Making for Hedge Funds." *LinkedIn*. https://www.linkedin.com/pulse/advantages-algorithms-decision-making-hedge-funds-fabian-fach/


<a name="ballingsnote"></a>2.[^](#ballings): Ballings, M., Van den Poel, D., Hespeels, N., & Gryp, R. (2015). "Evaluating Multiple Classifiers for Stock Price Direction Prediction." *Expert Systems with Applications*, 42(20), 7046-7056. https://www.sciencedirect.com/science/article/abs/pii/S0925231201007020

<a name="zhangnote"></a>3.[^](#zhang): Zhang, G. P. (2003). "Time Series Forecasting Using a Hybrid ARIMA and Neural Network Model." *Neurocomputing*, 50, 159-175. https://www.sciencedirect.com/science/article/abs/pii/S0925231201007020