Distributed stock price forecasting system to predict S&P 500 stock prices.

LeonardoEmili/stock-price-forecasting

Stock Price Forecasting

Neural stock price forecasting system using fundamental analysis and technical analysis to predict the trend of stocks from the S&P 500 index. The main contributions of this work are summarized as follows:

  • A first approach built with PyTorch Lightning as the learning framework, combining attention with Recurrent Neural Networks (RNNs). For further insights, read the dedicated report or the related notebook.
  • A distributed approach built with PyTorch, PySpark, and Petastorm, which uses a cluster of nodes to parallelize the computation. It builds on the first approach and adds Spark SQL queries, so the system can scale to large amounts of data. For an overview of the system, see the slides or the related notebook.
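To make the "attention over recurrent states" idea concrete, here is a minimal, framework-free sketch of dot-product attention pooling. It is an illustration only and is not taken from the notebooks: the function name `attention_pool`, the use of a single query vector, and the plain-list representation are all assumptions. The actual model is built with PyTorch Lightning and may use a different attention variant.

```python
import math


def attention_pool(hidden_states, query):
    """Collapse a sequence of RNN hidden states into one context vector.

    hidden_states: list of T vectors (one per time step), each of length D.
    query: vector of length D (hypothetical; in a trained model it would be learned).
    Returns (context, weights), where weights sum to 1 over the T time steps.
    """
    # Dot-product score between the query and every hidden state.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]

    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return context, weights


# Toy example: three time steps, two hidden units.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention_pool(states, query=[0.5, 0.5])
```

The weights indicate how much each time step contributes to the prediction; with a learned query, the model can emphasize the days that matter most for the forecast.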

Datasets

We use data from Kaggle's public challenges: a first dataset with financial reports of S&P 500 companies from 2003 to 2013, and a second dataset containing stock market data. By aligning the two datasets and removing outliers (refer to the notebooks to see how the alignment is performed), we obtain an enriched dataset that supports both fundamental and technical analysis.
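One common way to align periodic financial reports with daily prices is an "as-of" join: each price row receives the most recent report published on or before that day. The sketch below illustrates that idea with the standard library only. It is an assumption about the general technique, not the procedure used in the notebooks, and the function name `align_reports_to_prices`, the tuple layout, and the `eps` field are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def align_reports_to_prices(prices, reports):
    """Attach to each price row the latest report dated on or before it.

    prices:  list of (date, close) tuples, in any order.
    reports: list of (date, fundamentals_dict) tuples, in any order.
    Price rows that precede every report are dropped.
    """
    reports = sorted(reports, key=lambda r: r[0])
    report_dates = [d for d, _ in reports]

    aligned = []
    for day, close in sorted(prices, key=lambda p: p[0]):
        # Index of the last report whose date is <= day.
        idx = bisect_right(report_dates, day) - 1
        if idx < 0:
            continue  # no report available yet for this day
        aligned.append({"date": day, "close": close, **reports[idx][1]})
    return aligned


# Toy data (made up for illustration).
prices = [
    (date(2009, 6, 1), 25.0),
    (date(2010, 1, 4), 30.0),
    (date(2010, 4, 1), 29.0),
    (date(2010, 4, 15), 31.5),
]
reports = [
    (date(2010, 3, 31), {"eps": 1.2}),
    (date(2009, 12, 31), {"eps": 1.0}),
]
rows = align_reports_to_prices(prices, reports)
```

Joining only on past reports avoids leaking future fundamentals into earlier price rows, which matters when the enriched dataset is later used for forecasting.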

Results

The following benchmark compares the performance of our trading strategy algorithm across models (details in the slides, pages 14-16).

Model                   MSE     R2      Adjusted R2   Operation accuracy   Profit
DecisionTreeRegressor   0.078   0.852   -             55.45%               35.97%
RandomForestRegressor   0.104   0.803   -             57.01%               51.61%
LSTM                    0.021   0.939   0.897         56.52%               58.35%
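For readers unfamiliar with the regression metrics in the table, the sketch below shows the standard textbook definitions of MSE, R2, and adjusted R2 in plain Python. The helper names are hypothetical and the toy numbers are made up; the notebooks may compute these values through library functions instead. Operation accuracy and profit depend on the trading strategy described in the slides and are not reproduced here.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2_value, n_samples, n_features):
    """R2 penalized for the number of input features."""
    return 1 - (1 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)


# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

error = mse(y_true, y_pred)
score = r2(y_true, y_pred)
adjusted = adjusted_r2(score, n_samples=len(y_true), n_features=1)
```

Adjusted R2 is always at most R2 when at least one feature is used, which is consistent with the LSTM row, where the adjusted value is lower than the plain R2.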

How to train the distributed system?

To install and configure PySpark on your local machine, follow the instructions described here. Alternatively, clone the notebook and import it into Databricks as described here.

How to test the system?

For a quick, ready-to-use test, run the test/evaluate.py script, which evaluates the distributed system using pre-trained weights for the LSTM model. Alternatively, re-train the system with a model of your choice and use the new weights for the evaluation.

Project structure

.
├── data/                     # Stock prices and fundamental data
├── report/
│   ├── main.pdf              # Project report for the dlai-2021 course
│   ├── main.tex
│   └── ...
├── test/
│   ├── data/                 # Model weights and test data
│   ├── evaluate.py           # Evaluation script
│   └── ...
├── dist_forecasting.ipynb    # PySpark distributed stock prediction system
├── forecasting.ipynb         # Stock prediction system
├── environment.yml           # Training environment
└── ...