Kaggle Case: January 2022

David Pacheco Aznar

Daily News for Stock Market Prediction

The dataset was extracted from: Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved [11/24/2021] from Daily News for DJIA Prediction


To run these notebooks, a requirements.txt file is provided.

With conda, the environment can be set up as follows: conda create --name djia, then pip3 install -r requirements.txt

That should be enough to have the environment ready to run the scripts.

INFO: If Colab is being used, uncomment the requirements.txt line talib-binary==0.4.19 and comment out ta-lib==0.4.19. Otherwise, the file should work as it is.


A demo can be found in the Demo Jupyter notebook. Example results are:

Screenshot from 2022-01-09 20-17-38

Dataset Explanation and Objective

Find all the Exploratory Data Analysis: Open In Colab

In short, the objective is to predict whether the stock market will go up or down compared to the previous day. For that, the dataset author provides the top 25 Reddit headlines for every day from 2008-06-08 to 2016-07-01 and a label stating whether the market went up or down. In addition, the author provides a time series with the top 25 news (even on non-trading days) and a market time series extracted from Yahoo Finance from 2008-08-08 to 2016-07-01. The labels are encoded as 1 if the market goes up or stays the same, and 0 if it goes down. The headlines are not sorted by relevance; they are just the top 25 news in no particular order. The market time series is in OHLCV format, that is: Open, High, Low, Close, Volume.
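The label encoding described above can be sketched in a few lines. This is a minimal illustration, assuming the label is derived from consecutive closing prices; the function name direction_labels and the sample prices are hypothetical and not taken from the notebooks.

```python
import numpy as np

def direction_labels(close):
    """Return 1 where the close is >= the previous day's close, else 0.

    The first day has no previous close, so it gets no label.
    """
    close = np.asarray(close, dtype=float)
    return (close[1:] >= close[:-1]).astype(int)

# Hypothetical closing prices for five consecutive trading days.
close = [100.0, 101.5, 101.5, 99.0, 100.2]
print(direction_labels(close).tolist())  # [1, 1, 0, 1]
```

Note that an unchanged close is labelled 1, matching the "goes up or stays the same" rule.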




Since we have text indexed to a time series, the first step was to clean the texts. Three cleaning methods have been used:

  1. Find ngrams (Gensim), detect nouns, verbs, adverbs, adjectives and proper nouns (spaCy)
  2. Simple cleaning (basic preprocessing for the BERT embedder)
  3. Cleaning that detects nouns, verbs, adverbs, adjectives and proper nouns, but ignores ngrams

These preprocessing variants have been used to find relations between words/n-grams and stock market moves, with relatively good results such as:

Ngrams that only appear when the market goes up:


Ngrams that only appear when the market goes down:
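Lists like the two above can be obtained with a set difference between the n-grams seen on up days and those seen on down days. The sketch below is only an illustration: it uses plain sliding-window bigrams instead of the Gensim phrase detection mentioned earlier, and the headlines, labels and function names are made up.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exclusive_ngrams(headlines, labels, n=2):
    """Return (n-grams seen only on up days, n-grams seen only on down days)."""
    up, down = set(), set()
    for text, label in zip(headlines, labels):
        grams = ngrams(text.lower().split(), n)
        (up if label == 1 else down).update(grams)
    return up - down, down - up

# Hypothetical headlines with their market labels (1 = up, 0 = down).
headlines = [
    "oil prices rise again",
    "oil prices fall sharply",
    "markets rally as oil prices rise",
]
labels = [1, 0, 1]

only_up, only_down = exclusive_ngrams(headlines, labels)
print(sorted(only_up))
print(sorted(only_down))
```

Here "oil prices" appears on both kinds of days, so it is dropped from both sets.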


Then, these preprocessing variants (simple cleaning for BERT, and ngrams) have been used to find topics in the news. After extensive exploration, no trivial features have been found to be relevant. Hence the idea of clustering all news using different embeddings, in order to find topics in an unsupervised way. The optimal number of topics has been found using an HDP (Hierarchical Dirichlet Process). The result is a total of 150 topics, with the best silhouette score (0.39) achieved with BERT embeddings and dimensionality reduction with UMAP.

The clustering obtained with BERT + UMAP + preprocessing with spaCy: bert_umap

Time Series

The time series presents a clear upward trend until 2015. The seasonal component is close to 0. However, by using fractional differentiation (López de Prado), a differentiation factor of only 0.6 applied to the original adjusted close price series is enough to pass the Dickey-Fuller test with a p-value lower than 0.005. Fractional Differentiation vs p-value: fracdiff
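To make the idea concrete, here is a minimal sketch of fixed-window fractional differentiation. It uses the standard binomial weights of (1 - B)^d; the function names and the window length are assumptions for illustration and are not the code used in the notebooks. The stationarity test itself is not included.

```python
import numpy as np

def frac_diff_weights(d, size):
    """Weights of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series, d, window):
    """Fractionally differentiate a series using a fixed-length window.

    The first window - 1 values are NaN because not enough history exists.
    """
    series = np.asarray(series, dtype=float)
    w = frac_diff_weights(d, window)[::-1]  # align with oldest-first slices
    out = np.full(series.shape, np.nan)
    for t in range(window - 1, len(series)):
        out[t] = np.dot(w, series[t - window + 1:t + 1])
    return out

print(frac_diff_weights(0.6, 3))         # first weights for d = 0.6
print(frac_diff([1, 3, 6, 10], 1.0, 2))  # d = 1 reduces to a first difference
```

With d = 1 only the first two weights are non-zero, so the result is the ordinary first difference. Values of d between 0 and 1 keep a slowly decaying memory of past prices, which is why a factor of 0.6 can be stationary while preserving more information than returns.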

Also, 25 indicators have been added to increase predictive power.


LSTM Modelling

Link to Model Code: Open In Colab

The model is a stacked LSTM with up to four inputs, built on top of the TensorFlow API. The inputs are:

  1. Window of past n returns (the fwd_return is dropped, since it would cause data leakage).
  2. Topics classified from BERT with UMAP, as dummy variables
  3. Sentiments on news (readability metrics, polarity...)
  4. Stock market indicators

With one layer and no tuning, the model achieves an AUC above 0.5 on the whole test set. In addition, it reaches 54% on out-of-bag samples.
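The first input, the window of past returns, can be sketched as follows. This is an illustrative construction, assuming the target is the direction of the next day's return; the name make_windows and the sample returns are hypothetical.

```python
import numpy as np

def make_windows(returns, n):
    """Build windows of n past returns and next-day up/down targets.

    Row i holds returns[i : i + n]; its target is the sign of returns[i + n],
    so no window ever contains the value it is asked to predict.
    """
    returns = np.asarray(returns, dtype=float)
    X = np.array([returns[i:i + n] for i in range(len(returns) - n)])
    y = (returns[n:] >= 0).astype(int)
    return X, y

# Hypothetical daily returns.
returns = [0.01, -0.02, 0.005, 0.0, -0.01, 0.03]
X, y = make_windows(returns, n=3)
print(X.shape)     # (3, 3)
print(y.tolist())  # [1, 0, 1]

# Recurrent layers usually expect (samples, timesteps, features).
X_lstm = X[..., np.newaxis]
print(X_lstm.shape)  # (3, 3, 1)
```

Keeping the target strictly after the window is what prevents the leakage mentioned in the first list item.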


Surprisingly, the best combination of inputs has been found to be the window of n returns plus the topics. That is, topics from BERT led to better predictions of the stock price direction than the stock indicators did.

Classic Models approach

Link to Classic Models code: Open In Colab

Since the LSTM models performed well, non-deep-learning models were tested as well. The cross-validation results were reasonable; some models reached an AUC of 56. The feature importance was: image

That is, the most important features were the past returns, followed to some extent by some topics, one sentiment feature and one indicator. K-best feature selection gave the best performance when all 230 features were kept. LGBM and RF have been tuned using Optuna and skopt respectively. Finally, voting classifiers with hard and soft voting were built, resulting in an AUC of around 55 on the test set. That is a solid performance, obtained faster than with the LSTMs.
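The difference between hard and soft voting can be shown with a small sketch. The probabilities below are invented and the three rows merely stand in for three fitted models; this is not the ensemble code from the notebook.

```python
import numpy as np

def soft_vote(probas):
    """Average the class-1 probabilities across models, then threshold at 0.5."""
    return (np.mean(probas, axis=0) >= 0.5).astype(int)

def hard_vote(probas):
    """Each model casts its own 0/1 vote; the strict majority wins."""
    votes = (np.asarray(probas) >= 0.5).astype(int)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Hypothetical class-1 probabilities: one row per model, one column per day.
probas = np.array([
    [0.90, 0.45, 0.40],
    [0.40, 0.45, 0.60],
    [0.45, 0.80, 0.55],
])

print(soft_vote(probas).tolist())  # [1, 1, 1]
print(hard_vote(probas).tolist())  # [0, 0, 1]
```

On the first two days a single confident model pulls the average above 0.5, while the majority of models still vote "down", which is why the two schemes can score differently.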


Colab link to tuning: Open In Colab

After the success achieved with a stacked LSTM, efforts have been made to optimize the neural network hyperparameters, such as the number of units per layer, layer dropouts, optimizer epsilon and learning rate, and batch size. To avoid overfitting, the keras-tuner base class Tuner has been overridden in order to add a Blocking Time Series Cross-Validation while optimizing parameters using Bayesian Optimization. The results of the cross-validation vary, but top performing configurations reach up to 55-57% average AUC on the test sets. The following image represents how Blocking Time Series Cross-Validation works. The main idea is to avoid data leakage, and it is said to be better for cross-validation than standard cumulative cross-validation (the Sklearn TimeSeriesSplit implementation).


Train-sets in blue, test-sets in red.

A sliding window option was also added, in order to be able to perform two different types of cv on time series. Both scripts (sliding window and blocking time series) can be found in src/ts_utils/
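As a rough illustration of the two schemes, the generators below yield train and test indices for each fold. They are a sketch under assumptions (an 80/20 split inside each block, a window that advances by the test size) and are not the implementation found in src/ts_utils/; all names are hypothetical.

```python
def blocking_time_series_split(n_samples, n_splits, train_fraction=0.8):
    """Yield (train_idx, test_idx) for consecutive, non-overlapping blocks.

    Each block is split chronologically, so folds never share observations.
    """
    fold_size = n_samples // n_splits
    for i in range(n_splits):
        start = i * fold_size
        stop = start + fold_size
        mid = start + int(train_fraction * fold_size)
        yield list(range(start, mid)), list(range(mid, stop))

def sliding_window_split(n_samples, train_size, test_size):
    """Yield fixed-size train/test windows that move forward by test_size."""
    start = 0
    while start + train_size + test_size <= n_samples:
        mid = start + train_size
        yield list(range(start, mid)), list(range(mid, mid + test_size))
        start += test_size

for train_idx, test_idx in blocking_time_series_split(20, n_splits=2):
    print(train_idx[0], train_idx[-1], test_idx)

for train_idx, test_idx in sliding_window_split(10, train_size=4, test_size=2):
    print(train_idx, test_idx)
```

In the blocking variant no observation is reused across folds, whereas in the sliding variant consecutive training windows overlap.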

Classic and LSTM CV scores and time to run (Google Colab)

| Model | Time (5-fold CV) | AUC (tuned, CV) |
| --- | --- | --- |
| 2 Layer LSTM | 1 min 15 s | Up to 58.68 |
| 3 Layer LSTM | 3 min 20 s | Up to 57.28 |
| Soft Voting Ensemble: LGBM + RF + XGB | 20 s | Up to 56.60 |
| Hard Voting Ensemble: LGBM + RF + XGB | 21 s | Up to 58.25 |
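For readers unfamiliar with the metric, AUC is the probability that a randomly chosen "up" day receives a higher score than a randomly chosen "down" day, with ties counted as one half. The sketch below computes it by direct pairwise comparison on a 0 to 1 scale (the table reports it multiplied by 100); the function name and the sample data are illustrative only.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

A value of 0.5 corresponds to random ranking, which is the reference point for the scores in the table.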


Access the Model Comparison and Metric Analysis notebook in Colab: Open In Colab

Finally, the metric analysis and model comparison. This last notebook recaps all models built across the notebooks. Since no model seemed to consistently outperform a given threshold (or there was a lot of bias towards one class), a stacking classifier was built. It incorporates the top 2 performing models on the test set, in order to reduce variance and, hopefully, improve the AUC. The results have been impressive: an AUC of 62 and 65% accuracy on out-of-bag data. The ROC curves:


The Precision Recall Curves:


Finally, the confusion matrix of the stacking model:


The model outperformed both the "random" 50% up-down model and the always-buy strategy (which reaches 56% accuracy on the out-of-bag data). As for things to improve, data was really lacking. More observations would have been very helpful to the model: with a standard train-test split there is a single breakpoint where the test set begins, and that may be causing problems in the prediction. Still, the final performance seems to be really good.
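The always-buy baseline is simply the share of "up" days in the evaluation data, so it can be computed without any model. The sketch below uses invented labels and predictions to show the comparison; the function names are hypothetical.

```python
import numpy as np

def always_buy_accuracy(y_true):
    """Accuracy obtained by predicting 'up' (1) for every day."""
    return float(np.mean(np.asarray(y_true) == 1))

def accuracy(y_true, y_pred):
    """Fraction of days on which the prediction matches the label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Hypothetical out-of-sample labels and model predictions.
y_true = [1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0]

print(always_buy_accuracy(y_true))  # 0.625
print(accuracy(y_true, y_pred))     # 0.75
```

Because markets rise on more than half of the days, this baseline is a stricter reference than the 50% coin flip.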

Future work

In order to improve performance, it could be interesting to try more specific news, that is, industry-specific news. In addition, single-stock prediction or industry indices would very probably be better suited to NLP models: the correlation with the news would be higher and hence better performance could plausibly be achieved. Also, as stated earlier, more data would have been needed to train the model, whether in the form of other stocks, a higher frequency, a longer time frame, or a GAN built to generate synthetic training data. To improve the model itself, encoder-decoder architectures could be a good option, as could TCNs or other forms of convolutional/recurrent networks. Finally, the stacking model could be further improved by using a neural net to combine predictions instead of XGBClassifier.


This repository is licensed under the GPL-2.0 License. See LICENSE for details.