# **Sentiment analysis with generative modelling: application for stock price forecasting**

**Table of Contents**

1. Introduction
    1. *Sentiment analysis for stock price forecasting*
    2. *Differing approaches*
    3. *A novel generative approach: SSESTM*
2. Supervised Sentiment Extraction via Screening and Topic Modelling (SSESTM)
    1. *Correlation screening for removing neutral words*
    2. *Learning Positive/Negative Vocabulary Sets*
    3. *Inferring Sentiment from Unseen News Articles*
3. Setup
    1. *Data*
    2. *Text Preprocessing*
    3. *Model Implementation*
    4. *Hyperparameter Selection and Learning Strategy*
    5. *Performance Evaluation*
4. Use-case: Forecasting Boeing Stock Prices
    1. *Model Performance: All Data*
    2. *Model Performance: Time Series Environment*
5. Conclusion
    1. *Summary*
    2. *Limitations and potential improvements*

# **1. Introduction**

## 1.1. Sentiment analysis for stock price forecasting

<img src="images/rio_tinto_1.png">

<img src="images/goldman.png">

**Figure 1**: Originals from [The Guardian](https://www.theguardian.com/us)

Since the 2010s, the financial services industry has shown a rising interest in applications of natural language processing algorithms. One such application is **sentiment analysis for stock price forecasting**. The relationship between stock prices and news articles is not a novel subject, but increasing compute power and the advent of Big Data have made it more accessible to financial institutions. Examples of newsworthy events with repercussions on stock market prices include:

- Corporate scandals: e.g. Boeing 737 Max's crashes, Rio Tinto's accidental destruction of the Australian Aboriginal Juukan Gorge, CD Projekt RED's botched launch of their flagship AAA video game on last-generation gaming consoles
- Market downturn events: e.g. 2007-08 Financial/Real Estate Crisis, COVID-19 Pandemic

In past economic literature, stock market prices were assumed to incorporate all available news and textual information (Eugene Fama's efficient-market hypothesis from 1970). In theory, there were therefore no gains to be made from mining text for arbitrage opportunities (e.g. after major news events, buying/selling a stock before the market reacts and adjusts the stock's market price).

While this is a reasonable assumption, it is also worth considering a relaxation of Fama's hypothesis: whereas in the long run stock prices fully incorporate all the available information, in the short term they don't always adjust immediately to incoming information. There might even be news events that foreshadow future stock price movements (e.g. solvency issues, disappointing sales). Such delayed adjustments would reveal arbitrage opportunities that can be exploited for financial gain.

As sentiment analysis is an application of text mining which aims to categorize textual data as either positive, negative or neutral (via a scoring system known as polarity), our goal here is to **classify textual data as either potentially indicative of future stock price increases or decreases** (eventually leading to investment/portfolio allocation decisions).

## 1.2. Differing approaches for sentiment analysis

For asset managers and data scientists looking to integrate sentiment analysis into the investment decision process, there are three main approaches: using pre-built vocabularies (of positive/negative sounding words) to output a polarity score on text, training supervised learning algorithms (from linear models to neural networks), or relying on commercial solutions offered by financial data vendors (e.g. Bloomberg, Reuters, RavenPack, IHS Markit). These approaches have their limitations. Many pre-built vocabularies (such as the Harvard-IV psychosocial dictionary) weren't specifically designed with financial forecasting in mind and thus might not be relevant for stock price prediction. Commercial solutions are opaque, as data vendors are unlikely to reveal the full details of their scoring systems. **Supervised learning algorithms** represent a more viable solution as they address the shortcomings of both pre-built vocabularies and commercial solutions: first, the sentiment vocabulary is learned directly from data and therefore suits the user's use-case. Second, there is no opaqueness as the user is in control of the scoring methodology (e.g. logistic regression, recurrent neural networks, supervised topic models).

<br>

<img src="images/ravenpack.png">

**Figure 2**: Screenshot illustration of RavenPack's Text Analytics Commercial Software obtained from a [blogpost on AWS](https://aws.amazon.com/solutions/case-studies/ravenpack-case-study/)

*How do we select the most suitable scoring methodology for our prediction task? We will start by restricting ourselves to bag-of-words (BoW) representations, which ignore word context and only record which words are present in a document. There are two competing approaches: discriminative and generative modeling. In **discriminative models**, we are only interested in modeling (in our stock prediction use-case) stock returns conditional on our text inputs, $\mathbb{P}(Y|X)$. Logistic regression, which makes no assumptions on how the text inputs are distributed, fits this description: it's straightforward, implemented in a wide range of software libraries and has the added bonus of being interpretable (we will know exactly which words contribute positively/negatively to stock price movements). Conversely, **generative models** look to model the joint probability distribution of stock returns and text inputs, $\mathbb{P}(X, Y)$. Through Bayes' formula, this amounts to modeling the text inputs conditional on the prediction variable, $\mathbb{P}(X|Y)$, i.e. which words are associated with negative/positive returns, $\mathbb{P}(X|Y=0)$ and $\mathbb{P}(X|Y=1)$.*
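To make the distinction concrete, here is a minimal sketch on a toy bag-of-words matrix, assuming scikit-learn is available. The vocabulary, counts and labels are invented for illustration only. Logistic regression fits $\mathbb{P}(Y|X)$ directly, whereas multinomial Naive Bayes fits one word distribution $\mathbb{P}(X|Y)$ per class and classifies through Bayes' formula.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy bag-of-words counts (rows: documents, columns: words); purely illustrative.
vocab = ["growth", "profit", "crash", "lawsuit"]
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 2, 1],
              [0, 1, 1, 2]])
y = np.array([1, 1, 0, 0])  # 1 = positive future return, 0 = negative

# Discriminative: models P(Y | X) directly.
discriminative = LogisticRegression().fit(X, y)

# Generative: models P(X | Y) for each class, then applies Bayes' formula.
generative = MultinomialNB().fit(X, y)
# Row 0 holds P(word | Y=0), row 1 holds P(word | Y=1).
word_probs = np.exp(generative.feature_log_prob_)

new_doc = np.array([[1, 1, 0, 0]])
p_disc = discriminative.predict_proba(new_doc)[0, 1]
p_gen = generative.predict_proba(new_doc)[0, 1]
```

The rows of `word_probs` play the same role as the positive and negative vocabularies discussed later: they can be read directly as per-class word distributions.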

<br>

<img src="images/discriminative_vs_generative.png">

**Figure 3**: Original from [Tu, 2007 on Semantic Scholar](https://www.semanticscholar.org/paper/Learning-Generative-Models-via-Discriminative-Tu/23b80dc704e25cf52b5a14935002fc083ce9c317).

*Which approach outperforms the other is a matter of debate, although generative models tend to be less commonly used in the stock price forecasting literature. This stems from the limited availability of generative algorithms compared to discriminative models (which span from linear classifiers to neural networks). The best-known generative models in the NLP literature for bag-of-words representations are topic models: dimensionality reduction algorithms that are mostly unsupervised and used for document clustering (thus on unlabelled text data). A few extensions allow for supervised topic models (e.g. Labelled Latent Dirichlet Allocation), which will be considered in our use-case.*

## 1.3. A novel generative approach: SSESTM

*While it is often unclear whether generative models - in spite of being more complex - outperform discriminative models,* Zheng Tracy Ke, Dacheng Xiu and Bryan Kelly - respectively professors at Harvard University, University of Chicago and Yale University (the last of whom also works as a Quantitative Researcher for AQR Capital Management) - devised a new supervised generative model for sentiment analysis: **Supervised Sentiment Extraction via Screening and Topic Modelling** (SSESTM), essentially a 3-stage methodology:

1. First, screen for words likely to portend stock price increases (or, conversely, decreases).
2. Then, rely on a supervised topic model to learn 2 separate sets of sentiment vocabularies (or topics): one that augurs positive returns and another that foreshadows negative returns.
3. Finally, use this generative model to score new, unseen documents. From those scored news articles, the authors generate investment recommendations: buy stocks with positive news scores, sell stocks with negative scores.

Stock returns are notoriously difficult to predict due to weak signal-to-noise properties, but the authors do hold that their algorithm provides sound investment recommendations, enough for generating decent financial performance.

Following Ke et al. (2019), we will take an interest in stock price forecasting solely using textual data. Using news/forum data from our own internal NLP datalake (e.g. 3 million text inputs for Boeing), **we will attempt to forecast Boeing stock prices (from January 2015 to November 2020) with the authors' SSESTM method**.

We find that [Insert Conclusions].

# **2. SSESTM**

<img src="images/sestm.png" width=650 height=400>

**Figure 4**: Original from [Ke et al., 2019](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3389884)

Let's assume we have a corpus of $n$ news articles and a lexicon of $m$ word tokens. We can then model our corpus as a bag-of-words representation: a document-word matrix $D \in \mathbb{R}^{n \times m}$.

From this corpus $D$, the goal of SSESTM is to learn custom sentiment (positive/negative) dictionaries from one's own use-case dataset, without having to rely on pre-existing rule-based dictionaries or purchase expensive solutions from third-party data vendors. This requires two components:

- Select a set of words $\hat S$ that are likely to foreshadow rises/decreases in the phenomenon we are trying to forecast. SSESTM accomplishes this simply through word counts. Stopwords, for example (common words such as "the", "a", "thus", which are unlikely to carry any sentiment), are screened out at this stage.
- From this filtered vocabulary list, weight each word by the sentiment it is the likeliest to foreshadow: e.g. "stimulus" for positive returns, "coronavirus" for negative returns. This is done through a supervised topic model (akin to Labelled LDA) that learns 2 distinct topics (or dictionaries): one for words that presage positive returns ($O_{+}$) and one for words that presage negative returns ($O_{-}$).

After learning $O_{+}$ and $O_{-}$, we can infer the sentiment score $\hat p$ of unseen news articles through maximum likelihood estimation.

## 2.1. Correlation screening for excluding neutral words

**2.1.S1**. For each word $1 \leq j \leq m,$ let:

$$f_{j}=\frac{\# \text { articles including word } j \text { AND having } \operatorname{sgn}(y)=1}{\# \text { articles including word } j}$$

This first step ranks words based on how often they appear in documents during periods of positive returns. Words with high $f_j$ are likely to augur positive returns, whereas words with low $f_j$ are likely to portend negative returns. Sandwiched in between are neutral words, such as stopwords, which are unlikely to be indicative of stock rises/decreases.

**2.1.S2**. For proper thresholds $\alpha_{+}>0, \alpha_{-}>0,$ and $\kappa>0$ to be determined, construct:

<br>

$$
\widehat{S}=\left(\left\{j: f_{j} \geq 1 / 2+\alpha_{+}\right\} \cup\left\{j: f_{j} \leq 1 / 2-\alpha_{-}\right\}\right) \cap\left\{j: k_{j} \geq \kappa\right\}
$$
where $k_{j}$ is the total count of articles in which word $j$ appears.

The next step is to exclude neutral words, i.e. vocabulary with middling $f_j$ values. The authors require the user to tune several hyperparameters:

- $\alpha_{+}$, the margin above $1/2$ that $f_j$ must reach for word $j$ to be kept as a positive word
- $\alpha_{-}$, the margin below $1/2$ that $f_j$ must reach for word $j$ to be kept as a negative word
- $\kappa$, the minimum number of articles in which a word must appear (excludes infrequent words)

Two opposite pitfalls must be avoided: excluding too many words will drastically limit the size of the vocabulary, while excluding too few will dilute the signal SSESTM learns from. Done well, screening leads to a large dimensionality reduction, which explains how the authors claim that their approach can run on a laptop.
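The two screening steps can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `screen_words`, the toy matrix and the choice of assigning $f_j = 1/2$ to words that never appear are assumptions made here.

```python
import numpy as np
from scipy import sparse

def screen_words(X, y, alpha_plus, alpha_minus, kappa):
    """Return the indices of sentiment-charged words (S-hat) and the f_j values.

    X : sparse (n_docs, n_words) matrix of word counts
    y : array of returns, one per document
    """
    present = (X > 0).astype(np.int64)            # 1 if word j appears in article i
    k = np.asarray(present.sum(axis=0)).ravel()   # k_j: articles containing word j
    positive = (np.asarray(y) > 0).astype(np.int64)
    k_pos = present.T @ positive                  # articles with word j AND sgn(y) = 1
    # Assumption: words that never appear are treated as neutral (f_j = 1/2).
    f = np.divide(k_pos, k, out=np.full(k.shape, 0.5), where=k > 0)
    charged = (f >= 0.5 + alpha_plus) | (f <= 0.5 - alpha_minus)
    return np.flatnonzero(charged & (k >= kappa)), f

# Toy corpus: word 0 only appears before gains, word 1 only before losses,
# word 2 appears everywhere (neutral).
X = sparse.csr_matrix([[1, 0, 1], [2, 0, 1], [0, 1, 1], [0, 3, 1]])
y = np.array([0.02, 0.01, -0.03, -0.01])
S_hat, f = screen_words(X, y, alpha_plus=0.1, alpha_minus=0.1, kappa=2)
```

Because the computation reduces to one sparse matrix-vector product, it stays cheap even for corpora with millions of documents.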

## 2.2. Learning Positive/Negative Vocabulary Sets

After filtering out neutral words, the next step is learning positive/negative term vocabularies from our data. One approach would be to run a LASSO-penalized classifier on our corpus $D_{[\hat S]}$ (minus the neutral words) with words acting as features. We would obtain positive/negative weights for a decent number of words and therefore our positive/negative vocabularies. The authors prefer to implement a generative model instead, where the joint distribution between words and returns is fully specified and learned from data.

A popular family of generative models for modelling a distribution of words over documents is topic models (e.g. LDA, HDP, CTM). For learning positive/negative dictionaries, the authors construct what they describe as a "2-topic topic model" that models positive auguring words as its first topic $\widehat{O}_{+}$ and negative auguring terms as its second topic $\widehat{O}_{-}$. Their "2-topic topic model" differs somewhat from classical topic models, as the vast majority of topic models are unsupervised and thus don't require inputting labels. In contrast, the authors' model is a form of supervised topic model, which is rarer in the topic modeling literature (e.g. Labelled LDA).

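In this supervised topic model with 2 topics, the word counts $d_{i,[S]}$ of each document are modelled with a multinomial distribution:

$$ d_{i,[S]} \sim \text{Multinomial}\left(s_i, p_i O_{+} + (1 - p_i) O_{-}\right) $$

where $s_i$ is the total count of sentiment-charged words in document $i$ and $p_i \in [0, 1]$ is its sentiment score. The expected value of the word frequencies $\tilde{d}_{i,[S]} = s_i^{-1} d_{i,[S]}$ can thus be written as:

$$ \mathbb{E}(\tilde{d}_{i,[S]}) = p_i O_{+} + (1 - p_i) O_{-} $$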

We now need to learn $\hat p_i$ and topics/vocabularies $\widehat{O}_{+}$ and $\widehat{O}_{-}$.

**2.2.S3**. Sort the returns $\left\{y_{i}\right\}_{i=1}^{n}$ in ascending order. For each $1 \leq i \leq n,$ let:
$$
\widehat{p}_{i}=\frac{\text { rank of } y_{i} \text { in all returns }}{n}
$$

For learning $\hat p_i$, the authors rely on rank statistics, which are known to be robust to outliers.
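As a small illustration of step S3, the sketch below computes $\hat p_i$ with `scipy.stats.rankdata`. The returns are made up, and breaking ties by order of appearance (`method="ordinal"`) is an assumption, since the text does not specify how ties are handled.

```python
import numpy as np
from scipy.stats import rankdata

returns = np.array([0.03, -0.01, 0.10, -0.05])  # illustrative returns y_i

# Rank 1 is the smallest return, rank n the largest.
p_hat = rankdata(returns, method="ordinal") / len(returns)

# Robustness to outliers: an extreme return changes nothing as long as
# the ordering of the returns is preserved.
returns_outlier = np.array([0.03, -0.01, 10.0, -0.05])
p_hat_outlier = rankdata(returns_outlier, method="ordinal") / len(returns_outlier)
```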

**2.2.S4**. For $1 \leq i \leq n,$ let $\widehat{s}_{i}$ be the total counts of words from $\widehat{S}$ in article $i,$ and let $\hat{d}_{i}=\widehat{s}_{i}^{-1} d_{i,[\widehat{S}]}$. Write $\widehat{D}=\left[\widehat{d}_{1}, \widehat{d}_{2}, \ldots, \widehat{d}_{n}\right]$.

Recall the expected value per document:

$$ \mathbb{E}(\tilde {d}_{i,[S]}) = p_i O_{+} + (1 - p_i) O_{-} $$

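Therefore, stacking the $\tilde{d}_{i,[S]}$ of all corpus documents as the columns of a matrix $\widetilde{D}$, we obtain in matrix form:

$$ \mathbb{E} (\widetilde{D}) = O W $$

where $O = \left[O_{+}, O_{-}\right]$ and the $i$-th column of $W$ is $(p_i, 1 - p_i)^{\prime}$. According to the authors, $O$ can be estimated with an ordinary least squares (OLS) regression of $\widehat{D}$ (the empirical counterpart of $\widetilde{D}$) on $\widehat{W}$: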

<br>
$$
\widehat{O}=\widehat{D} \widehat{W}^{\prime}\left(\widehat{W} \widehat{W}^{\prime}\right)^{-1}, \quad \text { where } \quad \widehat{W}=\left[\begin{array}{cccc}
\widehat{p}_{1} & \widehat{p}_{2} & \cdots & \widehat{p}_{n} \\
1-\widehat{p}_{1} & 1-\widehat{p}_{2} & \cdots & 1-\widehat{p}_{n}
\end{array}\right]
$$
<br>

Set negative entries of $\widehat{O}$ to zero and re-normalize each column to have a unit $\ell^{1}$ -norm. We use the same notation $\widehat{O}$ for the resulting matrix. We also use $\widehat{O}_{\pm}$ to denote the two columns of $\widehat{O}=\left[\widehat{O}_{+}, \widehat{O}_{-}\right]$.
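Step S4 can be sketched with `numpy` as below. The function name `estimate_topics` and the toy counts are assumptions for illustration. The sketch uses a linear solve rather than an explicit matrix inverse, which is algebraically equivalent to the formula above, and it assumes that every word of $\widehat{S}$ keeps some positive weight in each column after clipping.

```python
import numpy as np

def estimate_topics(D_counts, p_hat):
    """Estimate O-hat = [O_plus, O_minus] from counts restricted to S-hat.

    D_counts : (n_docs, |S|) word counts
    p_hat    : (n_docs,) rank-based sentiment scores
    """
    D_counts = np.asarray(D_counts, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    s = D_counts.sum(axis=1)
    keep = s > 0                                       # skip articles with no charged word
    D_hat = (D_counts[keep] / s[keep, None]).T         # (|S|, n): columns are d_i / s_i
    W = np.vstack([p_hat[keep], 1.0 - p_hat[keep]])    # (2, n)
    # O = D W' (W W')^{-1}, written as a linear solve.
    O = np.linalg.solve(W @ W.T, W @ D_hat.T).T        # (|S|, 2)
    O = np.clip(O, 0.0, None)                          # set negative entries to zero
    return O / O.sum(axis=0, keepdims=True)            # unit l1-norm per column

# Toy data generated exactly from O_plus = (0.8, 0.2) and O_minus = (0.2, 0.8).
D_counts = np.array([[16, 4], [13, 7], [10, 10], [7, 13]])
p_hat = np.array([1.0, 0.75, 0.5, 0.25])
O_hat = estimate_topics(D_counts, p_hat)
O_plus, O_minus = O_hat[:, 0], O_hat[:, 1]
```

On this noiseless toy example the regression recovers the two generating word distributions exactly; on real data the clipping and re-normalization steps matter.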

## 2.3. Inferring Sentiment from Unseen News Articles

To score new articles, the authors estimate $\hat p$ by penalized maximum likelihood.

**2.3.S5**. Let $\widehat{s}$ be the total count of words from $\widehat{S}$ in the new article. Obtain $\widehat{p}$ by

<br>

$$
\widehat{p}=\arg \max _{p \in[0,1]}\left\{\widehat{s}^{-1} \sum_{j \in \widehat{S}} d_{j} \log \left(p \widehat{O}_{+, j}+(1-p) \widehat{O}_{-, j}\right)+\lambda \log (p(1-p))\right\}
$$

<br>

where $d_{j}, \widehat{O}_{+, j},$ and $\widehat{O}_{-, j}$ are the $j$ th entries of the corresponding vectors, and $\lambda>0$ is a tuning parameter.
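A minimal sketch of step S5 with `scipy.optimize.minimize_scalar` follows. The function name `score_article`, the neutral fallback of 0.5 for articles without any word from $\widehat{S}$, and the small `eps` that keeps $p$ away from 0 and 1 are assumptions made for this illustration. It also assumes that every word observed in the article has a positive weight in at least one of the two topics.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def score_article(d, O_plus, O_minus, lam, eps=1e-6):
    """Penalized maximum likelihood estimate of p for one new article.

    d : counts of the words of S-hat in the article
    """
    d = np.asarray(d, dtype=float)
    s = d.sum()
    if s == 0:
        return 0.5  # assumption: no charged word, fall back to a neutral score

    used = d > 0

    def neg_objective(p):
        mix = p * O_plus + (1.0 - p) * O_minus
        loglik = np.sum(d[used] * np.log(mix[used])) / s
        return -(loglik + lam * np.log(p * (1.0 - p)))

    res = minimize_scalar(neg_objective, bounds=(eps, 1.0 - eps), method="bounded")
    return res.x

O_plus = np.array([0.8, 0.2])
O_minus = np.array([0.2, 0.8])
p_pos = score_article([4, 0], O_plus, O_minus, lam=0.1)  # only the "positive" word
p_neg = score_article([0, 4], O_plus, O_minus, lam=0.1)  # only the "negative" word
p_mid = score_article([2, 2], O_plus, O_minus, lam=0.1)  # balanced article
```

The penalty term $\lambda \log(p(1-p))$ pulls the estimate towards 0.5, so larger values of `lam` produce more conservative scores.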

# **3. Setup**

Our experimental approach differs from Ke et al. (2019) in three aspects:

- Ke et al. (2019) trained their algorithms on a large Dow Jones news archive (*Dow Jones Newswires Machine Text Feed and Archive*), which spans from January 1989 to December 2012 (with data from February 2004 to July 2017 as their validation set). Our news dataset is more recent as it spans from **January 2015 to November 2020**.
- The authors' news dataset is of considerably higher quality than the text data we will be using for our sentiment analysis use-case. This is primarily because their text articles from the *Dow Jones Newswires* archive are tagged to specific stocks. Our news dataset is **more raw and noisier**, which makes for a more challenging use-case.
- Their dataset contains approximately 13 million news articles (6.5 million for training and 6.7 million for validation) and is multi-asset. Our use-case differs considerably as we will be focusing on a **single stock** (here Boeing Company).

## 3.1. Data

### 3.1.1. Prediction Variable: Stock Price Returns

We will focus on predicting returns for the Boeing Company. Boeing is a decent use-case for sentiment analysis, with 2 major news events that had a considerable negative impact on the company: the 737 Max's flight woes and the COVID-19 pandemic's darkening of the aviation industry.

<img src="images/boeing_stock_price.png">

**Figure 5**: Boeing Stock Prices at Opening Time (9:00), available on Yahoo Finance

Stock returns are notoriously difficult to forecast due to a low signal-to-noise ratio, and Ke et al. (2019) spend a few sections on the lagged relationship between news and stock price movements. Thus, we have to be careful in defining our target variable, specifically the time horizon: are we just looking at 1-day returns? Or do we want to define a longer horizon, which will allow us to better capture market regimes? Huge stock price movements tend to undergo a correction, which makes 1-day returns a poor target. After running a few tests, we chose 10 business days of cumulative returns as our prediction variable. (**Note: Need a better justification**)

Thus we will compute cumulative returns over a rolling window of 10 business days:

```python
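horizon = 10     # forecast horizon, in business days
target = 'Open'  # price column used to compute returns

# Sum of daily simple returns over a rolling 10-day window, shifted back so
# that row t holds the returns of the 10 days starting the day after t.
# The last rows have no complete future window and are left as NaN.
target_data['return'] = (
    target_data[target]
    .pct_change()
    .rolling(horizon)
    .sum()
    .shift(-horizon - 1)
)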
```

From which we obtain for Boeing:

<img src="images/boeing_10d_cumulative_returns.png">

**Figure 6**: Rolling window of 10-day future cumulative returns on Boeing Open Stock Prices

As the goal of sentiment analysis is to predict the polarity (positive/negative) of a textual input, this can be framed as a binary classification problem where we look to predict the likelihood that a text document will lead to future positive returns. Our training data will be composed of text documents preceding positive returns (labelled as **1**) and text documents preceding negative returns (labelled as **0**).

```python
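target_data = target_data.dropna(subset=['return']).copy()  # drop rows without a future window
target_data['return'] = (target_data['return'] > 0.0).astype(int)  # 1 = positive return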
```

Ideally, the distribution of our prediction variable should be balanced. In practice this is never the case, as market environments change and investor behavior is shaken by downturn events. With the way we construct our sentiment analysis setup, we obtain unbalanced returns over time:

<img src="images/boeing_return_distrib.png">

**Figure 7**: Average Distribution of Positive Labels (i.e. stock price increases) for Boeing

In the space of 3 months, the distribution of returns can change from mostly positive returns (over 80%) to almost unanimously negative returns (close to 0%). Realistically, it is not hard to imagine that a bearish market environment (e.g. 2008 Financial Crisis, COVID-19 in March) would skew our prediction variable distribution towards negative returns. Market corrections after downturns (e.g. COVID-19 post-April) would likewise skew it towards positive returns. This will have to be addressed when running our experiments.
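A quick way to monitor this imbalance is to compute the share of positive labels per period. The sketch below does so per calendar quarter with `pandas`; the dates and labels are invented, and the quarterly grouping is only one possible choice.

```python
import pandas as pd

# Illustrative binary labels indexed by date (1 = positive 10-day return).
labels = pd.Series(
    [1, 1, 0, 0, 0],
    index=pd.to_datetime(["2020-01-15", "2020-02-14", "2020-03-16",
                          "2020-04-15", "2020-05-15"]),
)

# Share of positive labels per calendar quarter.
positive_share = labels.groupby(labels.index.to_period("Q")).mean()
```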

### 3.1.2. Explanatory Variables: Text Data

For Boeing text inputs, we had previously collected an extremely large corpus of news, blog posts and forum posts from data vendors, all stored on a datalake. Through ElasticSearch queries and disambiguation, we were able to obtain roughly 3 million text documents:

<img src="images/boeing_article_compo.png">

**Figure 8**: Meta-information on 2.7 million text data available for Boeing (from Jan. 2015 to Nov. 2020)

Let's check news count over time for Boeing:

<img src="images/boeing_article_count.png">

**Figure 9**: Daily/bi-weekly news count available for Boeing (from Jan. 2015 to Nov. 2020)

## 3.2. Text Preprocessing

### 3.2.1. Regular expressions and stopwords

Stopwords are words that carry little intrinsic meaning and mostly serve a grammatical role (e.g. the, who, where). We can also add a number of commonly used words (e.g. say, months). Our list combines NLTK stopwords with those from spaCy, for a total of 402 stopwords. Note that these words are highly likely to be seen as neutral during our correlation screening, which keeps only positive/negative words; thus if we missed a couple of stopwords, SSESTM will take care of excluding them.

```python
from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords

nltk_stopwords = list(stopwords.words('english'))
spacy_stopwords = list(spacy_stopwords)
STOPWORDS = list(set(nltk_stopwords + spacy_stopwords))
```

Additionally, strings representing emails or HTTP links are removed to prevent tokens such as "http" or "www" from appearing in our learned vocabularies. Punctuation is also removed through regular expressions.

```python
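import re

# Raw string, so that the backslash is a literal character to strip.
punct = r"""-!"'#&$%\()*+,.:;<=>?@[]^_`{|}~–’"""
punc_pattern = '|'.join(re.escape(char) for char in punct)
# Emails, URLs (http\S+ also covers https), whitespace runs and punctuation
# are all replaced by a single space.
full_regex_pattern = r'(\S*@\S*\s?|http\S+|\s+|{})'.format(punc_pattern)
text_data['text'] = text_data['text'].str.replace(full_regex_pattern, ' ', regex=True)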
```

### 3.2.2. Lemmatization

<img src="images/SpaCy_logo.png" width="250" height="150" class="center">

**Figure 10**: Logo for [spaCy](https://spacy.io/)

Our next step is running lemmatization on our text inputs. There are many redundant word inflections (e.g. plural forms) that can be reduced to a common lemma. Naturally, when dealing with millions of text documents with lengths ranging from a few words to over 100,000 words, this process can be time-consuming. For lemmatization, we opted for spaCy on our large corpus (inspired from [Cite Source]):

```python
import spacy

# Disable the pipeline components we do not use, to speed up processing.
nlp = spacy.load('en_core_web_lg', disable=['tagger', 'parser', 'ner'])
# spaCy v2 API; with spaCy v3 this line becomes nlp.add_pipe('sentencizer').
nlp.add_pipe(nlp.create_pipe('sentencizer'))

def lemmatize_pipe(doc):
    """Return the lowercased lemmas of a spaCy Doc as a single string.

    Keeps alphabetic tokens of at least 3 characters that are not stopwords.
    """
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha
                  and tok.lemma_ not in STOPWORDS
                  and len(tok.lemma_) >= 3]
    return ' '.join(lemma_list)

def preprocess_pipe(texts):
    """Lemmatize an iterable of raw texts, processed in batches by spaCy."""
    return [lemmatize_pipe(doc) for doc in nlp.pipe(texts, batch_size=200)]

text_data['text'] = preprocess_pipe(text_data['text'])
```

### 3.2.3. Bag-of-word representations

<img src="images/dask_logo.png" width="250" height="150" class="center">

**Figure 11**: Logo for [Dask](https://dask.org/)

As stated in previous sections, we are going to use **bag-of-words representations** to process our text data into a readable format for our supervised learning algorithms. Bag-of-words representations undoubtedly have limitations, primarily that word context is ignored (only the presence of words matters). They are also prone to computational difficulties, as increasing the size of our training vocabulary increases the dimensionality of our training matrix (and compounds the ensuing difficulties). They also require a decent amount of additional preprocessing (e.g. handling stopwords, synonyms, plurality, etc.).

Embeddings - which are methods for representing high-dimensional documents as lower-dimensional vectors - are often touted as superior alternatives to bag-of-words representations, as they capture word context more easily and can also be learned from training data. For NLP tasks more complex than sentiment analysis (e.g. language translation, text summarization, text generation such as GPT-3), embeddings have indeed superseded bag-of-words representations owing to their much improved performance and ability to leverage deep learning algorithms.

In spite of this, bag-of-words representations offer a crucial advantage over embeddings: interpretability. While there is a wide range of options for training embeddings (or relying on pretrained embeddings), bag-of-words representations are much easier to understand and, most importantly, to control: if our algorithm fails to perform, we can simply check the inputs for irregularities (assuming we use mostly interpretable supervised learning models).

In practice, to represent our text as bag-of-words, we rely on Scikit-Learn (or Dask ML) and its `CountVectorizer` class. It conveniently outputs the bag-of-words matrix as a sparse matrix, which eases storage and computation.

```python
import dask.bag as db
from dask_ml.feature_extraction.text import CountVectorizer as DaskMLCountVectorizer

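# Build the bag from the text column (iterating over the DataFrame itself
# would only yield its column names).
corpus = db.from_sequence(text_data['text'], npartitions=10)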
vectorizer = DaskMLCountVectorizer(token_pattern = r"\b[a-zA-Z]{3,}\b",
                                   stop_words=STOPWORDS,
                                   ngram_range=(1,1),
                                   max_features=8577)
X = vectorizer.fit_transform(corpus)
X_bow = X.compute()
X_bow
```

```
<445798x8577 sparse matrix of type '<class 'numpy.int64'>'
    with 66772209 stored elements in Compressed Sparse Row format>
```

For tokenization, we restrict ourselves to unigrams. The words most likely to be highlighted by SSESTM are frequent non-neutral words, and bigrams (by design) tend to appear less frequently than unigrams. Thus, for simplicity, they are excluded from our experiments.

## 3.3. Model Implementation

We provide a description of how we implemented **SSESTM**:

- ***Removal of neutral words***: there are two steps for removing neutral words. The first is to create a bag-of-words representation of the corpus; the second is to filter out words that empirically appear about equally often during bullish and bearish market regimes. For the bag-of-words representation, we use Scikit-Learn's `CountVectorizer` (which can be substituted with a parallelized variant provided by Dask ML). For filtering out neutral words, we compute a matrix dot product between the bag-of-words representation and the target vector (positive or negative returns). This gives, for each word, its frequency in documents published ahead of stock price increases/decreases, from which we can find and exclude neutral words. This proves to be extremely memory-efficient and fast, especially when dealing with representations of over 2 million documents;
- ***Learning sentiment topics/vocabularies***: for this step we need to implement rank statistics and then an ordinary least squares (OLS) regression, which can be done simply through `numpy` functions;
- ***Scoring out-of-sample documents***: for maximum likelihood estimation (MLE), we use `scipy` optimization functions.

As our baseline, we will consider 2 supervised discriminative models:

- **Logistic Regression** (implemented in Scikit-Learn)
- **LightGBM**

And 2 supervised generative models:

- **Naive Bayes** (implemented in Scikit-Learn)
- **Labelled Latent Dirichlet Allocation** (implemented in `tomotopy`)

## 3.4. Hyperparameter Selection and Learning Strategy

Naturally, training our algorithm once on the entire dataset would prevent us from capturing changes in word polarity over time (e.g. how investors and news outlets soured on Boeing's 737 Max after multiple incidents).

In the original paper, Ke et al. (2019) considered Dow Jones stocks from 2004 to 2017 and updated their model every year (on a rolling window basis), for 14 training sets in total. Word polarity can change over time from highs (e.g. new announcements concerning Boeing's new 737 Max, Trump tax cuts) to lows (e.g. 737 Max crashes, COVID-19), so it is important to factor in evolving market regimes. We will retrain our model every 3 months (or 12 weeks); each retrained model is then used to predict text polarity for unseen text the following month.

<img src="images/cross_val_boeing.png">

**Figure 12**: Time Series Cross-Validation Strategy (only subset of training/testing dates)
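A rolling-window split of this kind can be sketched as a small generator. The function name `rolling_splits`, the default of 4 test weeks (approximating "the following month") and the choice to slide the window forward by one test period are assumptions for illustration, not a description of the exact schedule used in our experiments.

```python
import pandas as pd

def rolling_splits(dates, train_weeks=12, test_weeks=4):
    """Yield (train_mask, test_mask) boolean arrays over a rolling window.

    Each training window is immediately followed by its testing window,
    so no future information leaks into training.
    """
    dates = pd.DatetimeIndex(dates)
    train_len = pd.Timedelta(weeks=train_weeks)
    test_len = pd.Timedelta(weeks=test_weeks)
    last_day = dates.max() + pd.Timedelta(days=1)
    start = dates.min()
    while start + train_len + test_len <= last_day:
        train_end = start + train_len
        test_end = train_end + test_len
        train_mask = (dates >= start) & (dates < train_end)
        test_mask = (dates >= train_end) & (dates < test_end)
        yield train_mask, test_mask
        start += test_len  # assumption: slide forward by one test period

# 24 weeks of daily dates give three (12-week train, 4-week test) windows.
dates = pd.date_range("2015-01-01", periods=168, freq="D")
splits = list(rolling_splits(dates))
```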

Choosing hyperparameters is a much trickier decision. Let's start with the issue of excluding neutral words. The authors let users set lower/upper bounds around a 50% threshold on how often a word appears in text documents published during positive market regimes. E.g. in our Boeing use-case, we could exclude words with appearance values between 48% and 52%. But this assumes a balanced distribution for our prediction variable (returns), which is unlikely to hold in a real-life environment, as we addressed beforehand. The following chart illustrates changes in word frequency over time:

<img src="images/vocab_dynamics.png">

**Figure 13**: Evolving word frequency over text and median word sentiment

A simple temporary fix is to set the thresholds for excluding neutral words dynamically: during bullish markets the threshold is raised above 50%, and conversely lowered during bearish markets. For each training set, we take the median polarity as our threshold (e.g. if "boeing" has a polarity of 0.43 and represents the median polarity of our vocabulary, then 43% is our threshold).

Through a grid search routine, we will test a number of combinations of all 4 hyperparameters and choose the set that delivers the best testing set performance:
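The dynamic threshold amounts to centering the screening rule on the median of the $f_j$ values instead of on $1/2$. A minimal sketch, where the function name `screen_around_median` and the toy values are hypothetical, and where the median is taken over the whole vocabulary (an assumption):

```python
import numpy as np

def screen_around_median(f, k, alpha_plus, alpha_minus, kappa):
    """Screen words around the median polarity rather than around 1/2.

    f : per-word share of articles preceding positive returns
    k : per-word count of articles containing the word
    """
    f = np.asarray(f, dtype=float)
    k = np.asarray(k)
    center = np.median(f)  # assumption: median over the whole vocabulary
    charged = (f >= center + alpha_plus) | (f <= center - alpha_minus)
    return np.flatnonzero(charged & (k >= kappa)), center

f = [0.30, 0.42, 0.43, 0.44, 0.60]
k = [10, 10, 10, 10, 10]
kept, center = screen_around_median(f, k, alpha_plus=0.05, alpha_minus=0.05, kappa=5)
```

With a median of 0.43, only the words at 0.30 and 0.60 fall outside the neutral band; a fixed 50% center would instead have kept the three words between 0.42 and 0.44 as negative words.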

- $(\alpha_{+}, \alpha_{-}) \in \{0.0, 0.01, 0.04\}$
- $\kappa \in \{1500, 5000\}$
- $\lambda \in \{0.0001, 0.5\}$
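The grid can be enumerated with `itertools.product`. The sketch below assumes $\alpha_{+} = \alpha_{-}$ for every combination (as in the grid search results reported later), and `select_best` takes a user-supplied `evaluate` callable, which is a hypothetical placeholder for training SSESTM and returning a test-set score.

```python
from itertools import product

alphas = [0.0, 0.01, 0.04]   # assumption: alpha_plus and alpha_minus are tied
kappas = [1500, 5000]
lambdas = [0.0001, 0.5]

grid = [
    {"alpha_plus": a, "alpha_minus": a, "kappa": k, "lam": lam}
    for a, k, lam in product(alphas, kappas, lambdas)
]

def select_best(grid, evaluate):
    """Return the parameter set with the highest score under `evaluate`."""
    return max(grid, key=evaluate)
```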

Finally, we will be training SSESTM on the entirety of our text data, mostly as an exercise in showcasing the algorithm's scalability. We will compare our implementation of SSESTM with an alternative version freely available on GitHub, measuring differences in execution time and memory usage.

## 3.5. Performance Evaluation

In our use-case, sentiment analysis algorithms will output a probability (or a score) indicating how likely a text document is to foreshadow positive (label 1) or negative (label 0) returns over the next 10 days. For unseen text documents, and if labels are evenly distributed, a threshold of 50% is set as the delimiter between text inputs that will lead to positive returns and those leading to negative returns.

But when labels are unbalanced, the majority label skews predictions towards a median probability that lies further away (in either direction) from the ideal 50% threshold. Using the 50% threshold for predicting labels will therefore over-represent the majority label in the predictions.

To address this constraint, we will evaluate our model predictions with threshold-independent metrics (**ROC AUC Score**, **Cumulative Average Precision Score**), which measure how well our sentiment analysis model distinguishes positive returns from negative returns regardless of the threshold set (whether it is 50%, 25%, 75%, etc.).

Once we get these polarity predictions for unseen Boeing text data, **how do we evaluate and exploit those predictions?** Since our text documents are labelled, we could compare the document labels with our predictions. We could also follow Ke et al. (2019) in computing Boeing's average polarity score every day. Thus instead of having to evaluate predictions over 2 million text documents, we would only need to assess around 1,400 predictions. This would be closer to our ultimate goal of forecasting Boeing stock prices.
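Both metrics are available in scikit-learn. The sketch below uses made-up labels and scores, and assumes that the Cumulative Average Precision score corresponds to scikit-learn's `average_precision_score`. It also checks that shifting every score by a constant leaves the ROC AUC unchanged, which is why no classification threshold has to be chosen.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Illustrative labels (1 = positive return) and predicted polarity scores.
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.62, 0.48, 0.55, 0.41, 0.44, 0.52])

auc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)

# Ranking metrics only depend on the ordering of the scores.
auc_shifted = roc_auc_score(y_true, scores - 0.2)
```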

For completeness, we will showcase both results on predicted polarities over text data (**text polarity**) and Boeing's average daily polarity score (**stock polarity**).
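Going from text polarity to stock polarity is a simple aggregation. A minimal sketch with `pandas`, using invented dates, scores and column names:

```python
import pandas as pd

# One row per scored document: publication date and predicted polarity.
preds = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-02", "2020-03-02",
                            "2020-03-03", "2020-03-03", "2020-03-03"]),
    "polarity": [0.60, 0.50, 0.40, 0.45, 0.35],
})

# Stock polarity: average predicted polarity of all documents of the day.
stock_polarity = preds.groupby("date")["polarity"].mean()
```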

# **4. Use-case: Forecasting Boeing Stock Prices**

## 4.1. Results on a real life production environment

We compare five models:

- SSESTM
- Labelled Latent Dirichlet Allocation (LLDA)
- Logistic Regression (LR)
- Naive Bayes (NB)
- XGBoost

From January 2015 to November 2020, we run these sentiment analysis models on our set of Boeing stock returns and textual data, with a sliding window of 12 weeks of training data then used to predict 1 week of polarity scores (testing data). Those predictions are then evaluated through ROC AUC and Cumulative Average Precision (CAP) metrics. Thus, we obtain the following results:

<br>

|  Model | AUC Test | CAP Test |
|:---|---:|---:|
| **SSESTM**  |  51.5% | 57.6%  |
| **Labelled Latent Dirichlet Allocation**  | a%  | a%  |
| **Logistic Regression** | a%  | a%  |
| **Naive Bayes Classifier** | 54.5%  | 60.7%  |
| **XGBoost Classifier** | a%  | a%  |

**Table 1**: Predictive performance for 2 million text polarity predictions (Apr. 2015 - Nov. 2020)

<br>

Next, we compute the average predicted polarity score for each day (e.g. 30 polarity scores for 30 unseen text inputs give us an average polarity of 55%). Those average polarity scores tell us whether Boeing's stock price is expected to rise or fall over the next 10 days (starting tomorrow). We evaluate those predictions once again with ROC AUC and CAP metrics:

<br>

|  Model | AUC Test | CAP Test |
|:---|---:|---:|
| **SSESTM**  |  57.4% | 62.8%  |
| **Labelled Latent Dirichlet Allocation**  | a% | a%  |
| **Logistic Regression** | a%  | a%  |
| **Naive Bayes Classifier** | 63.7%  | 70.0%  |
| **XGBoost Classifier** | a%  | a%  |

**Table 2**: Predictive performance for 1 410 stock sentiment predictions (Apr. 2015 - Nov. 2020)

<br>

We plot both ROC and Precision-Recall Curves over our five models (for both text polarity predictions and stock sentiment predictions):

<img src="images/results_ROC_AUC.png">

**Figure 14**: ROC Curve for text polarity predictions (Apr. 2015 - Nov. 2020)

<br>

<img src="images/results_PRC.png">

**Figure 15**: Precision-Recall Curve for text polarity predictions (Apr. 2015 - Nov. 2020)

<br>

Now if we look at performance over time, [Insert Commentary]:

<br>

<img src="images/results_ts_metrics.png">

**Figure 16**: Performance over time every 3 months for stock sentiment predictions (Apr. 2015 - Nov. 2020)

## 4.2. Analysis of results

[Add evolution of wordclouds for SSESTM]

[Add evolution of average polarity for SSESTM and Naive Bayes]

<img src="images/results_POLARITY.png">

**Figure 17**: Daily time series of predicted stock sentiment (Apr. 2015 - Nov. 2020)

## 4.3. Grid Search for SSESTM

|  $\alpha_{+}$ | $\alpha_{-}$ | $\kappa$ | $\lambda$ | Text AUC | Text CAP | Stock AUC | Stock CAP |
|---:|---:|---:|---:|---:|---:| ---:|---:|
| 0.01  |  0.01 | 1500  | 0.0001 | a% | a% | a% |a% |
| 0.04  |  0.04 | 1500  | 0.0001 | a% | a% | a% |a% |
| 0.01  |  0.01 | 1500  | 0.5 | a% | a%| a% |a% |
| 0.04  |  0.04 | 1500  | 0.5 | a% | a% | a% |a% |
| 0.01  |  0.01 | 5000  | 0.0001 | 51.6% | 57.1% | 56.8% | 64.1% |
| 0.04  |  0.04 | 5000  | 0.0001 | 50.3% | 56.8% | 52.3% | 59.8% |
| 0.01  |  0.01 | 5000  | 0.5 |  51.6% | 57.9% | 56.5% | 62.2% |
| 0.04  |  0.04 | 5000  | 0.5 | 50.2% | 57.0% | 52.1% | 60.0% |

## 4.4. Comparison with alternative implementations

# **5. Conclusion**

- Summary
- Limitations of SSESTM

A number of potential improvements can be made on SSESTM