# **[Medium Article]**

# **1. Introduction**

Over the last decade, the financial services industry has shown a rising interest in applications of natural language processing algorithms. One such application is **sentiment analysis for stock price forecasting**. Although the relationship between stock prices and news articles is not a novel subject, increasing compute power, the democratization of machine learning algorithms and the advent of Big Data have made it more easily accessible for financial institutions. Examples of newsworthy events with repercussions on stock market prices include:

- Corporate scandals: e.g. the Boeing 737 Max crashes, Rio Tinto's destruction of the Australian Aboriginal site at Juukan Gorge, etc.
- Market regime changes: e.g. 2007-08 Financial/Real Estate Crisis, COVID-19 Pandemic

In past economic literature, stock market prices were assumed to incorporate all available news/textual information (Eugene Fama's efficient market hypothesis from 1970). Therefore there were theoretically no gains from mining text information to exploit arbitrage opportunities (e.g. after major news events, buying/selling a stock before the market reacts and adjusts the stock's market price).

While it is a reasonable assumption to make, it is also worth considering a relaxation of Fama's hypothesis: whereas in the long run stocks fully incorporate all the available information in their prices, in the short term stock prices don't always adjust immediately to incoming information. There might even be news events that foreshadow future stock price movements (e.g. solvency issues, disappointing sales). Ideally, this should reveal arbitrage opportunities that can be exploited for financial gain.

From Support Vector Machines to Recurrent Neural Networks, there are many options available to asset managers and data scientists for implementing sentiment analysis on company data. We will explore a novel sentiment analysis approach by researchers Zheng Tracy Ke, Dacheng Xiu and Bryan Kelly: **Supervised Sentiment Extraction via Screening and Topic Modelling** (SSESTM). The method first screens the vocabulary to remove neutral-sounding words. It then runs ordinary least squares (OLS) to learn 2 separate sentiment vocabularies (or topics): one that augurs positive returns and another that foreshadows negative returns. These can then be used to score unseen documents from 0 to 1 (where 1 indicates a high likelihood of future positive returns). From those scored news articles, the authors are able to generate investment recommendations: buy stocks with positive news scores, sell stocks with negative scores.

Where the algorithm distinguishes itself from the competition (such as logistic regression, SVMs or neural networks) is that, while it outputs a probability score of text positivity (akin to a classification algorithm's predicted probabilities), it accepts a continuous target variable as a valid input (whereas typical classification models must be provided with categorical labels). In essence, it is a "classification" algorithm that can accept regression targets.

Following Ke et al. (2019), we attempt to port their algorithm to our own use case: **modeling the positive or negative sentiment generated by news or forum discussions about the Boeing Company** from 2015 to 2020. We will focus on:

1. Algorithmic scalability compared to alternative implementations of SSESTM freely available on GitHub
2. How well Ke's algorithm is capable of handling major market regime changes (e.g. transition from the pre-COVID bullish market regime to the post-COVID environment).

# **2. SSESTM**

<img src="images/1_theory/sestm.png" width=550 height=350>

**Figure 1**: Original from [Ke et al., 2019](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3389884)

Let's assume we have a corpus of $n$ news articles and a lexicon of $m$ word tokens. We can then model our corpus of $n$ documents as a bag-of-words representation: a document-word matrix $D \in \mathbb{R}^{n \times m}$.

From this corpus $D$, the goal of SSESTM is to learn custom sentiment (positive/negative) dictionaries from one's own use-case dataset, without having to rely on pre-existing rule-based dictionaries or purchase expensive solutions from third-party data vendors. This requires two components:

- Select a set of words $\hat S$ that are likely to foreshadow rises/decreases in the phenomenon we are trying to forecast. SSESTM accomplishes this simply through word counts, screening out words such as stopwords (common words such as "the", "a", "thus") that are unlikely to carry any sentiment.
- From this filtered vocabulary list, weight each word by the sentiment it is the likeliest to foreshadow: e.g. "stimulus" for positive returns, "coronavirus" for negative returns. This is done through a supervised topic model (akin to Labelled LDA) that learns 2 distinct topics: one for words that presage positive returns ($O_{+}$) and one for words that presage negative returns ($O_{-}$).

After learning $O_{+}$ and $O_{-}$, we can infer the sentiment score $\hat p$ (positive or negative) of unseen news articles through maximum likelihood estimation.

## 2.1. Screening for excluding neutral words

**2.1.S1**. For each word $1 \leq j \leq m,$ let:

<br>

$$f_{j}=\frac{\# \text { articles including word } j \text { AND having } \operatorname{sgn}(y)=1}{\# \text { articles including word } j}$$

<br>

This first step ranks words based on how often they appear in documents during periods of positive returns. Words with high $f_j$ are likely to augur positive returns, whereas words with low $f_j$ are likely to portend negative returns. Sandwiched in between are neutral words, such as stopwords, which are unlikely to be indicative of stock rises/decreases.
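As a minimal sketch of step S1, the snippet below computes $f_j$ on a toy dense count matrix with NumPy. The function name `positive_cooccurrence` and the toy data are illustrative assumptions, not part of the original paper or of our production code (which works on sparse matrices).

```python
import numpy as np

def positive_cooccurrence(counts, returns):
    """Return (f, k): f[j] is the share of articles containing word j that
    have a positive return, k[j] is the number of articles containing word j."""
    present = (np.asarray(counts) > 0).astype(float)     # n x m, 1 if article i contains word j
    positive = (np.asarray(returns) > 0).astype(float)   # 1 if sgn(y_i) = 1
    k = present.sum(axis=0)                              # articles including word j
    k_pos = present.T @ positive                         # ... that also have a positive return
    f = np.full(k.shape, np.nan)                         # words that never occur stay NaN
    np.divide(k_pos, k, out=f, where=k > 0)
    return f, k

# Toy corpus (hypothetical): 4 articles, 3 words
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [0, 3, 1],
                   [1, 0, 0]])
returns = np.array([0.02, 0.01, -0.03, -0.01])
f, k = positive_cooccurrence(counts, returns)
```

Counts are binarized first because $f_j$ is defined over articles, not over word occurrences: a word repeated ten times in one article still counts once.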

**2.1.S2**. For a proper threshold $\alpha_{+}>0, \alpha_{-}>0,$ and $\kappa>0$ to be determined, construct:

<br>

$$
\widehat{S}=\left(\left\{j: f_{j} \geq 1 / 2+\alpha_{+}\right\} \cup\left\{j: f_{j} \leq 1 / 2-\alpha_{-}\right\}\right) \cap\left\{j: k_{j} \geq \kappa\right\}
$$

<br>

where $k_{j}$ is the total count of articles in which word $j$ appears.

The next step is to exclude neutral words, i.e. vocabulary with middling $f_j$ values. The authors require the user to tune several hyperparameters:

- $\alpha_{+}$, the margin above $1/2$ that $f_j$ must reach for word $j$ to be kept as a positive word
- $\alpha_{-}$, the margin below $1/2$ that $f_j$ must reach for word $j$ to be kept as a negative word
- $\kappa$, the minimum number of articles in which a word must appear (excludes infrequent words)

Two opposite pitfalls must be avoided: excluding too many words drastically limits the size of the vocabulary, whereas excluding too few leaves neutral words in and diminishes the potency of SSESTM. If done well, screening leads to a large dimensionality reduction, which explains how the authors claim that their approach can run on a laptop.
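The screening rule of step S2 can be sketched as a boolean mask, reading the set expression as (positive words or negative words) restricted to sufficiently frequent words. The function name `screen_words` and the toy values are hypothetical.

```python
import numpy as np

def screen_words(f, k, alpha_plus, alpha_minus, kappa):
    """Return the indices of the sentiment-charged words kept by the screening step."""
    f = np.asarray(f, dtype=float)
    k = np.asarray(k)
    charged = (f >= 0.5 + alpha_plus) | (f <= 0.5 - alpha_minus)  # far enough from 1/2
    frequent = k >= kappa                                         # appears in enough articles
    return np.flatnonzero(charged & frequent)

# Toy values (hypothetical): f_j and article counts k_j for 5 words
f = np.array([0.70, 0.51, 0.30, 0.90, 0.49])
k = np.array([500, 900, 40, 3, 120])
selected = screen_words(f, k, alpha_plus=0.05, alpha_minus=0.05, kappa=10)
```

Here words 1 and 4 are dropped as neutral, and word 3 is dropped because it appears in too few articles despite its extreme $f_j$.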

## 2.2. Learning Positive/Negative Vocabulary Sets

After filtering out neutral words, the next step is learning positive/negative term vocabularies from our data. One approach would be to run a LASSO-regularized model on our corpus $D_{[\hat S]}$ (minus the neutral words) with words acting as features. We would obtain positive/negative weights for a decent number of words and therefore our positive/negative vocabularies. The authors prefer to implement a generative model instead, where the joint distribution between words and returns is fully specified and learned from data.

Topic models are a popular family of generative models for the distribution of words over documents. To learn positive/negative dictionaries, the authors construct what they describe as a "2-topic topic model", which models words that augur positive returns as its first topic $\widehat{O}_{+}$ and words that augur negative returns as its second topic $\widehat{O}_{-}$. Their "2-topic topic model" differs somewhat from classical topic models, as the vast majority of topic models are unsupervised and thus don't require labels. In contrast, the authors' model is a form of supervised topic model, which is rarer in the topic modeling literature (e.g. Labelled LDA).

In this supervised topic model with 2 topics, each document $d_{i,[S]}$ is modelled with a multinomial distribution:

<br>

$$ {d}_{i,[S]} \sim \text{Multinomial}\left(s_i, p_i O_{+} + (1 - p_i) O_{-}\right) $$

<br>

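Writing $s_i$ for the total count of sentiment-charged words in article $i$, the expected word frequencies of $d_{i,[S]}$ can thus be written as:

<br>

$$ \mathbb{E}\left(\frac{d_{i,[S]}}{s_i}\right) = p_i O_{+} + (1 - p_i) O_{-} $$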

<br>

We now need to learn $\hat p_i$ and topics/vocabularies $\widehat{O}_{+}$ and $\widehat{O}_{-}$.
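To make the generative model concrete, here is a small simulation sketch: an article with sentiment $p_i$ and $s_i$ sentiment-charged words is drawn from the mixture of the two topics. The topic vectors below are made-up toy values over a 3-word vocabulary, and `simulate_article` is a hypothetical helper.

```python
import numpy as np

# Toy topics over a 3-word vocabulary (hypothetical values, each sums to 1)
O_plus = np.array([0.6, 0.3, 0.1])
O_minus = np.array([0.1, 0.3, 0.6])

def simulate_article(p, s, rng):
    """Draw the word counts of one article: Multinomial(s, p*O_plus + (1-p)*O_minus)."""
    mix = p * O_plus + (1 - p) * O_minus   # expected word frequencies of the article
    return rng.multinomial(s, mix)

rng = np.random.default_rng(0)
d = simulate_article(p=0.8, s=50, rng=rng)   # a fairly positive article of 50 charged words
```

With $p_i = 0.8$ the expected frequencies are $(0.5, 0.3, 0.2)$, so the simulated article leans towards the vocabulary of the positive topic.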

**2.2.S3**. For learning $\hat p_i$, the authors rely on rank statistics, which are known to be more robust to outliers. Sort the returns $\left\{y_{i}\right\}_{i=1}^{n}$ in ascending order. For each $1 \leq i \leq n,$ let:

<br>

$$
\widehat{p}_{i}=\frac{\text { rank of } y_{i} \text { in all returns }}{n}
$$

<br>

This phase is extremely important since it will determine whether words are seen as generating positive or negative sentiment. Each article is ranked by its associated return, thus words that are frequently present in the articles with the best returns are much more likely to be seen as positive sounding. Conversely, words present in the articles with the worst performing returns will be seen as foreshadowing negative sentiment. So it is not necessarily the sign of returns (increase or decrease) that dictates word sentiment, but rank of returns.
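A minimal sketch of step S3 follows. The helper `rank_to_sentiment` is a hypothetical name, and ties are broken by order of appearance, which is an assumption on our part. The toy returns are deliberately all positive to show that the rank, not the sign, drives $\hat p_i$.

```python
import numpy as np

def rank_to_sentiment(returns):
    """Map each return to p_hat = (rank among all returns) / n, ranks starting at 1."""
    returns = np.asarray(returns, dtype=float)
    order = np.argsort(returns, kind="stable")       # indices from worst to best return
    ranks = np.empty(len(returns), dtype=float)
    ranks[order] = np.arange(1, len(returns) + 1)    # worst return gets rank 1
    return ranks / len(returns)

returns = np.array([0.03, 0.01, 0.02, 0.05])         # toy values, all positive
p_hat = rank_to_sentiment(returns)
```

The article with a 1% return gets the lowest score (0.25) even though its return is positive.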

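**2.2.S4**. Now we need to estimate our topics of positive words $O_{+}$ and negative words $O_{-}$. Let $\widehat{D}$ denote the $|\widehat{S}| \times n$ matrix whose $i$-th column holds the word frequencies of article $i$ (its counts over $\widehat{S}$ divided by their total), and let $W$ be the $2 \times n$ matrix of sentiment weights. We wish to express every document of the corpus as a weighted combination of the positive and negative topics:

<br>

$$ \mathbb{E}\left[\widehat{D}\right] = O W $$

<br>

According to the authors, $O$ can be estimated with an ordinary least squares (OLS) regression of $\widehat{D}$ on $\widehat{W}$: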

<br>
$$
\widehat{O}=\widehat{D} \widehat{W}^{\prime}\left(\widehat{W} \widehat{W}^{\prime}\right)^{-1}, \quad \text { where } \quad \widehat{W}=\left[\begin{array}{cccc}
\widehat{p}_{1} & \widehat{p}_{2} & \cdots & \widehat{p}_{n} \\
1-\widehat{p}_{1} & 1-\widehat{p}_{2} & \cdots & 1-\widehat{p}_{n}
\end{array}\right]
$$
<br>

Set negative entries of $\widehat{O}$ to zero and re-normalize each column to have a unit $\ell^{1}$ -norm. We use the same notation $\widehat{O}$ for the resulting matrix. We also use $\widehat{O}_{\pm}$ to denote the two columns of $\widehat{O}=\left[\widehat{O}_{+}, \widehat{O}_{-}\right]$.
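Step S4 can be sketched in a few NumPy lines. This is an illustration under stated assumptions rather than the authors' code: `estimate_topics` is a hypothetical name, the frequency matrix is assumed to be words by articles, and each column of the result is assumed to keep at least one positive entry after clipping.

```python
import numpy as np

def estimate_topics(D_freq, p_hat):
    """OLS estimate of O = [O_plus, O_minus] from word frequencies (words x articles)."""
    W = np.vstack([p_hat, 1.0 - p_hat])               # 2 x n matrix of sentiment weights
    O = D_freq @ W.T @ np.linalg.inv(W @ W.T)         # words x 2 OLS coefficients
    O = np.clip(O, 0.0, None)                         # set negative entries to zero
    return O / O.sum(axis=0, keepdims=True)           # unit l1-norm per column

# Toy check (hypothetical values): frequencies generated exactly by the model
O_true = np.array([[0.6, 0.1],
                   [0.3, 0.3],
                   [0.1, 0.6]])
p_hat = np.array([0.25, 0.5, 0.75, 1.0])
D_freq = O_true @ np.vstack([p_hat, 1.0 - p_hat])
O_hat = estimate_topics(D_freq, p_hat)
```

Since the toy frequencies contain no noise, the regression recovers `O_true`; on real data the clipping and re-normalization matter because OLS coefficients can be negative.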

## 2.3. Inferring Sentiment from Unseen News Articles

To infer the sentiment score $\hat p$ of newer articles, the authors use maximum likelihood estimation. As stated in the introduction, SSESTM can take as targets both categorical values (stock price increases or decreases) and continuous values (continuous returns), all the while outputting the kind of predicted probability typical of classification models (here the polarity of a text document).

**2.3.S5**. Let $\widehat{s}$ be the total count of words from $\widehat{S}$ in the new article. Obtain $\widehat{p}$ by

<br>

$$
\widehat{p}=\arg \max _{p \in[0,1]}\left\{\widehat{s}^{-1} \sum_{j \in \widehat{S}} d_{j} \log \left(p \widehat{O}_{+, j}+(1-p) \widehat{O}_{-, j}\right)+\lambda \log (p(1-p))\right\}
$$

<br>

where $d_{j}, \widehat{O}_{+, j},$ and $\widehat{O}_{-, j}$ are the $j$ th entries of the corresponding vectors, and $\lambda>0$ is a tuning parameter.
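The penalized likelihood of step S5 is a one-dimensional problem, so a bounded scalar optimizer is enough. The sketch below uses `scipy.optimize.minimize_scalar`; the function name `score_article`, the toy topics and the choice to return 0.5 for articles without any sentiment-charged word are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def score_article(d, O_plus, O_minus, lam):
    """Return p_hat maximizing the penalized log-likelihood of word counts d."""
    d = np.asarray(d, dtype=float)
    s = d.sum()
    if s == 0:
        return 0.5                       # only the penalty remains, maximized at p = 1/2
    # Skip absent words and words with zero weight in both topics (avoids log(0))
    mask = (d > 0) & ((O_plus + O_minus) > 0)

    def neg_objective(p):
        mix = p * O_plus[mask] + (1 - p) * O_minus[mask]
        return -(np.sum(d[mask] * np.log(mix)) / s + lam * np.log(p * (1 - p)))

    res = minimize_scalar(neg_objective, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return float(res.x)

# Toy topics and articles (hypothetical values)
O_plus = np.array([0.6, 0.3, 0.1])
O_minus = np.array([0.1, 0.3, 0.6])
p_pos = score_article([5, 1, 0], O_plus, O_minus, lam=0.1)
p_neg = score_article([0, 1, 5], O_plus, O_minus, lam=0.1)
p_empty = score_article([0, 0, 0], O_plus, O_minus, lam=0.1)
```

The penalty term $\lambda \log(p(1-p))$ pulls scores towards 1/2, which keeps short articles with only a handful of charged words from receiving extreme scores.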

# **3. Setup**

Our approach differs from Ke et al. (2019) in two aspects:

- Ke et al. (2019) trained their algorithms on a large Dow Jones news archive (*Dow Jones Newswires Machine Text Feed and Archive*), which spans from January 1989 to December 2012 (with data from February 2004 to July 2017 as their validation set). Our news dataset is more recent as it spans from **January 2015 to November 2020**.
- Their dataset contains approximately 13 million news articles (6.5 million for training and 6.7 million for validation) and is multi-asset. Our use case differs considerably as we will be focusing on a **single stock** (here the Boeing Company).

## 3.1. Data

We will focus on modelling the relationship between returns and text data related to the Boeing Company. Boeing is a decent use case for sentiment analysis, with 2 major news events that had a considerable impact on its appeal to market investors: the 737 Max's flight woes and the COVID-19 pandemic's darkening of the aviation industry. We will use **daily returns** as our variable of interest.

<img src="images/2_data/boeing_stock_price.png">

**Figure 5**: Boeing Stock Prices at Opening Time (9:00), available on Yahoo Finance

For Boeing text inputs, we had previously collected an extremely large corpus of news articles, blog posts and forum posts from data vendors, all stored on a data lake. Through ElasticSearch queries and disambiguation, we were able to obtain roughly 2.7 million text documents:

<img src="images/2_data/boeing_article_compo.png">

**Figure 8**: Meta-information on 2.7 million text data available for Boeing (from Jan. 2015 to Nov. 2020)

Let's check news count over time for Boeing:

<img src="images/2_data/boeing_article_count.png">

**Figure 9**: Daily/bi-weekly news count available for Boeing (from Jan. 2015 to Nov. 2020)

## 3.2. Text Preprocessing

- **Stopwords**: words that express no intrinsic meaning and are most commonly used as grammatical expressions (e.g. the, who, where). We can also add a number of commonly used words (e.g. say, months). Our list combines NLTK stopwords with those from spaCy, for a total of 402 stopwords. Note that these words are highly likely to be seen as neutral during the screening step that keeps only positive/negative words, so if we missed a couple of stopwords, SSESTM will take care of excluding them. Additionally, string expressions representing emails or HTTP links are removed to prevent tokens such as "http" or "www" from appearing in our learned vocabularies. Punctuation is also taken care of through regular expressions.

- **Lemmatization**: there are many redundant word inflections (e.g. plural forms) that can be reduced to a common lemma. Naturally, when dealing with millions of text documents with lengths ranging from a few words to over 100,000 words, this process can be time-consuming. As our library of choice for lemmatization, we opted for spaCy.

- **Bag-of-words representations**: offer a crucial advantage over alternative representations (such as embeddings), which is interpretability. While there is a wide range of options for training embeddings (or relying on pretrained embeddings), bag-of-words representations are much easier to understand and, most importantly, to control: if our algorithm fails to perform, we can simply check the inputs for irregularities (assuming we use mostly interpretable supervised learning models). In practice, to represent our text as a bag of words, we rely on Scikit-Learn (or Dask ML) and its `CountVectorizer` class. It conveniently outputs the bag-of-words matrix as a sparse matrix, which makes it easy to store and compute with.

- **Tokenization**: for simplicity, we restrict ourselves to unigrams.
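The cleaning and tokenization steps above can be sketched without any third-party dependency. This is a simplified stand-in for our pipeline, not the pipeline itself: the regular expressions for links and emails and the tiny stopword set are illustrative assumptions, lemmatization is left out, and the token pattern (alphabetic unigrams of at least 3 letters) is the one passed to `CountVectorizer` in our training script.

```python
import re
from collections import Counter

URL_RE = re.compile(r"(https?://\S+|www\.\S+)", flags=re.IGNORECASE)  # assumed pattern
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")                                # assumed pattern
TOKEN_RE = re.compile(r"\b[a-zA-Z]{3,}\b")    # alphabetic unigrams, 3+ letters
STOPWORDS = {"the", "and", "said", "com", "http", "https"}  # tiny illustrative subset

def bag_of_words(text):
    """Strip links and emails, tokenize into lowercase unigrams, drop stopwords."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    return Counter(t for t in tokens if t not in STOPWORDS)

doc = ("Boeing said the 737 Max is grounded. "
       "Details: https://example.com/news or press@example.com")
bow = bag_of_words(doc)
```

Removing links and emails before tokenizing is what keeps fragments such as "https", "example" or "com" out of the vocabulary.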

## 3.3. Model Implementation

We provide a description of how we implemented **SSESTM**:

- **Removal of neutral words**: there are two steps for removing neutral words. The first is to create a bag-of-words representation of the corpus; the second is to filter out words that empirically appear about equally often during bullish and bearish market regimes. For the bag-of-words representation, we use Scikit-Learn's `CountVectorizer` (which can be substituted with a parallelized variant provided by Dask ML), which returns a sparse matrix (crucial for improving computational speed and limiting RAM usage). For filtering out neutral words, we compute a matrix dot product between the bag-of-words representation and the target vector (positive or negative returns). This gives word frequencies restricted to documents published on days preceding stock price increases/decreases, which lets us find and exclude neutral words. This proves to be extremely memory-efficient and fast, especially when dealing with representations of over 2 million documents;

- **Learning sentiment topics/vocabularies**: for this step we need to implement rank statistics followed by an ordinary least squares (OLS) regression, both of which can be done simply with `numpy` functions;

- **Scoring out-of-sample documents**: for maximum likelihood estimation (MLE), we rely on `scipy` optimization functions.

## 3.4. Learning Strategy

Naturally, training our algorithm on the entire dataset would prevent us from capturing changes in word polarity over time (e.g. how investors and news outlets soured on Boeing's 737 Max after multiple incidents).

In the original paper from Ke et al. (2019), the authors considered Dow Jones' stocks spanning from 2004 to 2017 and updated their model every year (on a rolling window basis), for a total of 14 training sets. Word polarity can change over time from highs (e.g. new announcements concerning Boeing's new 737 Max, Trump tax cuts) to lows (e.g. 737 Max crashes, COVID-19), so it is important to factor in evolving market regimes. For our results, we will train SSESTM on a **24-week (or approximately 6-month) sliding window**. After training on 1 sample, we **shift the training window by 1 week**, starting from January 5th 2015 and ending in October 2020.
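The sliding-window schedule can be sketched with the standard library alone. The helper name `sliding_windows`, the convention that the end date is exclusive, and the short date range used in the example are assumptions made for illustration.

```python
from datetime import date, timedelta

def sliding_windows(first_start, last_end, window_weeks=24, step_weeks=1):
    """Return (start, end) training windows, with `end` exclusive,
    shifted by `step_weeks` until a window would run past `last_end`."""
    window = timedelta(weeks=window_weeks)
    step = timedelta(weeks=step_weeks)
    windows = []
    start = first_start
    while start + window <= last_end:
        windows.append((start, start + window))
        start += step
    return windows

# Short illustrative range starting on Monday, January 5th 2015
windows = sliding_windows(date(2015, 1, 5), date(2015, 7, 6))
```

Each consecutive pair of windows shares 23 of its 24 weeks, so learned vocabularies evolve smoothly from one training sample to the next.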

Choosing ideal hyperparameters is a much trickier decision. Let's start with the issue of excluding neutral words. The authors give users the option to set a lower/upper bound around a 50% threshold of word appearance in text documents published during positive market regimes. E.g. in our Boeing use case, we could exclude words with appearance values between 48% and 52%. But this assumes a balanced distribution for our prediction variable (returns), which is unlikely to hold in practice. The following chart illustrates changes in word frequency over time:

<img src="images/4_results/vocab_dynamics.png">

**Figure 13**: Word co-occurrence with positive returns

An easy temporary fix would be to dynamically set thresholds for excluding neutral words: during bullish markets, thresholds would be increased from 50% and conversely decreased during bearish markets. For each training set, we will take the median polarity as our threshold (e.g. if "boeing" has a polarity of 0.43 and this is the median polarity of our vocabulary, then 43% is our threshold).
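The median-centred variant of the screening rule can be sketched as follows. The function name `dynamic_screen` and the toy polarities and counts are hypothetical; only the idea of replacing the fixed 50% centre by the median polarity of the training window comes from the text above.

```python
import numpy as np

def dynamic_screen(polarity, counts, alpha_plus, alpha_minus, kappa):
    """Screen words around the median polarity of the window instead of a fixed 1/2."""
    polarity = np.asarray(polarity, dtype=float)
    centre = np.nanmedian(polarity)                       # dynamic threshold for this window
    charged = (polarity >= centre + alpha_plus) | (polarity <= centre - alpha_minus)
    frequent = np.asarray(counts) >= kappa
    return centre, np.flatnonzero(charged & frequent)

# Toy bearish window (hypothetical values): most words co-occur with negative returns
polarity = np.array([0.40, 0.43, 0.47, 0.35, 0.52])
counts = np.array([2000, 5000, 1800, 1600, 900])
centre, kept = dynamic_screen(polarity, counts, alpha_plus=0.01, alpha_minus=0.01, kappa=1500)
```

With a fixed 50% centre, the word with polarity 0.47 would be labelled negative; relative to the window's median of 0.43 it lands on the positive side instead.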

In our use-case, we have retained the following hyperparameters (the $\lambda$ hyperparameter is out-of-scope since we are focusing on modelling word sentiment):

- $\alpha_{+} = \alpha_{-} = 0.01$
- $\kappa = 1500$

Finally, we will compare our implementation of SSESTM with alternative versions freely available on GitHub, measuring differences in execution time and memory usage.

# **4. Results**

## 4.1. Algorithmic scalability

Our working environment is an Intel Xeon with 8 cores and 64GB of RAM. We ran both our implementation of SSESTM and an alternative available on GitHub from [mrepetto94](https://github.com/mrepetto94/sentiment_modelling) on a 50,000-document subset of our 2.7 million news articles. Training for our implementation took **23 seconds**, whereas the GitHub alternative took **12 minutes and 34 seconds**.

And as a bonus, we ran our implementation of SSESTM on the entirety of the 2.7 million text documents. In this case, training took **19 minutes and 55 seconds**.

## 4.2. Word probability distributions

After learning positive $O_{+}$ and negative $O_{-}$ scores, similarly to Ke et al. (2019) we compute the average sentiment tone: 

<br>

$$T_{d} = O_{+,d} - O_{-,d}$$

<br>

for each word $d$. We compute the average sentiment tone for our pre-COVID training samples and display, through wordclouds, both the words with the highest tone (words associated with positive returns) and those with the lowest tone (words associated with negative returns).
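As a small sketch of how the tone is used to pick words for the wordclouds, the snippet below ranks a toy vocabulary by $O_{+} - O_{-}$. The helper `rank_by_tone`, the four words and their weights are made up for illustration and are not results from our experiments.

```python
import numpy as np

def rank_by_tone(words, O_plus, O_minus, top=2):
    """Return the tone of each word and the `top` most positive / most negative words."""
    tone = np.asarray(O_plus) - np.asarray(O_minus)
    order = np.argsort(tone)                               # ascending: most negative first
    most_negative = [words[i] for i in order[:top]]
    most_positive = [words[i] for i in order[::-1][:top]]
    return tone, most_positive, most_negative

# Toy vocabulary and topic weights (hypothetical values)
words = ["contract", "nasa", "crash", "grounding"]
O_plus = np.array([0.5, 0.3, 0.1, 0.1])
O_minus = np.array([0.1, 0.1, 0.5, 0.3])
tone, most_positive, most_negative = rank_by_tone(words, O_plus, O_minus)
```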

In total, we trained SSESTM on 300 **24-week training samples** spanning from January 2015 to October 2020. We plot a few wordclouds of those training samples to illustrate how investor sentiment over Boeing changed over time:

<img src="images/4_results/wordcloud_0.png">

Overall, the first semester of 2015 was a positive environment for Boeing. At the start of the year, news events favoring positive returns include NASA's contracts for Boeing's reusable CST-100 Starliner spacecraft and its SpaceX competitor, Crew Dragon.

<img src="images/4_results/wordcloud_1.png">

The second semester of 2016 saw competition between Lockheed Martin and Boeing over the replacement of the United States Navy's aging fleet of Boeing F-18 Super Hornet jets, with Boeing offering a revamp of its Super Hornets whereas Lockheed Martin was building a stealthier replacement with the F-35 Lightning II. Boeing eventually lost out to its rival Lockheed Martin. Following Donald Trump's election in November 2016, the president-elect tweeted a salvo of critical comments over the F-35 Lightning's spiraling costs, which most likely helped boost Boeing's stock price.

<img src="images/4_results/wordcloud_3.png">

Starting on the 22nd of January 2018, Trump sparked a trade war between the US and China with a salvo of unilateral tariffs on imported Chinese goods (including solar panels, aluminium, washing machines and steel). The Chinese government retaliated with tariffs on US goods. Rising geopolitical tensions between the world's two biggest superpowers increased the risk of military conflict and the potential for military aviation contracts between Boeing and US allies in the region. Conversely, negative topics include the crash of a Boeing 737-201 in Cuba on May 18th 2018.

<img src="images/4_results/wordcloud_4.png">

Naturally, the major events that damaged Boeing's reputation and investor confidence were the two accidents related to the Boeing 737 Max's faulty MCAS and the aircraft's subsequent grounding by aviation authorities all around the globe.

Thus to summarize, some subjects that led to **positive returns** (optimistic investor behaviour) include:

- Geopolitical tensions in South East Asia ("china", "tariff") favoring arms sales for Boeing ("military", "buy", "contract") with references to US leaders or state officials ("trump", "matthis")
- Opportunities for space exploration and collaboration with both NASA and the private sector ("space", "spacex", "nasa"). For example, towards the end of 2014, NASA signed contracts with Boeing (Starliner) and SpaceX (Crew Dragon) for reusable spacecraft to resupply the International Space Station.

Subjects that led to **negative returns** (pessimistic investor behaviour) include:

- Boeing aircraft accidents ("crash", "safety", "kill", "fatal", "tragedy") including the reputation-damaging 737 Max crashes ("max", "mcas", "sensor", "faa", "ethiopia", "ethiopian", "ababa", "indonesia") which led to the aircraft's grounding from March 2019 to November 2020 ("ground", "fleet"). Additional accidents include the accidental shooting down of a Boeing commercial flight by the Iranian military after the assassination of Qasem Soleimani in January 2020 ("iran") and a Cubana de Aviación crash of a Boeing in 2018 ("cuba", "havana")
- Boeing losing out against rivals Lockheed Martin over replacing the aging F-18A Super Hornet in the US Navy ("lockheed", "fighter", "beat", "deal") in 2017 (despite supporting tweets from Donald Trump)

## 4.3. SSESTM during the COVID-19 pandemic

We turn our attention to the COVID-19 pandemic era, which saw the stock market suffer considerable losses. Boeing did claw back some of its losses in the aftermath of the initial downturn. If we look at the COVID-19 pandemic, **from the initial downturn to reassurances of stimulus measures from US Congress (the CARES Act), wordclouds paint a more nuanced and complex picture**:

- On the positive spectrum, keywords such as "bailout", "bankrupt", "taxpayer", "bail" and "spend" indicate that investors were confident that stimulus measures (via loans) from US Congress to ailing businesses would help brighten Boeing's outlook.
- On the negative spectrum, lockdown restrictions from COVID-19 brought air travel to a halt. Those restrictions crippled the commercial airline industry ("southwest", "airline"), and also hampered Boeing's attempts to resume flights for its reputation-stricken Boeing 737 Max ("max") after two deadly accidents ("ethiopian", "crash", "faa", "debris").

<img src="images/4_results/wordcloud_6.png">

Initially, there was optimism from investors that governmental agencies would finally reauthorize commercial flights for the Boeing 737 Max after the entire fleet had been grounded for almost a year. This explains why subjects related to the 737 Max were the high point of news/discussions over Boeing. The most negatively perceived event was the accidental shooting down of Ukraine International Airlines Flight 752 by the Iranian military in January 2020 ("ukrainian", "debris", "iran", "teheran", "missile").

<br>

<img src="images/4_results/wordcloud_covid_3.png">

But then the COVID-19 pandemic hit and, with restrictions on international travel, investor confidence in the airline industry plummeted ("coronavirus", "outbreak"). Reassurances from US Congress and the Federal Reserve ("bailout", "bankrupt") and emergency layoffs ("worker", "layoff", "cut") helped alleviate investor pessimism.

<img src="images/4_results/wordcloud_covid_5.png">

<img src="images/4_results/wordcloud_covid_7.png">

<img src="images/4_results/wordcloud_covid_8.png">

<br>

# **5. Conclusion**

With our series of results, we applied SSESTM to a large internal corpus of text documents about Boeing, identifying news topics that could foreshadow either positive returns (civilian/military contracts, space exploration, governmental bailouts) or negative returns (crashes and grounding of the 737 Max, COVID-19 outbreak). Rather than looking at the sign of returns (positive or negative) to determine word sentiment, SSESTM ranks returns from best to worst performers and assumes that words found in text documents published on days with the best performing returns generate positive sentiment (and vice versa for the worst performers). SSESTM is therefore a promising algorithm, giving users a visual description of the events that are most likely to explain positive or negative returns.

In our tests, a training size of 24 weeks (or approximately 6 months) was found to be sufficient for detecting market regime changes, such as moving from a pre-COVID to a post-COVID environment (how "coronavirus" was initially seen as a negative subject matter before transitioning to a more ambiguous state due to Boeing clawing back some of its stock market losses). Had our training samples been too small (or too unbalanced), we would stumble into environments where returns are either entirely positive or entirely negative. In the former case, the worst performing returns would still be positive; in the latter, the best performers would still be negative. If the training size is chosen haphazardly, the modelling choice of determining word sentiment based on the ranking of stock returns will bias results. Thus, special care needs to be taken in defining the size of training samples. Other hyperparameters that should be taken into consideration are the algorithm's severity in screening neutral words and the horizon of returns. For the latter, we chose daily returns to keep things simple, but we could have used different types of returns.

Finally, a possible extension of SSESTM would be for words in both positive and negative return topics to be grouped together, forming coherent subtopics which would simplify the interpretation of wordclouds (e.g. "max.crash.ethiopian.lion.mcas" or "covid.pandemic.bailout.tax.bankruptcy").

***
Sensitivity to the training window: need to ensure that the distribution of returns adequately captures the full picture
- Ranking of returns
- The strategy aspect (predicting text polarity for stock picking) was removed from the study
- Useful: having explanations = having the news that explains the returns
- Hyperparameters:
    - Training size
    - Word filtering
    - Horizon over which returns are computed = determines the success of the algorithm
- If the price only goes up and "covid" appears => "covid" => positive return
- Extension: words that belong to the same topic, hence a topic modelling task = grouping words together = simplifying the reading of results

#  4_train_for_rapport.py

In [None]:
import os
import sys
import multiprocessing
import numpy as np
import pandas as pd
import dask.bag as db
import dask.dataframe as dd
from nltk.corpus import stopwords
from distributed import Client
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords
from dask_ml.feature_extraction.text import CountVectorizer as DaskMLCountVectorizer
from datetime import datetime
SEED = 2077
N_JOBS = multiprocessing.cpu_count() - 1


# Models & Hyperparameters
import shap
import tomotopy as tp
import lightgbm as lgb
from sestm import SESTM
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import roc_auc_score


# Prepare stopwords list
nltk_stopwords = list(stopwords.words('english'))
spacy_stopwords = list(spacy_stopwords)
STOPWORDS = list(set(nltk_stopwords + spacy_stopwords))
STOPWORDS += ['january', 'february', 'march', 'april', 'june', 'july', 'august',
              'september', 'october', 'november', 'december', 'com', 'http',
              'https', 'said', 'like', 'new', 'year', 'years', 'news', 'boeing']



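def get_vocab_per_regime(train, test):
    """Vectorize train/test texts and compute word polarity on the training set.

    Returns the sparse bag-of-words matrices for train and test, plus a
    DataFrame with one row per word: 'word', 'polarity' (co-occurrence
    with positive returns) and 'count' (total occurrences in train).
    """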
    # Run Dask ML
    train_corpus = db.from_sequence(train['text'].values, npartitions=10)
    test_corpus = db.from_sequence(test['text'].values, npartitions=10)
    vectorizer = DaskMLCountVectorizer(token_pattern = r"\b[a-zA-Z]{3,}\b",
                                       stop_words=STOPWORDS,
                                       ngram_range=(1,1),
                                       max_features=5000)
    X_train = vectorizer.fit_transform(train_corpus).compute()
    X_test = vectorizer.transform(test_corpus).compute()
    
    # Get Word Counts and Frequencies
    word_counts = np.asarray(X_train.sum(axis=0)).ravel()
    words = vectorizer.get_feature_names()
    
    # Get co-occurrences with positive target.
    # Counts are binarized so that each article is counted once per word,
    # matching the definition of f_j (# articles including word j).
    X_bin = (X_train > 0).astype(int)
    y_target = (train['return'].values > 0.0).astype(int)
    docs_with_word_and_target = np.asarray(X_bin.T.dot(y_target)).ravel()
    docs_with_word = np.asarray(X_bin.sum(axis=0)).ravel()
    freq = docs_with_word_and_target / docs_with_word
    
    # Get Results
    freq_df = pd.Series(freq, index=words).to_frame(name='polarity')
    freq_df['count'] = word_counts
    freq_df = freq_df.reset_index()
    freq_df = freq_df.rename(columns={'index': 'word'})
    return X_train, X_test, freq_df



if __name__ == '__main__':
    
    # python 4_train_for_rapport.py
    # '/home/ubuntu/internal_omicron/cppib_data/boeing-nlp-prep-201501-202011.parquet'
    # '/home/ubuntu/internal_omicron/cppib_data/BA.csv'
    # '2015-01-05'
    # '2020-11-06'
    # 'LR'
    # 0.0
    # 0.0
    # 0
    # 0.5
    
    args = sys.argv
    
    if len(args) == 10:
        text_data_path, return_data_path = args[1], args[2]
        start_date, end_date = args[3], args[4]
        model = args[5]
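        # Command-line arguments arrive as strings: cast hyperparameters to numbers
        SESTM_ALPHA_PLUS, SESTM_ALPHA_MINUS = float(args[6]), float(args[7])
        SESTM_KAPPA, SESTM_LAMBDA = int(args[8]), float(args[9])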
        
        # Create output folder
        if model == 'SESTM':
            output_folder = 'trial_{}_{}_{}_{}'.format(SESTM_ALPHA_PLUS, SESTM_ALPHA_MINUS, SESTM_KAPPA, SESTM_LAMBDA)
        else:
            output_folder = 'trial_02'
        
        os.makedirs('../results/{}/{}/'.format(model, output_folder), exist_ok=True)
        os.makedirs('../results/{}/{}/preds/'.format(model, output_folder), exist_ok=True)
        os.makedirs('../results/{}/{}/topics/'.format(model, output_folder), exist_ok=True)
        print('Created folder: ../results/{}/{}/'.format(model, output_folder))
        
        ### 1 -- TARGET DATA
        # Load Return Data
        print('TARGET DATA\n')
        horizon = 10
        target = 'Open'
        target_data = pd.read_csv(return_data_path)
        target_data = target_data.rename(columns={'Date': 'date'})
        target_data['date'] = pd.to_datetime(target_data['date'])
        target_data = target_data.sort_values('date')
        
        # Create target: Forward Cumulative Returns
        # (sum of the `horizon` daily returns from day t+2 through day t+horizon+1:
        # are they + or -?)
        target_data['return'] = target_data[target] \
                                            .pct_change() \
                                            .rolling(horizon) \
                                            .sum() \
                                            .shift(-horizon - 1)
        target_data['date'] = target_data['date'].astype(str)
        target_dict = dict(zip(target_data['date'], target_data['return']))
        
        ### 2 -- TEXT DATA
        # Open Text Data and Add Return Data
        print('TEXT DATA')
        text_data = pd.read_parquet(text_data_path)
        print('{}\n'.format(len(text_data)))
        text_data['return'] = text_data['date'].map(target_dict)
        text_data['date'] = pd.to_datetime(text_data['date'])
        text_data = text_data.dropna(subset=['return'])
        text_data = text_data.set_index('date')

        ### 3 -- MODEL TRAINING & PREDICTIONS
        # Prepare Dates
        dates_bw_start = pd.date_range(start=start_date, end=end_date, freq='W-MON')
        dates_bw_end = pd.date_range(start=start_date, end=end_date, freq='W-FRI')
        dates = [(d1, d2) for d1, d2 in zip(dates_bw_start, dates_bw_end)]       
        
        # Loop over periods to obtain vocabulary sets
        print('MODEL')
        
        for i, _ in enumerate(dates):

            # Get Time Period (12 weeks = 3 months). Ideal: 6 months (24 weeks)
            train_period = dates[i:i+12]

            if (len(train_period) == 12) and (train_period[-1][-1] != dates[-1][-1]):

                test_period = dates[i+12]
                start = train_period[0][0]
                end = train_period[-1][-1]
                
                # Get Training/Testing Data
                temp_train_data = text_data.loc[start:end, :]
                temp_test_data = text_data.loc[test_period[0]:test_period[1], :]
                
                X_train, X_test, temp_vocab = get_vocab_per_regime(temp_train_data, temp_test_data)
                
                y_train = temp_train_data['return'].values
                y_test = temp_test_data['return'].values
                    
                temp_vocab['train_start'] = start
                temp_vocab['train_end'] = end
                temp_vocab['test_start'] = test_period[0]
                temp_vocab['test_end'] =  test_period[1]
                
                progress_string = '{} - Train: {} ==> {}\tTest: {} ==> {}'.format(
                    str(datetime.now()).rsplit('.')[0],
                    start.date(),
                    end.date(),
                    test_period[0].date(),
                    test_period[1].date()
                )
                print(progress_string)
                
                vocab_folder = '../results/vocab/trial_02/'
                os.makedirs(vocab_folder, exist_ok=True)
                temp_vocab.to_csv('{}{}_vocab_12W.csv'.format(
                    vocab_folder, test_period[0].date()
                ))
                
                
                # Launch Model
                if model == 'SESTM':
                    
                    temp_threshold = temp_vocab['polarity'].mean()
                    
                    # Model Training
                    if temp_vocab['polarity'].nunique() == 1:
                        print('WARNING: UNIQUE VALUE. NO EXCLUSION.')
                        sestm_model = SESTM(
                            0.0,
                            0.0,
                            int(SESTM_KAPPA),
                            float(SESTM_LAMBDA),
                            threshold=temp_threshold,
                            vocab=temp_vocab['word'].values,
                            stopwords=STOPWORDS
                        )
                    else:
                        sestm_model = SESTM(
                            float(SESTM_ALPHA_PLUS),
                            float(SESTM_ALPHA_MINUS),
                            int(SESTM_KAPPA),
                            float(SESTM_LAMBDA),
                            threshold=temp_threshold,
                            vocab=temp_vocab['word'].values,
                            stopwords=STOPWORDS
                        )                        
                    
                    sestm_model.fit(temp_train_data)
                    
                    # Predictions
                    temp_pred_test = sestm_model.predict(temp_test_data)
                    
                    # Topics
                    temp_O_hat = sestm_model._O_hat_df.T
                    temp_O_hat['train_start'] = start
                    temp_O_hat['train_end'] = end
                    temp_O_hat['test_start'] = test_period[0]
                    temp_O_hat['test_end'] =  test_period[1]
                    
                    temp_O_hat.to_csv('../results/{}/{}/topics/{}_topics.csv'.format(
                        model, output_folder, test_period[0].date()
                    ))
                    
                    
                elif model == 'NB':
                    
                    # Predictions
                    y_train = (y_train > 0.0).astype(int)
                    nb_model = ComplementNB()
                    nb_model = nb_model.fit(X=X_train, y=y_train)
                    temp_pred_test = nb_model.predict_proba(X_test)[:,1]
                    # print(roc_auc_score((y_test > 0.0).astype(int), temp_pred_test))
                
                elif model == 'LR':
                    
                    # Predictions
                    y_train = (y_train > 0.0).astype(int)
                    lr_model = LogisticRegression(C=100, class_weight='balanced',
                                                  solver='lbfgs', random_state=SEED)
                    lr_model = lr_model.fit(X=X_train, y=y_train)
                    temp_pred_test = lr_model.predict_proba(X_test)[:,1]              
                    
                    # Explainer
                    # explainer = shap.Explainer(lr_model, X_train)
                    # shap_values = explainer.shap_values(X_test).mean(axis=0)
                    # shap_values = pd.Series(shap_values, index=temp_vocab['word'].values)
                    
                    # shap_values.to_csv('../results/{}/{}/topics/{}_topics.csv'.format(
                    #     model, output_folder, test_period[0].date()
                    # ))
                    
                elif model == 'LGBM':
                    
                    lgb_params = {
                        'objective': 'binary',
                        'boosting': 'gbdt',
                        'num_iterations': 100,
                        'num_threads': N_JOBS,
                        'seed': SEED,
                        'max_depth': 6,
                        'lambda_l1': 0.001,
                        'lambda_l2': 0.001,
                        'num_leaves': 100,
                        'verbosity': 0
                    }
                    
                    # Predictions (binarise the target, as for NB and LR,
                    # since the objective is 'binary')
                    lgb_train = lgb.Dataset(X_train.astype(np.float32),
                                            label=(y_train > 0.0).astype(int))
                    # lgb_test = lgb.Dataset(X_test.astype(np.float32), label=y_test)
                    lgb_model = lgb.train(lgb_params, lgb_train, verbose_eval=50)
                    temp_pred_test = lgb_model.predict(X_test.astype(np.float32))
                
                elif model == 'LLDA':
                    
                    # Prepare Data
                    frequent_words = temp_vocab['word'].values.tolist()
                    # print(len(frequent_words))
                    corpus_train = tp.utils.Corpus(stopwords=STOPWORDS)
                    temp_train_data['return'] = (temp_train_data['return'] > 0.0).astype(int)
                    temp_train_data = temp_train_data.sort_values('return')
                    
                    for doc in temp_train_data.itertuples():
                        corpus_train.add_doc(words=doc[1].split(' '), labels=[str(doc[2])])
                    
                    # Train LLDA
                    llda_model = tp.LLDAModel(corpus=corpus_train, seed=SEED)
                    llda_model.train(100)
                    
                    # Predictions
                    test_corpus = [
                        llda_model.make_doc(doc.split(' ')) for doc in temp_test_data['text']
                    ]
                    y_pred, _ = llda_model.infer(test_corpus, iter=100, workers=N_JOBS)

                    if temp_train_data['return'].mean() == 1.0:
                        y_pred_df = pd.DataFrame(y_pred, columns=['POS'])
                        y_pred_df['NEG'] = 1.0 - y_pred_df['POS']
                        
                    elif temp_train_data['return'].mean() == 0.0:
                        y_pred_df = pd.DataFrame(y_pred, columns=['NEG'])
                        y_pred_df['POS'] = 1.0 - y_pred_df['NEG']
                        
                    else:
                        y_pred_df = pd.DataFrame(y_pred, columns=['NEG', 'POS'])
                    
                    temp_pred_test = y_pred_df['POS'].values
                    
                    # Topics
                    topic_word_mat = np.stack([llda_model.get_topic_word_dist(k) for k in range(llda_model.k)])
                    vocab = np.array(llda_model.used_vocabs)
                    
                    if temp_train_data['return'].mean() == 1.0:
                        print('UNIQUE LABEL: 1.0')
                        llda_topics = pd.DataFrame(topic_word_mat.T, columns=['pos'], index=vocab)
                        
                    elif temp_train_data['return'].mean() == 0.0:
                        print('UNIQUE LABEL: 0.0')
                        llda_topics = pd.DataFrame(topic_word_mat.T, columns=['neg'], index=vocab)
                        
                    else:
                        llda_topics = pd.DataFrame(topic_word_mat.T, columns=['neg', 'pos'], index=vocab)
                    
                    llda_topics['train_start'] = start
                    llda_topics['train_end'] = end
                    llda_topics['test_start'] = test_period[0]
                    llda_topics['test_end'] =  test_period[1]
                    
                    llda_topics.to_csv('../results/{}/{}/topics/{}_topics.csv'.format(
                        model, output_folder, test_period[0].date()
                    ))
                    
                    
                else:
                    raise ValueError('Erroneous argument for model provided. Must be SESTM, NB, LR, LGBM or LLDA.')
                
                # Save Predictions
                temp_pred_test = pd.DataFrame({
                    'pred': temp_pred_test, 'dates': temp_test_data.index.get_level_values(0),
                    'true': temp_test_data['return'].values
                })
                
                temp_pred_test.to_csv('../results/{}/{}/preds/{}_preds.csv'.format(
                    model, output_folder, test_period[0].date()
                ))
            else:
                print('No complete train/test window left. Stopping.')
                break
        
        # print('')
        # print('ROC AUC Test\t:', roc_auc_score(pred_test['true'], pred_test['pred']), '\n')
        
    else:
        print('Irregular number of arguments. Requires 9 not {}.'.format(len(args)-1))

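The forward-return target built in step 1 of the script can be checked on a toy price series. The sketch below is a minimal illustration, not the script itself: the helper name `forward_cumulative_return` and the prices are made up, and it assumes the same `pct_change` / `rolling` / `shift` chain as the listing.

```python
import numpy as np
import pandas as pd


def forward_cumulative_return(prices, horizon):
    """Sum of the `horizon` simple daily returns from day t+2 to day t+horizon+1.

    Hypothetical helper mirroring the pct_change/rolling/shift chain above.
    """
    daily = prices.pct_change()
    return daily.rolling(horizon).sum().shift(-horizon - 1)


# Made-up prices whose daily returns are 1%, 2%, 3%, 4% and 5%
daily_returns = np.array([0.01, 0.02, 0.03, 0.04, 0.05])
prices = pd.Series(100 * np.concatenate([[1.0], np.cumprod(1 + daily_returns)]))

target = forward_cumulative_return(prices, horizon=2)
print(target)
```

With `horizon=2`, the value at day 0 adds the returns of days 2 and 3 (2% + 3%), so the return of day 1 is skipped. The last `horizon + 1` rows are NaN because their forward window runs past the end of the series.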

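The rolling train/test split used in step 3 can also be written as a small generator. This is a sketch under assumptions: `walk_forward_windows` is a hypothetical name, and it presumes `start_date` falls on a Monday so that the Monday and Friday ranges pair up week by week.

```python
import pandas as pd


def walk_forward_windows(start_date, end_date, n_train_weeks=12):
    """Yield (train_start, train_end, test_start, test_end) tuples.

    Hypothetical helper: each window trains on `n_train_weeks` Monday-to-Friday
    weeks and tests on the week that follows.
    """
    mondays = pd.date_range(start=start_date, end=end_date, freq='W-MON')
    fridays = pd.date_range(start=start_date, end=end_date, freq='W-FRI')
    weeks = list(zip(mondays, fridays))
    for i in range(len(weeks) - n_train_weeks):
        train = weeks[i:i + n_train_weeks]
        test = weeks[i + n_train_weeks]
        yield train[0][0], train[-1][1], test[0], test[1]


# 16 full weeks give 4 windows of 12 training weeks + 1 test week
windows = list(walk_forward_windows('2015-01-05', '2015-04-24'))
for train_start, train_end, test_start, test_end in windows:
    print(train_start.date(), train_end.date(), test_start.date(), test_end.date())
```

Every test week starts after its training window ends, so no test document is seen during training.
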
# sestm.py

import pandas as pd
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.stem.porter import PorterStemmer
import re

class SESTM:
    '''
    Implements a new text-mining methodology that extracts sentiment information from news
    articles to predict asset returns.
    The algorithm is detailed in: https://bfi.uchicago.edu/wp-content/uploads/BFI_WP_201969.pdf
    '''

    def __init__(self, alpha_plus, alpha_minus, kappa, lambda_reg, stopwords,
                 threshold=0.5, vocab=None):
        self.df = None
        self.alpha_plus = alpha_plus
        self.alpha_minus = alpha_minus
        self.kappa = kappa
        self.lambda_reg = lambda_reg
        self._O_hat_df = None
        self.word_count = None
        self.cv = None
        self._fj_vector = None
        self._word_count_pred = None
        self._opt_res = None
        self.stopwords = stopwords
        self.vocab = vocab
        self.threshold = threshold
        

    def _compute_word_count(self, df):
        '''computes word count of the dataframe df.'''
        self.df = df
        self.cv = CountVectorizer(
            stop_words=self.stopwords,
            token_pattern=r"\b[a-zA-Z]{3,}\b",
            ngram_range=(1, 1),
            vocabulary=self.vocab
        )
        self.cv.fit(self.df['text'].values)
        self.word_count = self.cv.transform(self.df['text'].values)


    def print_word_count_df(self):
        return pd.DataFrame.sparse.from_spmatrix(self.word_count , columns=self.cv.get_feature_names())


    def _compute_fj(self):
        '''Computes the f_j vector: for each word, the fraction of documents containing it that have a positive return.'''
        num = (self.word_count != 0).astype(int).T.dot((self.df['return'].values > 0))
        den = np.array((self.word_count != 0).astype(int).sum(axis=0))
        return num / den

    def _get_non_neutral_words(self):
        ''' remove neutral words from the dictionary'''
        mask_pos = self._fj_vector >= (self.threshold + self.alpha_plus)
        mask_neg = self._fj_vector < (self.threshold - self.alpha_minus)
        mask_freq = np.array(self.word_count.sum(axis=0))[0] >= self.kappa
        mask_filtered_tokens = np.logical_and(np.logical_or(mask_pos, mask_neg), mask_freq)
        return mask_filtered_tokens.ravel()

    def _compute_p_hat(self):
        '''Computes p_hat: the normalised rank of each document's return, a more stable proxy than the raw return.'''
        return self.df['return'].rank().values / len(self.df['return'])


    def _minus_log_likelihood(self, p):
        '''Computes the negative penalised log-likelihood of the multinomial model for a sentiment score p.'''

        subset_of_training = self._word_count_pred.columns[(self._word_count_pred != 0).values[0]]
        word_count = self._word_count_pred[subset_of_training].T.values
        log_term = np.log(p * self._O_hat_df[subset_of_training].loc['pos', :].values.reshape(-1, 1) + \
                          (1 - p) * self._O_hat_df[subset_of_training].loc['neg', :].values.reshape(-1, 1))
        log_term[log_term == -np.inf] = 0
        log_term[log_term == np.inf] = 0

        reg_term = self.lambda_reg * np.log(p * (1 - p))
        term = word_count * log_term + reg_term
        return - np.sum(term)

    def fit(self, df):
        '''train the model'''
        self._compute_word_count(df)
        self._fj_vector = self._compute_fj()
        mask_filtered_tokens = self._get_non_neutral_words()
        S_hat = self.word_count[:,mask_filtered_tokens]
        p_hat = self._compute_p_hat()
        W_hat = np.array([p_hat, 1 - p_hat])
        # Normalise counts to per-document frequencies. Documents left empty by the
        # word filtering have a zero total, so the division can yield inf weights.
        D_hat = S_hat.T.multiply(1/np.array(S_hat.sum(axis=1)).reshape(1,-1)[0])
        O_hat = np.dot(D_hat.dot(W_hat.T), np.linalg.inv(np.dot(W_hat, W_hat.T)))
        O_hat_plus = O_hat.clip(min=0)[:, 0] / O_hat.clip(min=0)[:, 0].sum()
        O_hat_minus = O_hat.clip(min=0)[:, 1] / O_hat.clip(min=0)[:, 1].sum()
        O_hat = np.array([O_hat_plus, O_hat_minus])
        self._O_hat_df = self._format_O_hat_df(O_hat, mask_filtered_tokens)

    def _format_O_hat_df(self, O_hat, mask_filtered_tokens):
        columns = np.array(self.cv.get_feature_names())[mask_filtered_tokens]
        return pd.DataFrame(O_hat,
                            columns= columns,
                            index=['pos', 'neg'])



    def predict(self, df):
        '''predict the positivity of a new document'''
        word_count_pred = pd.DataFrame.sparse.from_spmatrix(
                self.cv.transform(df['text']), columns=self.cv.get_feature_names()
                )
        
        # Filter out neutral-sentiment words from test set
        mask_filtered_tokens = self._get_non_neutral_words()
        non_neutral_words = np.array(self.cv.get_feature_names())[mask_filtered_tokens]
        word_count_pred = word_count_pred[non_neutral_words]
        
        preds = []
        for _, doc in word_count_pred.iterrows():
            self._word_count_pred = doc.to_frame().T
            self._opt_res = minimize_scalar(self._minus_log_likelihood, bounds=(0, 1), method='bounded')
            p_star = self._opt_res.x
            preds.append(p_star)
        
        return np.array(preds)

    def _get_topics(self, n):
        '''returns positive and negative topics'''
        pos_freq_words = (self._O_hat_df.loc['pos'].sort_values(ascending=False)[:n] * 100000).astype(int).to_dict()
        neg_freq_words = (self._O_hat_df.loc['neg'].sort_values(ascending=False)[:n] * 100000).astype(int).to_dict()
        return pos_freq_words, neg_freq_words

    def makeImage(self, text, show=False):
        '''helper method to plot a wordcloud in a circular fashion'''
        x, y = np.ogrid[:1000, :1000]
        mask = (x - 500) ** 2 + (y - 500) ** 2 > 500 ** 2
        mask = 255 * mask.astype(int)
        wc = WordCloud(background_color="white", repeat=True, mask=mask)
        # generate word cloud
        wc.generate_from_frequencies(text)
        # show
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.show()

    def _plot_pos_topics(self, n):
        '''plot wordcloud of positive topics'''

        pos_freq_words = self._get_topics(n)[0]
        self.makeImage(pos_freq_words, show=False)

    def _plot_neg_topics(self, n):
        '''plot wordcloud of negative topics'''

        neg_freq_words = self._get_topics(n)[1]
        self.makeImage(neg_freq_words, show=False)

    def plot_topics(self, n):
        '''plot wordcloud of positive and negative topics'''
        print('Preparing topics plots ... ')
        self._plot_pos_topics(n)
        self._plot_neg_topics(n)


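Scoring a new article amounts to a one-dimensional optimisation over p, as in `predict`. The sketch below is a simplified illustration with made-up topic vectors. `score_document` is a hypothetical helper, and it assumes the log-likelihood is divided by the document's word total and the penalty is applied once per document, which may differ from the class above.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def score_document(doc_counts, o_pos, o_neg, lambda_reg):
    """Hypothetical helper: maximise a penalised multinomial log-likelihood in p."""
    total = doc_counts.sum()

    def objective(p):
        # Word distribution of a document with sentiment score p
        mix = p * o_pos + (1 - p) * o_neg
        log_lik = np.sum(doc_counts * np.log(mix)) / total
        # The penalty pulls p toward 0.5 when the evidence is weak
        penalty = lambda_reg * np.log(p * (1 - p))
        return -(log_lik + penalty)

    return minimize_scalar(objective, bounds=(0, 1), method='bounded').x


# Made-up positive and negative topics over a 3-word vocabulary
o_pos = np.array([0.7, 0.2, 0.1])
o_neg = np.array([0.1, 0.2, 0.7])

p_pos = score_document(np.array([5, 1, 0]), o_pos, o_neg, lambda_reg=0.1)
p_neg = score_document(np.array([0, 1, 5]), o_pos, o_neg, lambda_reg=0.1)
print(p_pos, p_neg)
```

A document dominated by words from the positive topic scores above 0.5, and its mirror image scores symmetrically below 0.5.
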
# 1_preproc_json_to_dataframe.py 

import sys
import numpy as np
import pandas as pd
import json
import vaex


def read_huge_json_to_dataframe(path):
    """ Utility to read NLP +10GB JSON files into DataFrames
    """
    # pandas.read_json exhausts RAM on files this size, so parse the file
    # one JSON record per line instead
    with open(path) as f:
        data = pd.DataFrame(json.loads(line) for line in f)
    
    # Meta info (site, site_type, country) is a column
    # of dictionaries
    meta = pd.DataFrame(list(data['thread']))
    data = pd.concat([data, meta], axis=1, ignore_index=False)
    del data['thread']
    
    # Some titles are stored as lists; replace those with an empty string
    data['title'] = data['title'].apply(lambda x: '' if isinstance(x, list) else x)
    
    # Country has mixed types (str, float, list)
    data['country'] = data['country'].apply(lambda x: '' if isinstance(x, list) else x)
    data['country'] = data['country'].fillna('')
    
    return data


if __name__ == '__main__':
    
    # Get path for raw data
    # Examples:
    # raw_path = '/home/ubuntu/data/boeing.json'
    # new_path = '/home/ubuntu/internal_omicron/cppib_data/boeing-nlp-data-201501-202011.hdf5'
    args = sys.argv
    
    if len(args) == 3:
        raw_path, new_path = args[1], args[2]

        # Load JSON Dataset
        data = read_huge_json_to_dataframe(raw_path)

        # Convert Pandas DataFrame to Vaex DataFrame
        vaex_df = vaex.from_pandas(data, copy_index=False)

        # Combine Title and Text Body (this operation in Pandas leads to
        # kernel shutdown)
        vaex_df['text'] = vaex_df['title'] + ' ' + vaex_df['text']

        # Lowercase
        vaex_df['text'] = vaex_df['text'].str.lower()
        
        # Save Vaex Format to HDF5 Format
        vaex_df.export_hdf5(new_path)
        print('Export completed.')
        
    else:
        print('Irregular number of arguments. Requires 2 not {}.'.format(len(args)-1))

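When only a few fields are needed, the nested `thread` dictionary can be flattened while the file is read, which avoids the later `pd.concat`. This is a sketch under assumptions: `read_json_lines` is a hypothetical helper and the two sample records are made up.

```python
import io
import json

import pandas as pd


def read_json_lines(handle, keep_fields):
    """Hypothetical helper: parse one JSON record per line, keeping selected fields."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue                                  # skip blank lines
        record = json.loads(line)
        row = {key: record.get(key) for key in keep_fields}
        row.update(record.get('thread') or {})        # flatten nested meta info
        rows.append(row)
    return pd.DataFrame(rows)


# Two made-up records separated by a blank line
sample = io.StringIO(
    '{"title": "A", "text": "x", "thread": {"site": "s1", "country": "US"}}\n'
    '\n'
    '{"title": "B", "text": "y", "thread": {"site": "s2", "country": null}}\n'
)
df = read_json_lines(sample, keep_fields=['title', 'text'])
print(df)
```

Any file object opened with `open(path)` can be passed in place of the `io.StringIO` sample.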

# 2_preproc_raw_to_clean.py

import sys
import vaex
import re
import pandas as pd
import spacy
from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords
from dask_ml.feature_extraction.text import CountVectorizer as DaskMLCountVectorizer


# Prepare stopwords list
nltk_stopwords = list(stopwords.words('english'))
spacy_stopwords = list(spacy_stopwords)
STOPWORDS = list(set(nltk_stopwords + spacy_stopwords))
STOPWORDS += ['january', 'february', 'march', 'april', 'june', 'july', 'august',
              'september', 'october', 'november', 'december', 'com', 'http',
              'https', 'said', 'like', 'new', 'year', 'years', 'news']

# Speeds up massively Spacy Lemmatization
nlp = spacy.load('en_core_web_lg', disable=['tagger', 'parser', 'ner'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))


def lemmatize_pipe(doc):
    """ Return the lowercased lemmas of a spaCy Doc as one space-separated
        string, keeping alphabetic tokens of 3+ characters that are not stopwords.
    """
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha
                  and tok.lemma_ not in STOPWORDS
                  and len(tok.lemma_) >= 3]
    return ' '.join(lemma_list)

def preprocess_pipe(texts):
    """ Lemmatize an iterable of texts in batches with nlp.pipe and
        return the cleaned strings in the same order.
    """
    return [lemmatize_pipe(doc) for doc in nlp.pipe(texts, batch_size=200)]


if __name__ == '__main__':
    
    # Get paths for data
    # Examples:
    # new_path = '/home/ubuntu/internal_omicron/cppib_data/boeing-nlp-data-201501-202011.hdf5'
    # clean_path = '/home/ubuntu/internal_omicron/cppib_data/boeing-nlp-prep-201501-202011.parquet'
    args = sys.argv
    
    if len(args) == 3:
        new_path, clean_path = args[1], args[2]

        # Open Data
        data = vaex.open(new_path) #.head(100000)
        data = data[data['text'].str.len() <= 300000]
        print('Loading {} documents'.format(len(data)))
        
        # Dates
        data['crawled'] = data['crawled'].astype('datetime64[ns]')
        data['date'] = data['crawled'].astype('datetime64[D]')
        data['date'] = data['date'].astype(str)

        # Text Preprocessing with Regular Expressions
        punct = r"""-!"'#&$%\()*+,.:;<=>?@[\]^_`{|}~–’"""
        punc_pattern = '|'.join(re.escape(char) for char in punct)
        full_regex_pattern = r'(\S*@\S*\s?|http\S+|https\S+|\s+|{})'.format(punc_pattern)
        data['text'] = data['text'].str.replace(full_regex_pattern, ' ', regex=True)

        # Materialise to pandas (this evaluates the lazy regex replacement)
        print('Regular expressions')
        data_pandas = data.to_pandas_df(column_names=['date', 'text'])
        
        # Lemmatization
        print('Lemmatization')
        data_pandas['text'] = preprocess_pipe(data_pandas['text'])
        
        # Export to Parquet
        data_pandas.to_parquet(clean_path)
        print('Exported to parquet')
        
    else:
        print('Irregular number of arguments. Requires 2 not {}.'.format(len(args)-1))

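The effect of the cleaning regex can be checked on a single string with the standard `re` module. This is a sketch, not the Vaex pipeline: the sample sentence is made up, and the final whitespace normalisation is an addition for readability.

```python
import re

# Same kind of pattern as in the script: e-mail addresses, URLs,
# whitespace runs and punctuation are each replaced by a space
punct = "-!\"'#&$%\\()*+,.:;<=>?@[]^_`{|}~–’"
punct_pattern = '|'.join(re.escape(char) for char in punct)
clean_regex = re.compile(r'(\S*@\S*\s?|http\S+|\s+|{})'.format(punct_pattern))


def clean_text(text):
    """Apply the regex, then collapse repeated spaces (added for this sketch)."""
    replaced = clean_regex.sub(' ', text)
    return ' '.join(replaced.split())


raw = "Boeing's 737 MAX: grounded! Contact press@boeing.com or see https://example.com/news"
cleaned = clean_text(raw)
print(cleaned)  # Boeing s 737 MAX grounded Contact or see
```

The stray "s" left over from "Boeing's" is removed later by the lemmatization step, which drops tokens shorter than three characters.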