# Quantification

In [450]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import json
import os
import matplotlib.pyplot as plt

## Load Data

In [451]:
stock: pd.DataFrame = pd.read_pickle("../../data/TSLA.pkl")
stock

Date,Open,High,Low,Close,Volume
2017-11-02,20.008667,20.579332,19.508667,19.950666,296871000
2017-11-03,19.966667,20.416668,19.675333,20.406000,133410000
2017-11-06,20.466667,20.500000,19.934000,20.185333,97290000
2017-11-07,20.068001,20.433332,20.002001,20.403334,79414500
2017-11-08,20.366667,20.459333,20.086666,20.292667,70879500
...,...,...,...,...,...
2022-10-26,219.399994,230.600006,218.199997,224.639999,85012500
2022-10-27,229.770004,233.809998,222.850006,225.089996,61638800
2022-10-28,225.399994,228.860001,216.350006,228.520004,69152400
2022-10-31,226.190002,229.850006,221.940002,227.539993,61554300


Split the data into training and test sets, holding out the last 90 rows (trading days) for testing:

In [452]:
num_days = 90
train_df = stock[:-num_days]
test_df = stock[-num_days:]
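
The slicing above is a chronological split: the most recent rows are held out, so no future data leaks into training. A minimal sketch of the same idea on a toy frame follows; the dates, prices, and the names `toy_stock`, `toy_train`, and `toy_test` are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stock frame: 10 business days of made-up closing prices.
dates = pd.bdate_range("2022-01-03", periods=10)
toy_stock = pd.DataFrame({"Close": np.linspace(100.0, 109.0, 10)}, index=dates)

num_days = 3  # size of the held-out test window (90 in the notebook)
toy_train = toy_stock[:-num_days]  # everything except the last num_days rows
toy_test = toy_stock[-num_days:]   # the last num_days rows

# The split is chronological: every training date precedes every test date.
print(len(toy_train), len(toy_test))
print(toy_train.index.max() < toy_test.index.min())
```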

## Combine Stocks and News

Intuitively, news provides only a coarse signal about the stock price. It is already useful if today's news can tell us whether the stock price will increase tomorrow.

In [453]:
# shifted percentage change: (tomorrow - today) / today
train_df = train_df[["Close"]].pct_change().shift(-1).dropna()

# whether the stock price will increase compared with today's price
train_df = train_df > 0
train_df.rename(columns={"Close": "Will Go Up?"}, inplace=True)
train_df

Date,Will Go Up?
2017-11-02,True
2017-11-03,False
2017-11-06,True
2017-11-07,False
2017-11-08,False
...,...
2022-06-16,True
2022-06-17,True
2022-06-21,False
2022-06-22,False
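
To see why `pct_change().shift(-1)` yields the next-day return, here is a small worked sketch on four made-up closing prices (the values and the names `toy_close`, `next_day_return`, and `toy_labels` are illustrative assumptions):

```python
import pandas as pd

toy_close = pd.DataFrame(
    {"Close": [10.0, 11.0, 10.5, 12.0]},
    index=pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06"]),
)

# pct_change() on row t gives (today - yesterday) / yesterday;
# shift(-1) moves each value up one row, so row t holds (tomorrow - today) / today.
# The last row has no "tomorrow", becomes NaN, and is dropped.
next_day_return = toy_close[["Close"]].pct_change().shift(-1).dropna()

# True where tomorrow's close is higher than today's.
toy_labels = (next_day_return > 0).rename(columns={"Close": "Will Go Up?"})
print(toy_labels)
```

The three remaining dates are labelled True, False, True, and the last date is dropped because its next-day price is unknown.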


Now, we attach the news headline for each date to the data frame:

In [454]:
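# initialize with None (object dtype) so that string headlines can be assigned
# without clashing with a float column
train_df["Headline"] = None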

for date in train_df.index:
    date_str = datetime.strftime(date, "%Y-%m-%d")
    news_filepath = os.path.join("../../data/news/", f"TSLA/{date_str}.json")
    
    if not os.path.exists(news_filepath):
        continue
    
    with open(news_filepath, "r") as f:
        news = json.load(f)
    
    train_df.loc[date, "Headline"] = news["Headline"]

train_df = train_df[["Headline", "Will Go Up?"]]
train_df

Date,Headline,Will Go Up?
2017-11-02,"The GOP tax bill kills a $7,500 electric-vehic...",True
2017-11-03,Tesla hits bumps in pursuit of mass market,False
2017-11-06,Tesla Investors May Be Losing Patience,True
2017-11-07,Bloomberg,False
2017-11-08,Tesla buys Perbix industrial automation company,False
...,...,...
2022-06-16,Tesla to charge more for cars in United States...,True
2022-06-17,"Tesla Raises EV Prices By as Much as $6,000 USD",True
2022-06-21,Bloomberg,False
2022-06-22,EXCLUSIVE Tesla plans 2-week suspension for mo...,False
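
The loop above can also be expressed with a small helper that returns `None` when no news file exists for a date. The sketch below is one possible formulation, not the notebook's own code: `load_headline` is a hypothetical helper, and the temporary directory and example headline are made up so the snippet runs on its own. Only the `TSLA/<date>.json` layout and the `"Headline"` key are taken from the code above.

```python
import json
import os
import tempfile

import pandas as pd


def load_headline(news_dir, ticker, date):
    """Return the headline stored for `date`, or None if no file exists."""
    path = os.path.join(news_dir, ticker, f"{date:%Y-%m-%d}.json")
    if not os.path.exists(path):
        return None
    with open(path, "r") as f:
        return json.load(f)["Headline"]


with tempfile.TemporaryDirectory() as news_dir:
    # Create one made-up news file; the second date deliberately has none.
    os.makedirs(os.path.join(news_dir, "TSLA"))
    with open(os.path.join(news_dir, "TSLA", "2022-01-03.json"), "w") as f:
        json.dump({"Headline": "Example headline"}, f)

    toy_df = pd.DataFrame(
        {"Will Go Up?": [True, False]},
        index=pd.to_datetime(["2022-01-03", "2022-01-04"]),
    )
    toy_df["Headline"] = [load_headline(news_dir, "TSLA", d) for d in toy_df.index]

toy_df = toy_df[["Headline", "Will Go Up?"]]
print(toy_df)
```

Dates without a news file end up with a missing headline, which would have to be dropped or filled before encoding.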


## Word Embedding

Our goal is to predict whether the stock price will go up tomorrow based on today's news. This is a **classification** problem, more precisely a **binary classification** problem: we want to find a mapping (decision rule)

$$
\text{News Headline (Text)} \mapsto \{\text{True}, \text{False}\}
$$

The headlines, however, are text, so we first need to encode them as numerical data. This procedure is called **word embedding**; since each headline is encoded as a whole, it is more precisely a sentence embedding.

Usually, we would need to train an additional model for the embedding, which is quite laborious. Fortunately, the package `sentence_transformers` provides pretrained models that are ready to use.

In [455]:
from sentence_transformers import SentenceTransformer

```{seealso}
Check the [documentation](https://www.sbert.net/#) of `sentence_transformers` for more information.
```

Encode the news headlines with a `SentenceTransformer` model, here the pretrained `all-MiniLM-L6-v2`:

In [456]:
# load pretrained model
headline_encoder = SentenceTransformer("all-MiniLM-L6-v2")

# encode news headlines to NumPy arrays
embeddings = headline_encoder.encode(train_df["Headline"].tolist())

print(f"embeddings.shape: {embeddings.shape}")
embeddings

embeddings.shape: (1167, 384)


array([[ 0.02502805,  0.05323019,  0.09939805, ..., -0.04196155,
         0.04668063,  0.05590457],
       [-0.01846967, -0.02873902,  0.03534317, ..., -0.13485262,
         0.01329796,  0.13287042],
       [ 0.02563943, -0.04956274,  0.04786101, ..., -0.07707552,
        -0.05291679,  0.0715071 ],
       ...,
       [-0.01015677,  0.00959312, -0.01810676, ..., -0.03960473,
         0.00950317,  0.05490932],
       [-0.07479677,  0.01877449,  0.07296403, ..., -0.08446505,
         0.02360164,  0.05853827],
       [ 0.01585246,  0.05186599,  0.07435998, ..., -0.07771911,
         0.01394665,  0.05791727]], dtype=float32)

As we can see, each of the 1167 headlines is transformed into a vector of length 384.
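
Since row `i` of the embedding matrix corresponds to row `i` of the data frame, the embeddings can be aligned with the labels by reusing the date index. The sketch below illustrates this with random vectors standing in for real embeddings; the names `toy_embeddings`, `toy_labels`, `X`, and `y` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: 4 "headlines" with random 384-dimensional vectors.
rng = np.random.default_rng(0)
toy_index = pd.bdate_range("2022-01-03", periods=4)
toy_embeddings = rng.normal(size=(4, 384)).astype(np.float32)
toy_labels = pd.Series([True, False, True, False], index=toy_index, name="Will Go Up?")

# Row i of the embedding matrix belongs to row i of the labels,
# so the same date index can be attached to the features.
X = pd.DataFrame(toy_embeddings, index=toy_index)
y = toy_labels.astype(int)  # True/False -> 1/0

print(X.shape, y.shape)
```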