<a href="https://colab.research.google.com/github/Krankile/npmf/blob/main/notebooks/univariate_parametric_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

## Kernel setup

In [14]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [15]:
%%capture
!git clone https://github.com/Krankile/npmf.git
!pip install wandb

In [23]:
%%capture
!cd npmf && git pull

In [17]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mkrankile[0m (use `wandb login --relogin` to force relogin)


## General setup

In [24]:
import os
from collections import defaultdict
from collections import Counter
from datetime import datetime
from operator import itemgetter

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from tqdm import tqdm

import wandb as wb

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from npmf.utils.colors import main, main2, main3
from npmf.utils.wandb import get_df_artifact
from npmf.utils.eikon import column_mapping

In [19]:
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=[main, main2, main3, "black"])
mpl.rcParams['figure.figsize'] = (6, 4)  # (6, 4) is default and used in the paper

In [20]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cpu device


In [21]:
np.random.seed(420)

# Get data from WandB 😂✨KAWAIII ^^✨ 

In [25]:
data = get_df_artifact("stock-data-clean-k-5:v0").rename(columns=column_mapping)
data

[34m[1mwandb[0m: Downloading large artifact stock-data-clean-k-5:v0, 135.04MB. 1 files... Done. 0:0:0



|  | ticker | date | market_cap | close_price | currency |
|---|---|---|---|---|---|
| 0 | SZR.WA | 2010-02-19 | 4316546.76259 | 0.061665 | USD |
| 1 | SZR.WA | 2010-02-22 | 6255155.347814 | 0.089359 | USD |
| 2 | SZR.WA | 2010-02-23 | 7567823.237272 | 0.108112 | USD |
| 3 | SZR.WA | 2010-02-24 | 7851000.917649 | 0.112157 | USD |
| 4 | SZR.WA | 2010-02-25 | 8833862.00075 | 0.126198 | USD |
| ... | ... | ... | ... | ... | ... |
| 5747788 | 300772.SZ | 2022-04-15 | 1421837745.71855 | 2.621458 | USD |
| 5747789 | 300772.SZ | 2022-04-18 | 1454200172.46267 | 2.681125 | USD |
| 5747790 | 300772.SZ | 2022-04-19 | 1438043708.32942 | 2.651337 | USD |
| 5747791 | 300772.SZ | 2022-04-20 | 1377339011.94927 | 2.539415 | USD |


### Create a split of data for training and testing

All data from, and including, January 1st, 2019 will be reserved for the test set. All other data can be used for training.

In [30]:
# Boolean mask: True for rows strictly before the cutoff (ISO date avoids day/month ambiguity)
train_test_split = data.date < pd.to_datetime("2019-01-01")

train = data[train_test_split]
test = data[~train_test_split]

train.shape, test.shape

((4484814, 5), (1262979, 5))
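
A chronological split like this one can be sanity-checked on a small toy frame. The sketch below is only illustrative: the column names mirror `data`, but the values and the names `toy`, `cutoff`, and `is_train` are made up for the example. It checks that every row lands in exactly one part and that the cutoff date itself ends up in the test part.

```python
import pandas as pd

# Toy stand-in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2018-12-28", "2018-12-31", "2019-01-01", "2018-06-01", "2019-03-15"]
    ),
    "market_cap": [1.0, 1.1, 1.2, 5.0, 5.5],
})

cutoff = pd.Timestamp("2019-01-01")
is_train = toy["date"] < cutoff  # strictly before the cutoff -> training

toy_train = toy[is_train]
toy_test = toy[~is_train]

# Every row belongs to exactly one part, and no training date reaches the cutoff.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < cutoff <= toy_test["date"].min()
```

The same two assertions should hold for `train` and `test` if the `date` column is a proper datetime column.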

After the split, the training set contains ~4.5M daily stock price observations, while the test set has ~1.3M. There are ~5 trading days in a week, ~20 in a month, ~60 in a quarter, and ~250 in a year. When forecasting a month into the future, it is reasonable to require at least a quarter's worth of training data. We therefore only include companies that have at least a quarter plus a month of trading days (i.e., 60 + 20 = 80) in the training set.

In [46]:
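# Number of non-null values per column for each ticker
counts = train.groupby("ticker").count()

# Tickers with fewer than 80 market-cap observations in the training set
exclude_tickers = counts[counts.market_cap < 80]
# (tickers dropped, rows dropped, rows dropped in per mille of the training set)
exclude_tickers.index.unique().shape[0], exclude_tickers.date.sum(axis=0), exclude_tickers.date.sum(axis=0) / train.shape[0] * 1000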

(16, 668, 0.14894709122830957)

When imposing this requirement, we lose 16 tickers and 668 data points, or about 0.15‰ of the total training data.
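
The threshold arithmetic and the filter can also be sketched on toy data. This is a hedged alternative formulation, not the notebook's method: it uses `groupby(...).transform("count")` to build a per-row mask instead of collecting tickers to exclude. The constants and the names `min_obs`, `toy`, and `kept` are assumptions introduced for illustration.

```python
import pandas as pd

TRADING_DAYS_PER_MONTH = 20
TRADING_DAYS_PER_QUARTER = 60

# One quarter of history plus the one-month forecast horizon.
min_obs = TRADING_DAYS_PER_QUARTER + TRADING_DAYS_PER_MONTH

# Toy stand-in for `train`: ticker "LONG" has 100 rows, "SHORT" has 30.
toy = pd.DataFrame({
    "ticker": ["LONG"] * 100 + ["SHORT"] * 30,
    "market_cap": [1.0] * 130,
})

# For every row, the number of non-null market caps of that row's ticker.
obs_per_ticker = toy.groupby("ticker")["market_cap"].transform("count")
kept = toy[obs_per_ticker >= min_obs]

# Share of rows dropped, expressed in per mille.
dropped_share_permille = (len(toy) - len(kept)) / len(toy) * 1000
```

Because the mask is aligned with the rows of the frame, it filters in one step; applying the same exclusion to a separate test frame would still need the list of excluded tickers, as in the cells here.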

In [49]:
train = train[~train.ticker.isin(exclude_tickers.index)]
test = test[~test.ticker.isin(exclude_tickers.index)]

train.shape, test.shape

((4484146, 5), (1250576, 5))

In [54]:
train.ticker.unique()

array(['SZR.WA', 'EZHL.SI', 'UZMA.KL', ..., '601857.SS', '600256.SS',
       '300370.SZ'], dtype=object)

In [57]:
test_ser = train[train.ticker == "SZR.WA"].loc[:, ["date", "market_cap"]].set_index("date")
test_ser.squeeze()

date
2010-02-19     4316546.76259
2010-02-22    6255155.347814
2010-02-23    7567823.237272
2010-02-24    7851000.917649
2010-02-25     8833862.00075
                   ...      
2018-12-19    1166613.638774
2018-12-20    1167511.343434
2018-12-21     881692.850273
2018-12-27     878758.021996
2018-12-28     878033.205619
Name: market_cap, Length: 2216, dtype: Float64
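
Extracting a single ticker's series is likely to recur, so it could be wrapped in a small helper. The function below is a hypothetical sketch (`ticker_series` is not part of `npmf`), assuming a long-format frame with `ticker` and `date` columns like `train`; it also sorts by date, which the cell above does not do explicitly.

```python
import pandas as pd


def ticker_series(df: pd.DataFrame, ticker: str, column: str = "market_cap") -> pd.Series:
    """Return `column` for one ticker as a date-indexed, chronologically sorted Series."""
    rows = df.loc[df["ticker"] == ticker, ["date", column]]
    return rows.set_index("date")[column].sort_index()


# Toy long-format frame; the values are illustrative only.
toy = pd.DataFrame({
    "ticker": ["A", "B", "A"],
    "date": pd.to_datetime(["2018-01-03", "2018-01-02", "2018-01-02"]),
    "market_cap": [2.0, 9.0, 1.0],
})

ser = ticker_series(toy, "A")
```

With the notebook's data, `ticker_series(train, "SZR.WA")` would be the equivalent call.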