<a href="https://colab.research.google.com/github/Krankile/npmf/blob/main/notebooks/initial_dataprocessing_stockvalue.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

## Kernel setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%%capture
!git clone https://github.com/Krankile/npmf.git
!pip install wandb

In [3]:
%%capture
!cd npmf && git pull

In [4]:
!wandb login

[34m[1mwandb[0m: Currently logged in as: [33mkrankile[0m (use `wandb login --relogin` to force relogin)


## General setup

In [5]:
import os
from collections import defaultdict
from datetime import datetime
from operator import itemgetter

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from tqdm import tqdm

import wandb as wb

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from npmf.utils.colors import main, main2, main3

In [6]:
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=[main, main2, main3, "black"])
mpl.rcParams['figure.figsize'] = (16, 9)

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cpu device


In [8]:
np.random.seed(420)

# Let's get the data and check it 😂✨KAWAIII ^^✨



In [9]:
def get_df_artifact(name):
    with wb.init(project="master-test") as run:
        art = run.use_artifact(name)
        art.download()
        filepath = art.file()

        return pd.read_feather(filepath)

In [10]:
data = get_df_artifact("oil-company-data:v1").set_index("Instrument")

[34m[1mwandb[0m: Currently logged in as: [33mkrankile[0m (use `wandb login --relogin` to force relogin)


[34m[1mwandb[0m: Downloading large artifact oil-company-data:v1, 461.61MB. 1 files... Done. 0:0:0



In [11]:
data = data[~(data.Date == "")].astype({"Date": "datetime64[ns]"})
data

Instrument,Date,Company Market Cap,Price Close,Currency
GGX.AX,2005-05-13,4510832.105479,0.116764,USD
GGX.AX,2005-05-16,4486502.643904,0.116764,USD
GGX.AX,2005-05-17,4483522.209137,0.116057,USD
GGX.AX,2005-05-18,4519172.59536,0.116057,USD
GGX.AX,2005-05-19,3827143.993207,0.099066,USD
...,...,...,...,...
AEC.V,2022-04-14,43250854.906096,0.10716,USD
AEC.V,2022-04-18,43082857.134364,0.107015,USD
AEC.V,2022-04-19,43079442.196417,0.107015,USD
AEC.V,2022-04-20,43486177.208353,0.107015,USD


### Number of unique companies

In [12]:
tickers = data[~data["Company Market Cap"].isna()].index.unique()

print(f"{tickers.shape[0]} unique companies in the set")

1697 unique companies in the set


### Number of datapoints for the market cap

In [44]:
marketcaps = data.shape[0]
print(f"There is a total of {marketcaps} datapoints in the dataset")

marketcaps = data[~data["Company Market Cap"].isna()].shape[0]
print(f"There is a total of {marketcaps} datapoints in the dataset that are not NAs")

There is a total of 6230549 datapoints in the dataset
There is a total of 6117652 datapoints in the dataset that are not NAs


## Start process of sorting out data

First, cut off leading and trailing NAs for all companies.

### Remove leading and trailing NAs

In [45]:
stripped = []
points_stripped = 0

for ticker in tqdm(tickers):
    d = data.loc[ticker, ]
    start = 0
    end = d.shape[0]

    # Start at the beginning and find first real value
    for i, val in enumerate(d["Company Market Cap"]):
        if not pd.isna(val):
            start = i
            break
    points_stripped += i

    for i, val in enumerate(d["Company Market Cap"][::-1]):
        if not pd.isna(val):
            end -= i
            break
    points_stripped += i

    stripped.append(d.iloc[start:end, ])

stripped = pd.concat(stripped, axis=0)

print(f"Total of {stripped.shape[0]} datapoints after stripping")

100%|██████████| 1697/1697 [03:35<00:00,  7.87it/s]


Total of 6169474 datapoints after stripping
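
The per-ticker loop above takes a few minutes on the full dataset. As a rough alternative, the same trimming can be sketched with grouped cumulative sums. This is only a sketch: `strip_outer_nas` and the toy frame are hypothetical names introduced here, and it assumes the rows of each company are already in chronological order, as they appear to be in `data`.

```python
import numpy as np
import pandas as pd


def strip_outer_nas(df, col):
    """Drop rows before the first and after the last non-NA `col` value,
    separately for each index label (hypothetical helper)."""
    valid = df[col].notna().astype(int)
    # Forward cumulative count of valid values: > 0 from the first valid row on
    fwd = valid.groupby(level=0, sort=False).cumsum().to_numpy() > 0
    # Same count on the reversed series: > 0 up to and including the last valid row
    bwd = valid[::-1].groupby(level=0, sort=False).cumsum().to_numpy()[::-1] > 0
    return df[fwd & bwd]


# Toy data: "A" has one leading and one trailing NA, "B" has one leading NA
toy = pd.DataFrame(
    {"cap": [np.nan, 1.0, np.nan, 2.0, np.nan, np.nan, 3.0]},
    index=pd.Index(["A", "A", "A", "A", "A", "B", "B"], name="Instrument"),
)
out = strip_outer_nas(toy, "cap")
```

NAs that sit between two real values are left in place, so interior gaps can still be handled in a later step.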


Then, consider different strategies for filtering out companies. For example:

1.   Keep only companies that have less than $p$% NAs in every quarter
2.   Remove companies where any streak of missing values is longer than $n$
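
The notebook goes on to implement the second strategy. For comparison, the first one could be sketched roughly as below. The helper `worst_quarter_na_share`, the toy frame, and the threshold `p = 30` are all illustrative assumptions and not part of the original analysis; calendar quarters are assumed.

```python
import numpy as np
import pandas as pd


def worst_quarter_na_share(df, col="Company Market Cap", date_col="Date"):
    """For each company, return the highest share of NA values of `col`
    observed in any calendar quarter (hypothetical helper)."""
    tmp = pd.DataFrame(
        {
            "company": df.index.to_numpy(),
            # Quarter labels such as "2020Q1"
            "quarter": df[date_col].dt.to_period("Q").astype(str).to_numpy(),
            "is_na": df[col].isna().to_numpy(),
        }
    )
    share = tmp.groupby(["company", "quarter"])["is_na"].mean()
    return share.groupby(level="company").max()


# Toy data: "A" has 1 of 4 NAs in Q1 and 1 of 2 NAs in Q2, "B" has no NAs
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07",
             "2020-04-01", "2020-04-02", "2020-01-02", "2020-01-03"]
        ),
        "Company Market Cap": [1.0, np.nan, 2.0, 3.0, np.nan, 4.0, 5.0, 6.0],
    },
    index=pd.Index(["A"] * 6 + ["B"] * 2, name="Instrument"),
)

worst = worst_quarter_na_share(toy)
p = 30  # illustrative threshold in percent
kept = worst[worst < p / 100].index
```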

### Remove companies with long streaks of NAs

In [56]:
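def na_streak(ser, k):
    """Return True if `ser` contains at least `k` consecutive NA values."""
    curr = 0  # length of the current run of NAs

    for val in ser:
        if not pd.isna(val):
            curr = 0
            continue

        curr += 1

        if curr >= k:
            return True

    return False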
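
Since `na_streak(ser, k)` is true exactly when the longest run of NAs in `ser` is at least `k`, one possible shortcut is to compute that longest run once per company and then compare it against any `k`. The following is a minimal sketch of that idea; `longest_na_streak` and the toy series are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd


def longest_na_streak(ser):
    """Length of the longest run of consecutive NAs in `ser` (hypothetical helper)."""
    is_na = ser.isna().to_numpy()
    if not is_na.any():
        return 0
    # Each non-NA value starts a new block; NAs inherit the id of the block before them
    blocks = np.cumsum(~is_na)
    # Count the NAs per block and take the largest count
    return int(np.bincount(blocks[is_na]).max())


# Toy data: "A" has a streak of two NAs, "B" has a single NA
toy = pd.Series(
    [1.0, np.nan, np.nan, 2.0, 3.0, np.nan, 4.0],
    index=["A", "A", "A", "A", "B", "B", "B"],
)
longest = toy.groupby(level=0).apply(longest_na_streak)

# Companies without a streak of at least k NAs, here for k = 2
survivors_k2 = longest[longest < 2].index
```

With the longest streak stored per company, trying a range of `k` values only requires a comparison per company instead of another pass over the raw series.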


In [None]:
companies_dropped = True
k = 1
result = dict(k=list(), companies=list(), points=list())

# Tickers that have not yet passed the streak test
left = set(tickers)

keep = []
while companies_dropped:
    print(f"Processing for k = {k}")

    companies_dropped = False
    for ticker in tqdm(left.copy()):
        s = stripped.loc[ticker, ]
        if na_streak(s["Company Market Cap"], k):
            # Still has a streak of at least k NAs; retry with a larger k
            companies_dropped = True
        else:
            # Longest NA streak is shorter than k; keep the company
            left.remove(ticker)
            keep.append(s)

    keepdf = pd.concat(keep, axis=0)

    # Cumulative number of companies and datapoints kept at this k
    result["k"].append(k)
    result["companies"].append(keepdf.index.unique().shape[0])
    result["points"].append(keepdf.shape[0])

    k += 1


Processing for k = 1
100%|██████████| 1697/1697 [03:34<00:00,  7.92it/s]
Processing for k = 2
100%|██████████| 108/108 [00:13<00:00,  7.95it/s]
Processing for k = 3
100%|██████████| 107/107 [00:13<00:00,  7.94it/s]
Processing for k = 4
100%|██████████| 106/106 [00:13<00:00,  7.89it/s]
Processing for k = 5
100%|██████████| 105/105 [00:13<00:00,  7.93it/s]
Processing for k = 6
100%|██████████| 105/105 [00:13<00:00,  7.92it/s]
Processing for k = 7
100%|██████████| 104/104 [00:13<00:00,  7.97it/s]
Processing for k = 8
100%|██████████| 103/103 [00:12<00:00,  7.94it/s]
Processing for k = 9
100%|██████████| 103/103 [00:12<00:00,  7.94it/s]
Processing for k = 10
100%|██████████| 103/103 [00:12<00:00,  7.95it/s]
Processing for k = 11
100%|██████████| 103/103 [00:12<00:00,  7.96it/s]
Processing for k = 12
 88%|████████▊ | 91/103 [00:11<00:01,  7.93it/s]

In [70]:
result

{'companies': [1589, 1, 1, 1],
 'k': [1, 2, 3, 4],
 'points': [5729606, 5651, 5268, 5651]}