In [14]:
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta

# Introduction

Forecasting financial time series is not easy to pick up the first time. Even if we are acquainted with machine learning classification and regression problems, time series forecasting poses its own challenges: autocorrelation, the format of the data, and the evaluation of predictions.
In this notebook and article series we will try to frame a time series problem in a format that is closer to common machine learning problems, so that we can use frameworks such as gradient boosted machines or DNNs to tackle it.

As such, I will not discuss methods that are commonly used in academia, such as VAR, ARIMA, SARIMA and variations thereof. They require much deeper knowledge of the theory of the data-generating process and are, in my opinion, cumbersome for people who have not encountered them before.

I am going to use Yahoo Finance data as an example below.

In [2]:
data = yf.download(  # or pdr.get_data_yahoo(...
        # tickers list or string as well
        tickers = "SPY AAPL MSFT",

        # use "period" instead of start/end
        # valid periods: 1d,5d,1mo,3mo,6mo,1y,2y,5y,10y,ytd,max
        # (optional, default is '1mo')
        period = "ytd",

        # fetch data by interval (including intraday if period < 60 days)
        # valid intervals: 1m,2m,5m,15m,30m,60m,90m,1h,1d,5d,1wk,1mo,3mo
        # (optional, default is '1d')
        interval = "15m",

        # group by ticker (to access via data['SPY'])
        # (optional, default is 'column')
        group_by = 'ticker',

        # adjust all OHLC automatically
        # (optional, default is False)
        auto_adjust = True,

        # download pre/post regular market hours data
        # (optional, default is False)
        prepost = True,

        # use threads for mass downloading? (True/False/Integer)
        # (optional, default is True)
        threads = True,

        # proxy URL scheme to use when downloading?
        # (optional, default is None)
        proxy = None
    )

[*********************100%***********************]  3 of 3 completed


In [12]:
data["MSFT"].reset_index().dtypes

Datetime    datetime64[ns, America/New_York]
Open                                 float64
High                                 float64
Low                                  float64
Close                                float64
Volume                                 int64
dtype: object
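
Because of `group_by = 'ticker'`, the columns of `data` have two levels, with the ticker on the outer level. The sketch below rebuilds that layout on a small synthetic frame to show two ways of slicing it. The name `toy` and all values are made up for illustration; only the column structure is meant to mirror the download.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded frame: ticker on the outer
# column level, price field on the inner level.
tickers = ["SPY", "AAPL", "MSFT"]
fields = ["Open", "High", "Low", "Close", "Volume"]
columns = pd.MultiIndex.from_product([tickers, fields])
index = pd.date_range("2022-01-03 09:30", periods=4, freq="15min",
                      tz="America/New_York")

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((len(index), len(columns))),
                   index=index, columns=columns)

# All fields for one ticker (same access pattern as data["MSFT"]).
msft = toy["MSFT"]

# One field across all tickers: take a cross-section on the inner level.
closes = toy.xs("Close", axis=1, level=1)
```

`msft` has one column per price field, while `closes` has one column per ticker, which is convenient when the same feature is needed for several stocks at once.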

Let's formulate the problem in the following way:
- We want to predict the average stock price of the coming week
- Each sample in the training set is going to be a combination of Open Price
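
The first bullet can be turned into a target column. This is a minimal sketch, assuming one row per trading day and that "the coming week" means the next five rows. The helper name `next_week_average` and its `horizon` parameter are hypothetical and not part of the notebook.

```python
import pandas as pd


def next_week_average(close: pd.Series, horizon: int = 5) -> pd.Series:
    """Mean of the next `horizon` observations, excluding the current one."""
    # rolling(horizon).mean() at row t averages rows t-horizon+1 .. t;
    # shifting by -horizon moves that window to rows t+1 .. t+horizon.
    return close.rolling(horizon).mean().shift(-horizon)


# Toy daily closes 0.0, 1.0, ..., 9.0 on consecutive business days.
close = pd.Series(
    [float(i) for i in range(10)],
    index=pd.bdate_range("2022-01-03", periods=10),
)
target = next_week_average(close)
```

The last `horizon` rows have no complete future window, so their target is `NaN` and they would have to be dropped before training.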

In [15]:
train_data_cutoff = datetime.fromisoformat('2022-01-10')
test_data_beginning = train_data_cutoff + timedelta(days=7)
test_data_end = test_data_beginning + timedelta(days=7)
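
One thing to watch when these cutoffs are used for filtering: the index of `data` is timezone-aware (`America/New_York`), while `datetime.fromisoformat` returns a naive datetime, and recent pandas versions typically refuse to compare the two. Below is a sketch of the split, assuming the cutoffs are meant in New York time. It runs on a synthetic hourly frame (`frame` is a made-up stand-in for `data["MSFT"]`), so it does not need a download.

```python
import numpy as np
import pandas as pd

tz = "America/New_York"

# Synthetic hourly frame covering 30 days, with a tz-aware index.
index = pd.date_range("2022-01-03", periods=24 * 30, freq="60min", tz=tz)
frame = pd.DataFrame({"Close": np.arange(len(index), dtype=float)}, index=index)

# Same dates as in the cell above, but as tz-aware timestamps.
train_data_cutoff = pd.Timestamp("2022-01-10", tz=tz)
test_data_beginning = train_data_cutoff + pd.Timedelta(days=7)
test_data_end = test_data_beginning + pd.Timedelta(days=7)

# Training rows lie strictly before the cutoff; test rows fall in
# the half-open window [test_data_beginning, test_data_end).
train = frame[frame.index < train_data_cutoff]
test = frame[(frame.index >= test_data_beginning) & (frame.index < test_data_end)]
```

As in the original cell, there is a seven-day gap between the end of the training data and the start of the test window.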

In [28]:
type(train_data_cutoff)

datetime.datetime

In [33]:
data["MSFT"].index.argmax()

1668