In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import yfinance as yf
import pandas as pd
import logging
import sys
from copy import deepcopy
from pathlib import Path
from datetime import datetime
from dateutil.relativedelta import relativedelta

In [3]:
src_path: str = "../src"
sys.path.append(src_path)
logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)

In [4]:
from data.utils import download_yfinance_data, get_price_statistics
from data.plots import candlestick_yearly, violin_monthly, violin_weekday
from models.utils import split_time_data, preprocess_data, fit_forecaster
from models.plots import plot_data_split, plot_data_predictions

In [5]:
random_seed = 8080
portfolio_name = "big_tech"
data_path = Path("..").resolve().joinpath("data")
data_path

PosixPath('/home/uziel/Development/stock_picker/data')

## 1. Business understanding


It is often not straightforward to tell which stocks have outperformed others just by looking at historical data. Stock prices change constantly, and events such as economic recessions, pandemics or natural disasters can heavily affect the valuations of many companies.

However, these companies are not necessarily affected in the same way, and their publicly traded prices can, due to a multitude of factors, react very differently to the same events. For example, during the COVID-19 outbreak in 2020, eCommerce companies such as Amazon, as well as hardware manufacturers such as NVIDIA, saw a dramatic increase in demand for their products due to the pandemic's lockdowns. We will be able to see this clearly in the data later on.

As small investors, we might be interested in analysing the past performance of a certain portfolio of companies, as well as in forecasting whether they will continue their present trend.

The goal of this project is to provide an easy-to-use interface to quickly compare the performance of multiple companies over a given period of time by means of visualizations and aggregated statistics. This will allow users to see which companies did best. It will also provide an idea of how stock prices may change in the near future. We aim to answer questions such as:

1. Which company grew the most in value?
2. Which company experienced the lowest volatility? Which experienced the highest?
3. Which company is expected to grow the most in value?

However, it is important to consider that, according to [random walk theory](https://en.wikipedia.org/wiki/Random_walk_hypothesis), market prices move randomly, so past price movements cannot be used to predict future ones. In other words, the patterns observed in the historical price of a certain financial security are unlikely to help predict the future. Under this view, the future value of a security equals its previous value plus some unexplained variance.
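
To make the last sentence concrete, here is a minimal sketch of a random walk, assuming Gaussian noise and made-up parameters (`start_price`, `step_std` and the helper name `simulate_random_walk` are illustrative and not part of this project). Each price is the previous price plus a random shock, so the best guess for the next price is simply the current one.

```python
import random


def simulate_random_walk(
    start_price: float, n_steps: int, step_std: float, seed: int
) -> list[float]:
    """Simulate P_t = P_{t-1} + e_t, with e_t drawn from N(0, step_std^2)."""
    rng = random.Random(seed)
    prices = [start_price]
    for _ in range(n_steps):
        # The next price is the previous price plus an unexplained shock.
        prices.append(prices[-1] + rng.gauss(0.0, step_std))
    return prices


walk = simulate_random_walk(start_price=100.0, n_steps=250, step_std=1.0, seed=8080)

# Under this model, the best forecast for tomorrow is today's price.
naive_forecast = walk[-1]
print(f"Last simulated price: {walk[-1]:.2f}, naive forecast: {naive_forecast:.2f}")
```

Real prices need not follow this model exactly; the sketch only illustrates why past patterns carry little information if they do.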

**DISCLAIMER:** This project is merely meant to be used for understanding the past and getting a sense of the future. The insights gained and any recommendations made **are not financial advice**. The value of a company at any given time and its evolution depend on many factors that aren't taken into account in this project. Real-world value investing requires an in-depth analysis of each company and sector, and it's still not guaranteed to yield better returns than simply investing in a market index. And above all, **past performance is no guarantee of future results. Don't assume an investment will continue to do well in the future simply because it's done well in the past.**


## 2. Understanding the data through Exploratory Data Analysis (EDA)

In this section, we preview the kind of financial data that can be downloaded through the Yahoo Finance API.


In [6]:
portfolio_filepath = data_path.joinpath("portfolios").joinpath(f"{portfolio_name}.txt")
# Skip blank lines (e.g. a trailing newline) so that no empty ticker is produced.
tickers = [line.split()[0] for line in portfolio_filepath.read_text().splitlines() if line.strip()]
date_range = (
    datetime.now() - relativedelta(years=5),
    datetime.now(),
)
save_path = data_path.joinpath(portfolio_name)
save_path.mkdir(parents=True, exist_ok=True)

In [7]:
ticker_selected = tickers[0]

In [8]:
tickers_info, tickers_data = download_yfinance_data(tickers, date_range, save_path)

INFO:root:Downloading information for tickers (GOOG AMZN AAPL MSFT NFLX)
2023-02-15 21:57:52.846 INFO    root: Downloading information for tickers (GOOG AMZN AAPL MSFT NFLX)




INFO:root:Tickers information downloaded successfully!
2023-02-15 21:58:14.483 INFO    root: Tickers information downloaded successfully!
INFO:root:Downloading historical data...
2023-02-15 21:58:14.485 INFO    root: Downloading historical data...


[*********************100%***********************]  5 of 5 completed

INFO:root:Historical data downloaded successfully!
2023-02-15 21:58:14.994 INFO    root: Historical data downloaded successfully!





### 2.1. Tickers information


In [9]:
tickers_info.dropna(how="all")

| | GOOG | AMZN | AAPL | MSFT | NFLX |
|---|---|---|---|---|---|
| zip | 94043 | 98109-5210 | 95014 | 98052-6399 | 95032 |
| sector | Communication Services | Consumer Cyclical | Technology | Technology | Communication Services |
| fullTimeEmployees | 190234 | 1541000 | 164000 | 221000 | 12800 |
| longBusinessSummary | Alphabet Inc. offers various products and plat... | Amazon.com, Inc. engages in the retail sale of... | Apple Inc. designs, manufactures, and markets ... | Microsoft Corporation develops, licenses, and ... | Netflix, Inc. provides entertainment services.... |
| city | Mountain View | Seattle | Cupertino | Redmond | Los Gatos |
| ... | ... | ... | ... | ... | ... |
| bidSize | 800 | 800 | 1200 | 1000 | 900 |
| preMarketPrice | 94.8 | 99.13 | 153.12 | 268.3 | 355.44 |
| logo_url | https://logo.clearbit.com/abc.xyz | https://logo.clearbit.com/amazon.com | https://logo.clearbit.com/apple.com | https://logo.clearbit.com/microsoft.com | https://logo.clearbit.com/netflix.com |
| trailingPegRatio | 1.1571 | 6.471 | 2.4749 | 2.2853 | 1.6656 |


In [10]:
tickers_info.dropna(how="all").index

Index(['zip', 'sector', 'fullTimeEmployees', 'longBusinessSummary', 'city',
       'phone', 'state', 'country', 'companyOfficers', 'website', 'maxAge',
       'address1', 'industry', 'ebitdaMargins', 'profitMargins',
       'grossMargins', 'operatingCashflow', 'revenueGrowth',
       'operatingMargins', 'ebitda', 'targetLowPrice', 'recommendationKey',
       'grossProfits', 'freeCashflow', 'targetMedianPrice', 'earningsGrowth',
       'currentRatio', 'returnOnAssets', 'numberOfAnalystOpinions',
       'targetMeanPrice', 'debtToEquity', 'returnOnEquity', 'targetHighPrice',
       'totalCash', 'totalDebt', 'totalRevenue', 'totalCashPerShare',
       'financialCurrency', 'revenuePerShare', 'quickRatio',
       'recommendationMean', 'shortName', 'longName', 'isEsgPopulated',
       'gmtOffSetMilliseconds', 'messageBoardId', 'market',
       'enterpriseToRevenue', 'enterpriseToEbitda', 'forwardEps',
       'sharesOutstanding', 'bookValue', 'sharesShort',
       'sharesPercentSharesOut', 'la

Lots of current information is available for inspection for each of the security tickers in our portfolio. Among the many fields, we can find relevant financial indicators such as `ebitda`, `freeCashflow`, `revenuePerShare` and many, many others. Unfortunately, we don't have historical data for these fields, so we cannot link this information to our historical price data.

### 2.2. Historical Price Data


In [11]:
tickers_data

Unnamed: 0_level_0,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,Close,...,Open,Open,Open,Open,Open,Volume,Volume,Volume,Volume,Volume
Unnamed: 0_level_1,AAPL,AMZN,GOOG,MSFT,NFLX,AAPL,AMZN,GOOG,MSFT,NFLX,...,AAPL,AMZN,GOOG,MSFT,NFLX,AAPL,AMZN,GOOG,MSFT,NFLX
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
2018-02-16,41.095284,72.434502,54.740002,87.044746,278.519989,43.107498,72.434502,54.740002,92.000000,278.519989,...,43.090000,72.868500,54.420502,92.449997,278.730011,160704400.0,89452000.0,33590000.0,30596900.0,8312400.0
2018-02-17,,,,,,,,,,,...,,,,,,,,,,
2018-02-18,,,,,,,,,,,...,,,,,,,,,,
2018-02-19,,,,,,,,,,,...,,,,,,,,,,
2018-02-20,40.957062,73.417503,55.123001,87.725960,278.549988,42.962502,73.417503,55.123001,92.720001,278.549988,...,43.012501,72.324501,54.528500,91.480003,277.739990,135722000.0,129984000.0,28462000.0,30911700.0,7769000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-02-11,,,,,,,,,,,...,,,,,,,,,,
2023-02-12,,,,,,,,,,,...,,,,,,,,,,
2023-02-13,153.850006,99.540001,95.000000,271.320007,358.570007,153.850006,99.540001,95.000000,271.320007,358.570007,...,150.949997,97.849998,95.010002,267.640015,349.500000,62199000.0,52841500.0,43116600.0,44630900.0,7134400.0
2023-02-14,153.199997,99.699997,94.949997,272.170013,359.959991,153.199997,99.699997,94.949997,272.170013,359.959991,...,152.119995,98.410004,94.660004,272.670013,357.549988,61707600.0,56202900.0,42513100.0,37047900.0,4624800.0


Different information is available for each date and ticker: `Adj Close`, `Close`, `High`, `Low`, `Open` and `Volume`. We will only be using `Adj Close` for performance analysis as well as forecasting.

`NaN` values indicate dates for which no data is available. In the case of stock securities of publicly traded companies, we don't have values for Saturdays and Sundays, nor for market holidays (such as 2018-02-19 above), since the stock markets are closed. Later on, we will remove leading and trailing `NaN` values as needed, while `NaN` values found in the middle of a period, such as weekends, will be forward filled.
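
The cleaning rule described above can be sketched with pandas as follows. This is a toy illustration on made-up prices, not the project's `preprocess_data` implementation, whose internals are not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Toy series: no data for the first two days, a weekend gap (Feb 11-12),
# and no data yet for the last day. All values are made up.
idx = pd.date_range("2023-02-08", periods=8, freq="D")
prices = pd.Series(
    [np.nan, np.nan, 10.0, np.nan, np.nan, 11.0, 12.0, np.nan], index=idx
)

# 1. Trim leading and trailing NaN values.
trimmed = prices.loc[prices.first_valid_index() : prices.last_valid_index()]

# 2. Forward fill the gaps that remain in the middle (the weekend).
filled = trimmed.ffill()
print(filled)
```

Forward filling carries Friday's price over the weekend, which matches the idea that the last traded price remains the best known value while the market is closed.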


## 3. Data Preparation and Visualization

In the previous section, we already did some minor pre-processing, such as ensuring that the index of the price data is a `datetime` object. Next, we will extract price statistics as well as visualizations to improve our understanding of the data.


### 3.1. Historical price statistics


In [12]:
price_stats = get_price_statistics(tickers_data)
price_stats

INFO:root:Calculating historical price statistics...
2023-02-15 21:58:15.252 INFO    root: Calculating historical price statistics...
INFO:root:Historical price statistics calculated successfully!
2023-02-15 21:58:19.026 INFO    root: Historical price statistics calculated successfully!


| | count | mean | std | min | 25% | 50% | 75% | max | abs_change | rel_change | max_fall | max_rise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAPL | 1258.0 | 99.2 | 46.24 | 34.26 | 50.45 | 108.73 | 143.02 | 180.68 | 113.92 | 277.22 | -0.39 | 4.27 |
| MSFT | 1258.0 | 194.21 | 74.13 | 82.48 | 122.37 | 203.47 | 255.02 | 339.92 | 181.75 | 208.8 | -0.37 | 3.12 |
| GOOG | 1258.0 | 87.0 | 30.91 | 48.81 | 59.22 | 75.78 | 113.34 | 150.71 | 42.23 | 77.16 | -0.45 | 2.09 |
| AMZN | 1258.0 | 120.59 | 34.79 | 67.2 | 89.62 | 108.22 | 158.09 | 186.57 | 28.57 | 39.44 | -0.56 | 1.78 |
| NFLX | 1258.0 | 390.37 | 117.31 | 166.37 | 307.09 | 362.87 | 494.56 | 691.69 | 82.31 | 29.55 | -0.76 | 1.96 |


According to these statistics, **GOOG** experienced the least price volatility in the observed period, measured by the standard deviation of its price (`std=30.91`), while **NFLX** experienced the largest (`std=117.31`).

On the other hand, **AAPL** increased its value the most since the beginning of the observed period (`rel_change=277.22`), while **NFLX** increased the least (`rel_change=29.55`).

Finally, **MSFT** experienced the smallest maximum fall in price (`max_fall=-0.37`), while **NFLX** saw its price plummet `76%` at some point during this period.

Overall, it seems that this wasn't a good period to be invested in **NFLX**!

**Clarification on custom statistics**:
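- *max_fall*: The maximum price fall from a previous all-time high, expressed as a fraction (`-0.76` means a 76% fall).
- *max_rise*: The maximum price rise from a previous all-time low, expressed as a fraction (`4.27` means a 427% rise).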
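
One way to compute such statistics is with running maxima and minima. The sketch below is an assumption about how these quantities could be calculated, not necessarily how `get_price_statistics` does it; the function names and the toy prices are illustrative.

```python
import pandas as pd


def max_fall(prices: pd.Series) -> float:
    """Largest relative drop from a running maximum (a negative fraction)."""
    running_max = prices.cummax()
    return float((prices / running_max - 1.0).min())


def max_rise(prices: pd.Series) -> float:
    """Largest relative gain from a running minimum (a positive fraction)."""
    running_min = prices.cummin()
    return float((prices / running_min - 1.0).max())


toy = pd.Series([100.0, 120.0, 60.0, 90.0, 150.0])
print(max_fall(toy))  # 60 / 120 - 1 = -0.5
print(max_rise(toy))  # 150 / 60 - 1 = 1.5
```

In the toy series, the price halves after peaking at 120, and later rises to 2.5 times its low of 60.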

### 3.2. Visualizing yearly price movements through candlestick charts

A [candlestick chart](https://en.wikipedia.org/wiki/Candlestick_chart) is a style of financial chart used to describe price movements of a security, derivative, or currency. It is similar to a bar chart in that each candlestick represents all four important pieces of information for its period: open and close in the thick body; high and low in the “candle wick”. Being densely packed with information, it is most often used to show trading patterns over short periods of time, such as a few days or a few trading sessions. In the chart below, each candlestick summarizes a full year instead.
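
As a rough sketch of what a yearly candlestick needs, daily open, high, low and close values can be aggregated per calendar year as shown below. The data is made up, and this is only an assumption about the kind of aggregation involved, not the code behind `candlestick_yearly`.

```python
import pandas as pd

# Toy daily OHLC data spanning two calendar years (values are made up).
daily = pd.DataFrame(
    {
        "Open": [10.0, 11.0, 12.0, 13.0],
        "High": [11.5, 12.5, 13.5, 14.5],
        "Low": [9.5, 10.5, 11.5, 12.5],
        "Close": [11.0, 12.0, 13.0, 14.0],
    },
    index=pd.to_datetime(["2021-12-30", "2021-12-31", "2022-01-03", "2022-01-04"]),
)

# One candlestick per year: first open, highest high, lowest low, last close.
yearly = daily.groupby(daily.index.year).agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last"}
)
print(yearly)
```

Each row of `yearly` then holds the four values that define one candlestick: the body spans open to close, and the wick spans low to high.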


In [13]:
fig = candlestick_yearly(tickers_data, ticker_selected)
fig.update_layout(template="seaborn").show()

We can see that for **GOOG**, there was a positive trend in price from 2019 to 2021, and then the price plummeted in 2022, with some slight recovery in 2023 so far.

### 3.3. Visualizing price seasonality

[Seasonality](https://www.investopedia.com/terms/s/seasonality.asp) is a characteristic of a time series in which the data experiences regular and predictable changes that recur every calendar year. Any predictable fluctuation or pattern that recurs or repeats over a one-year period is said to be seasonal.


In [14]:
fig = violin_monthly(tickers_data, ticker_selected)
fig.update_layout(template="seaborn").show()

It seems that the price of **GOOG** tends to be lower during the summer months compared to the rest of the year.

In [15]:
fig = violin_weekday(tickers_data, ticker_selected)
fig.update_layout(template="seaborn").show()

There is no noticeable difference in the price of **GOOG** depending on the day of the week.
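
The groupings behind the monthly and weekday views can be sketched with pandas as follows. This uses a made-up price ramp rather than real data, and it is an assumption about the grouping step only; the plotting functions `violin_monthly` and `violin_weekday` may work differently.

```python
import numpy as np
import pandas as pd

# Toy daily price series covering one year (a smooth ramp, not real data).
idx = pd.date_range("2022-01-01", "2022-12-31", freq="D")
prices = pd.Series(np.linspace(100.0, 130.0, len(idx)), index=idx)

# Group by calendar month (1-12) and by weekday (0 = Monday ... 6 = Sunday).
by_month = prices.groupby(prices.index.month).median()
by_weekday = prices.groupby(prices.index.dayofweek).median()
print(by_month.round(2))
print(by_weekday.round(2))
```

Note that with a trending series, as in this toy example, later months have higher medians purely because of the trend, so apparent monthly differences should be interpreted with care.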

## 4. Data Modelling

In this section, we will model our time-series data for price forecasting.


### 4.1. Split data into training and testing sets

Data is split into training and testing sets for model fitting and evaluation, respectively.
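
For time series, the split must be chronological, so that the model is never evaluated on dates that precede its training data. A minimal sketch of such a split is shown below. The helper `split_by_date` is hypothetical and only illustrates the idea; it is not the project's `split_time_data`.

```python
import pandas as pd


def split_by_date(
    data: pd.Series, split_date: pd.Timestamp
) -> tuple[pd.Series, pd.Series]:
    """Chronological split: rows before split_date train, the rest test."""
    train = data.loc[data.index < split_date]
    test = data.loc[data.index >= split_date]
    return train, test


idx = pd.date_range("2022-01-01", periods=10, freq="D")
series = pd.Series(range(10), index=idx, dtype=float)
train, test = split_by_date(series, pd.Timestamp("2022-01-08"))
print(len(train), len(test))  # 7 3
```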

In [16]:
train_data, test_data = split_time_data(
    preprocess_data(
        tickers_data.reorder_levels(order=[1, 0], axis=1)[ticker_selected],
    ),
    start_train_date=tickers_data.index.max() - relativedelta(years=1),
)

In [17]:
fig = plot_data_split(train_data, test_data)
fig.update_layout(template="seaborn").show()

### 4.2. Model selection and hyper-parameter tuning

By default, a random forest autoregressor is trained to make predictions into the future. Its hyper-parameters are tuned using a special case of Grid Search with backtesting.
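
An autoregressor predicts each value from a fixed number of preceding values, called lags. The sketch below shows one way to build such a training matrix with pandas. The helper `make_lag_matrix` is hypothetical and independent of what `fit_forecaster` does internally.

```python
import pandas as pd


def make_lag_matrix(
    series: pd.Series, n_lags: int
) -> tuple[pd.DataFrame, pd.Series]:
    """Build features lag_1..lag_n (previous values) and the aligned target."""
    lags = {f"lag_{k}": series.shift(k) for k in range(1, n_lags + 1)}
    # The first n_lags rows lack a full history, so they are dropped.
    features = pd.DataFrame(lags).dropna()
    target = series.loc[features.index]
    return features, target


series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
X, y = make_lag_matrix(series, n_lags=2)
print(X)
print(y)
```

Any regressor, such as a random forest, can then be fitted on `X` and `y`. With 14 lags, each prediction would use the 14 preceding values as features.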

Mean Squared Error (MSE), a common regression evaluation metric, is used in this case to assess performance.
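
For reference, MSE is the average of the squared differences between actual and predicted values. A minimal plain-Python sketch on made-up numbers follows (the function below is written for illustration and is not the metric implementation used by `fit_forecaster`):

```python
def mean_squared_error(actual: list[float], predicted: list[float]) -> float:
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


mse = mean_squared_error([100.0, 102.0, 104.0], [101.0, 102.0, 101.0])
print(mse)  # (1 + 0 + 9) / 3
print(mse ** 0.5)  # RMSE, in the same units as the price
```

Because errors are squared, a few large misses dominate the metric. Taking the square root (RMSE) gives a value in price units that is easier to interpret.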

In [23]:
_, results_grid, pred, error_mse = fit_forecaster(
    train_data, test_data, exog_cols=["day", "month"]
)
results_grid.head()

Number of models compared: 72.


loop lags_grid: 100%|███████████████████████████████████████| 6/6 [00:12<00:00,  2.13s/it]

`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14] 
  Parameters: {'max_depth': 11, 'n_estimators': 100}
  Backtesting metric: 26.360296031213775






| | lags | params | mean_squared_error | max_depth | n_estimators |
|---|---|---|---|---|---|
| 58 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'max_depth': 11, 'n_estimators': 100} | 26.360296 | 11.0 | 100.0 |
| 49 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'max_depth': None, 'n_estimators': 100} | 26.360296 | | 100.0 |
| 55 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'max_depth': 7, 'n_estimators': 100} | 26.360296 | 7.0 | 100.0 |
| 52 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'max_depth': 3, 'n_estimators': 100} | 28.255902 | 3.0 | 100.0 |
| 54 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'max_depth': 7, 'n_estimators': 50} | 29.158522 | 7.0 | 50.0 |


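The lowest backtesting mean squared error (26.36) was achieved with the first 14 lags (`lags=[1, ..., 14]`). Three configurations tie at this value (`max_depth` of 11, 7 and `None`, all with `n_estimators=100`), and the forecaster was refitted with the following hyper-parameters: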

`{'max_depth': 11, 'n_estimators': 100}`

### 4.3. Model Evaluation

In [24]:
fig = plot_data_predictions(test_data, pred)
fig.update_layout(template="seaborn").show()

In [25]:
print(f"MSE: {error_mse}")

MSE: 109.81174377559408


The predicted price was very far from the actual trend: the test MSE (109.81) is roughly four times the backtesting MSE found during tuning (26.36). This suggests that more effort is needed to achieve better performance. Possible improvements include using more powerful models, such as **XGBoost**, more exhaustive hyper-parameter tuning, and a more careful selection of training and testing data. The latter might prove crucial due to the great variability experienced over time.

In any case, and as mentioned at the beginning of this notebook, the price of a security depends little on its past performance. Instead, it is driven by current events from varying sources, from sentiments expressed on social media to geopolitical changes. Including this information as exogenous variables could improve our ability to predict future prices.