# Data Retrieval

This notebook demonstrates our data retrieval process. We cover the main constructs and functionality of the `retrieval` module in the `data_pipeline` package.

## The `DataBank` Construct

In [2]:
# Absolute import: a notebook has no parent package, so relative imports fail.
from data_pipeline.retrieval import DataBank, download_adj_close

Our `DataBank` collects all the tickers from the Wikipedia article listing the S&P 500 companies. It has methods for retrieving all the tickers, as well as for organizing them by GICS Sector and GICS Sub-Industry. Let's initialize an instance of `DataBank` and collect the tickers. (As a sanity check, we compute the length of the resulting list.)

In [None]:
data_bank = DataBank()
tickers = data_bank.get_tickers()

len(tickers)

In [None]:
tickers[:5]
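
The length check above can be extended with a few cheap structural checks. The following is a minimal sketch, not part of `DataBank`; the helper name `check_tickers` and the sample list are hypothetical and serve only as an illustration.

```python
def check_tickers(tickers):
    """Return a list of problems found in a ticker list (empty if none)."""
    problems = []
    if not tickers:
        problems.append("ticker list is empty")
    if len(set(tickers)) != len(tickers):
        problems.append("ticker list contains duplicates")
    # Every entry should be a non-empty string.
    bad = [t for t in tickers if not isinstance(t, str) or not t.strip()]
    if bad:
        problems.append(f"{len(bad)} entries are not non-empty strings")
    return problems


# Hypothetical sample with one duplicate and one empty entry.
sample = ["AAPL", "MSFT", "MSFT", ""]
issues = check_tickers(sample)
for issue in issues:
    print(issue)
```

In the notebook, one could call `check_tickers(tickers)` and expect an empty list.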

## Tickers by Sector and Sub-Industry

The tickers can also be obtained by sector and sub-industry. In particular, the `DataBank` object has methods for constructing ticker-to-sector maps and ticker-to-subindustry maps. Here, the terms 'sector' and 'sub-industry' refer to the [GICS](https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard) sectors and sub-industries. 

These methods are particularly useful when one would like to cluster particular tickers in accordance with their GICS classifications.
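
As an illustration of how such a map supports clustering by GICS classification, a ticker-to-sector dictionary can be inverted into sector-to-tickers groups. This is a sketch under assumptions: the four entries below are a small hand-written sample rather than output from `DataBank`, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative sample only; the real map comes from `DataBank`.
ticker_to_sector = {
    "AAPL": "Information Technology",
    "MSFT": "Information Technology",
    "JPM": "Financials",
    "XOM": "Energy",
}


def group_by_label(ticker_to_label):
    """Invert a ticker-to-label map into a label-to-tickers map."""
    groups = defaultdict(list)
    for ticker, label in ticker_to_label.items():
        groups[label].append(ticker)
    return dict(groups)


sector_to_tickers = group_by_label(ticker_to_sector)
for sector, members in sector_to_tickers.items():
    print(sector, members)
```

The same helper works unchanged on a ticker-to-sub-industry map, since it only relies on the dictionary structure.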

In [None]:
data_bank.get_sectors_list()[:5]

In [None]:
data_bank.get_subind_list()[:5]

In [None]:
from itertools import islice

In [None]:
dict(islice(data_bank.ticker_to_sector_map().items(), 5))

In [None]:
dict(islice(data_bank.ticker_to_subind_map().items(), 5))

## Downloading Data

We download our data from the [Yahoo! Finance](https://finance.yahoo.com/) API using the [`yfinance` library](https://github.com/ranaroussi/yfinance). Our primary data of interest are the adjusted closing prices of the tickers considered.

We built functions for downloading historical data on top of `yfinance`. We specify:
* a list of tickers,
* a start date and an end date (instead of a period),
* an interval (e.g. 1 month for monthly data or 1 day for daily data),

and obtain a dataframe whose columns are the tickers and whose index is the date.
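
The download step described above could be sketched roughly as follows. This is not the implementation of `download_adj_close`, whose internals are not shown here; the names `check_request` and `download_adj_close_sketch` are hypothetical, and the set of accepted intervals is a deliberately small subset of what `yfinance` supports. The sketch validates the date window first, since dates are expected in ISO `YYYY-MM-DD` form.

```python
from datetime import date

# A small subset of the interval strings accepted by yfinance.
VALID_INTERVALS = {"1d", "1wk", "1mo"}


def check_request(start, end, interval="1d"):
    """Validate an ISO date window and an interval before any download."""
    start_date = date.fromisoformat(start)  # ValueError if not YYYY-MM-DD
    end_date = date.fromisoformat(end)
    if start_date >= end_date:
        raise ValueError("start must be strictly before end")
    if interval not in VALID_INTERVALS:
        raise ValueError(f"unsupported interval: {interval!r}")
    return start_date, end_date


def download_adj_close_sketch(tickers, start, end, interval="1d"):
    """Return adjusted closes for several tickers (dates as index, tickers as columns)."""
    check_request(start, end, interval)
    import yfinance as yf  # imported lazily; the call below needs network access

    data = yf.download(
        tickers,
        start=start,
        end=end,
        interval=interval,
        auto_adjust=False,  # keep the separate "Adj Close" column
        progress=False,
    )
    return data["Adj Close"]


window = check_request("2023-11-28", "2023-12-01")
print(window)
```

Validating before the network call means a malformed date such as `'2023-28-11'` fails immediately with a `ValueError` instead of producing an empty or misleading download.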

We have two years of monthly and daily data, ending at the end of November 2023, saved in cold storage as pickle files in the local folder [`data`](../../data/) of this repository.
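
One common way to use such pickle files is a load-or-fetch pattern: read the cached object if the file exists, otherwise download and save it. The sketch below uses only the standard library and is an assumption about how a cache could be organized, not a description of this repository's code; `load_or_fetch`, `fake_download`, and the file name `daily.pkl` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_fetch(path, fetch):
    """Load a pickled object from `path`, or call `fetch()` and cache its result."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    obj = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return obj


calls = []


def fake_download():
    """Stand-in for a real download; records each call and returns made-up numbers."""
    calls.append(1)
    return {"AAPL": [189.0, 190.0]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "data" / "daily.pkl"
    first = load_or_fetch(cache_file, fake_download)   # downloads and writes the file
    second = load_or_fetch(cache_file, fake_download)  # served from the pickle

print(len(calls))
```

Dataframes can be pickled in the same way, and pandas also offers `DataFrame.to_pickle` and `pandas.read_pickle`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.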

In this notebook, we demonstrate the functionality with 3 days of data.

In [None]:
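from datetime import date

start = '2023-11-28'  # ISO format: YYYY-MM-DD
end = date.today().isoformat()  # today's date as 'YYYY-MM-DD'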

In [None]:
adj_closing_prices = download_adj_close(tickers, start, save_data=False)

In [None]:
adj_closing_prices

# Data Processing

Our data processing pipeline is simple:
* We take the adjusted close price data previously obtained as a dataframe from our retrieval process.
* We pass this dataframe to a `ClusterInput` class as an attribute, along with an optional second attribute `transform`. This second attribute is a function that transforms the multivariate time series into one that is more suitable for clustering purposes. For instance, we could normalize the time series using the $\ell_2$-norm. We could also take the series of returns as our clustering input. Alternatively, we could take the returns of the return series.
* The resulting `ClusterInput` object then has a `df` attribute representing the transposed transformed time series as a dataframe. The latter can readily be used as an input (features) for various standard clustering models that are part of standard libraries such as `scikit-learn` and `tslearn`.

Note that this input can be further processed by the standard `preprocessing` classes of the aforementioned libraries.
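
To make the shapes in the pipeline above concrete, here is a plain-Python sketch of the transform-then-transpose idea. It does not reproduce `ClusterInput`, whose implementation is not shown here: the helpers `simple_returns`, `l2_normalize`, and `cluster_rows` are hypothetical, the prices are made up, and "returns" is assumed to mean simple period-over-period returns.

```python
import math


def simple_returns(prices):
    """Simple returns p_t / p_{t-1} - 1 (one element shorter than the input)."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def l2_normalize(values):
    """Scale a series to unit Euclidean norm (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm else list(values)


def cluster_rows(table, transform=None):
    """Turn a date-by-ticker table into one feature row per ticker.

    `table` is a list of rows, one per date, each holding one price per ticker.
    """
    columns = [list(col) for col in zip(*table)]  # transpose: one series per ticker
    if transform is None:
        return columns
    return [transform(series) for series in columns]


# Made-up prices for two hypothetical tickers over three dates.
prices = [
    [100.0, 8.0],
    [150.0, 10.0],
    [75.0, 5.0],
]

rows_raw = cluster_rows(prices)
rows_returns = cluster_rows(prices, simple_returns)
rows_unit = cluster_rows(prices, l2_normalize)
print(rows_returns)
```

Each resulting row is the feature vector of one ticker, which is the layout that clustering models expect: samples as rows and time steps as features.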

In what follows, we offer two examples: one using `KMeans` from `scikit-learn` and one using `TimeSeriesKMeans` from `tslearn`.

## Clustering Input Generation

In [None]:
from data_pipeline.processing import ClusterInput

In [None]:
clustering_input = ClusterInput(adj_closing_prices).df
clustering_input

## Further Preprocessing (Optional)

In [None]:
from sklearn.preprocessing import RobustScaler

features_sklearn = RobustScaler().fit_transform(clustering_input)
features_sklearn

In [None]:
from tslearn.preprocessing import TimeSeriesScalerMinMax

features_tslearn = TimeSeriesScalerMinMax().fit_transform(clustering_input)
features_tslearn

## Feeding Our `ClusterInput` To Models From Standard Libraries

We demonstrate this with preprocessing (on `features_<library_name>`) and without preprocessing (on `clustering_input`).

In [None]:
from sklearn.cluster import KMeans

KMeans().fit(features_sklearn).labels_


In [None]:
KMeans().fit(clustering_input).labels_

In [None]:
from tslearn.clustering import TimeSeriesKMeans

TimeSeriesKMeans().fit(features_tslearn).labels_

In [None]:
TimeSeriesKMeans().fit(clustering_input).labels_

In [None]:
import hyperopt
