# Download and Store Data

This notebook contains information on downloading the Yahoo Finance stock prices and a few other sources that we  would use for our automated trading system.

Let's start with installing a few libraries that we are going to use:

In [1]:
%pip install yfinance
%pip install tables

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import yfinance as yf
import pandas as pd
import numpy as np
from pathlib import Path
pd.set_option('display.expand_frame_repr', False)

## Set Data Store path

Currently we will keep our data in the /data folder in which the current file resides. Modify path if you would like to store the data elsewhere and change the notebooks accordingly.

In [3]:
DATA_STORE = Path('assets.h5')

For the purpose of the paper, we have chosen a deversified portfolio of stocks - choosing different industries and countries from the top 100 tickers in the New York Stock Exchange. Here is a little information about the tickers we are going to work with:

|Country | Corporation | Ticker | Industry |
| --- | --- | ---| --- |
| United States | 3M | MMM | Industrial Goods |
| China | Alibaba Group | BABA | Online Retail |
| United States | Amazon | AMZN | Technology |
| United States | Apple | AAPL | Technology |
| Australia | BHP Group LTD | BHP | Metals |
| United States | The Boeing Company | BA | Aerospace | 
| United Kingdom | BP p.l.c. | BP | Oil & Gas |
| Sweden | Ericsson | ERIC | Telecommunications |
| United States | Okta, Inc. | OKTA | CyberSecurity |
| Japan | Toyota | TM | Automobiles & Parts |
| Taiwan | Taiwan Semiconductor | TSM | Technology |
| France | Sanofi | SNY | Pharmaceutical | 
| United States | Coca-Cola Consolidated, Inc. | COKE | Food & Beverage |
| Netherlands | Aegon N.V. | AEG | Insurance |
| United States | General Electric | GE | Conglomerate |
| United States | Walmart Inc. | WMT | Retail |
| United States | General Motors Company | GM | Automotive Manufacturing | 
| United States | Intel Corporation | INTC | Microprocessors Technology |
| Japan | Canon Inc. | CAJ | Optical Imaging and more |
| United States | Coinbase Global, Inc. | COIN | Crypto Exchange |

In [23]:
portfolio_tickers = ['MMM', 'BABA', 'AMZN', 'AAPL', 'BHP', 
                     'BA', 'BP', 'INTC', 'OKTA', 'TM', 
                     'TSM', 'SNY', 'COKE', 'AEG', 'GE', 
                     'WMT', 'GM', 'ERIC', 'CAJ', 'COIN', 'MSFT']

In [31]:
import yfinance as yt
import pandas as pd

tickers_dict = dict()
for ticker in portfolio_tickers:
    ticker_data = yt.Ticker(ticker)
    
    tickers_dict[ticker] = ticker_data 
    temp_prices = ticker_data.history(period="max")
    temp_prices = temp_prices.reset_index()
    temp_prices['Ticker'] = ticker
    temp_prices.to_csv('data/stock_prices.csv')

In [32]:
df_prices = (pd.read_csv('data/stock_prices.csv',
                 parse_dates=['Date'],
                 index_col=['Date', 'Ticker'],
                 infer_datetime_format=True)
     .sort_index())

print(df_prices.info(null_counts=True))
with pd.HDFStore(DATA_STORE) as store:
    store.put('yfinance/stock/prices', df_prices)

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9286 entries, (datetime.datetime(1986, 3, 13, 0, 0, tzinfo=tzoffset(None, -18000)), 'MSFT') to (datetime.datetime(2023, 1, 13, 0, 0, tzinfo=tzoffset(None, -18000)), 'MSFT')
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    9286 non-null   int64  
 1   Open          9286 non-null   float64
 2   High          9286 non-null   float64
 3   Low           9286 non-null   float64
 4   Close         9286 non-null   float64
 5   Volume        9286 non-null   int64  
 6   Dividends     9286 non-null   float64
 7   Stock Splits  9286 non-null   float64
dtypes: float64(6), int64(2)
memory usage: 938.4+ KB
None


  print(df_prices.info(null_counts=True))
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->datetime,key->axis1_level0] [items->None]

  store.put('yfinance/stock/prices', df_prices)


## Loading Stock Meta Data

In [33]:
df_metadata = pd.read_csv('data/us_equities_meta_data.csv')
df_metadata = df_metadata.loc[df_metadata['ticker'].isin(portfolio_tickers)]
df_metadata

Unnamed: 0,ticker,name,lastsale,marketcap,ipoyear,sector,industry
135,AMZN,"Amazon.com, Inc.",1693.96,821950000000.0,1997.0,Consumer Services,Catalog/Specialty Distribution
191,AAPL,Apple Inc.,183.92,903990000000.0,1980.0,Technology,Computer Manufacturing
665,COKE,Coca-Cola Bottling Co. Consolidated,133.08,1240000000.0,1972.0,Consumer Non-Durables,Beverages (Production/Distribution)
977,ERIC,Ericsson,7.62,25050000000.0,,Technology,Radio And Television Broadcasting And Communic...
1583,INTC,Intel Corporation,49.47,230530000000.0,,Technology,Semiconductors
2048,MSFT,Microsoft Corporation,99.05,761020000000.0,1986.0,Technology,Computer Software: Prepackaged Software
2279,OKTA,"Okta, Inc.",50.82,5420000000.0,2017.0,Technology,Computer Software: Prepackaged Software
3690,MMM,3M Company,195.83,116260000000.0,,Health Care,Medical/Dental Instruments
3733,AEG,Aegon NV,5.86,11980000000.0,,Finance,Life Insurance
3764,BABA,Alibaba Group Holding Limited,184.75,474040000000.0,2014.0,Miscellaneous,Business Services


In [34]:
with pd.HDFStore(DATA_STORE) as store:
    store.put('us_equities/stocks', df_metadata.set_index('ticker'))