<p style='color:DarkRed; text-align:center; text-transform:uppercase; font-size:30px'>Project 7 : Trading Data with Yahoo Finance API</p>

Project 7 : (Trading Data) Collect trading data using the Yahoo Finance API and use online regression (from River) to predict the market values of the CAC40, S&P500, Google, Facebook & Amazon, or any other company.

Option 1: Use global index data streams for each of the 3 regions https://www.bloomberg.com/markets/stocks
For example, the S&P500 for the US and the CAC40 for the EU.

Option 2 : For each of these 5 countries, use the stock data of 1 major company.
For example, Google in the US, BNP Paribas in France, Alibaba in China; for Russia or England, use a major international company.
This option was initially given in the project.

Option 3 : Crypto markets, such as Bitcoin or other assets traded on Binance (https://www.binance.com/en/landing/data), or currency exchange-rate data (EUR/USD, USD/RUB)

For each option, each group should use at least 3 different data streams with online and adaptive regression in River, and compare the performance with batch regression models (scikit-learn).

To Do: Compare online Regression vs Batch Regression / Time Series forecasting and discuss the performance.
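
The comparison asked for above can be sketched with a test-then-train (prequential) loop. This is only a minimal illustration under stated assumptions: the stream is synthetic with an artificial slope change halfway through, `OnlineLinearRegression` is a hypothetical hand-written stand-in that merely mimics the `predict_one` / `learn_one` convention of River regressors (River itself is not imported), and the batch model is a closed-form least-squares fit on the first half of the stream that is then frozen.

```python
import random


class OnlineLinearRegression:
    """Hypothetical stand-in for an online regressor: plain SGD on squared error."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_one(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def learn_one(self, x, y):
        err = self.predict_one(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def fit_batch_1d(xs, ys):
    """Ordinary least squares for a single feature (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def make_stream(n=400, seed=0):
    """Synthetic stream (an assumption): the slope flips halfway, i.e. concept drift."""
    rng = random.Random(seed)
    for t in range(n):
        x = rng.uniform(-1.0, 1.0)
        slope = 2.0 if t < n // 2 else -1.0
        yield [x], slope * x + 0.5 + rng.gauss(0.0, 0.05)


stream = list(make_stream())
split = len(stream) // 2

# Batch model: fitted once on the first half, then never updated
slope, intercept = fit_batch_1d([x[0] for x, _ in stream[:split]],
                                [y for _, y in stream[:split]])

# Online model: predict first, then learn from the revealed target
online = OnlineLinearRegression(n_features=1)
online_abs_err, batch_abs_err = 0.0, 0.0
for t, (x, y) in enumerate(stream):
    y_pred = online.predict_one(x)
    online.learn_one(x, y)
    if t >= split:  # score both models on the second half only
        online_abs_err += abs(y - y_pred)
        batch_abs_err += abs(y - (slope * x[0] + intercept))

n_eval = len(stream) - split
online_mae = online_abs_err / n_eval
batch_mae = batch_abs_err / n_eval
print(f"MAE after drift  online: {online_mae:.3f}  batch: {batch_mae:.3f}")
```

On real price streams the gap depends on how strongly the data drifts, so the same loop should be run on each of the 3 chosen streams before drawing conclusions.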

Bonus: Use recent stock market data from 2022.

Online resources:
You can use the yfinance Python library to collect Yahoo Finance data as a stream https://pypi.org/project/yfinance/
You can compute time-series statistics and moving-average indicators (e.g. MACD) for feature engineering https://www.statsmodels.org/stable/tsa.html
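
As an illustration of the MACD feature mentioned above, here is a minimal dependency-free sketch. It assumes the common 12/26/9 spans and the recursive exponential moving average with smoothing factor 2 / (span + 1); the price list is made up for the example, and `ema` and `macd` are illustrative helper names, not functions from yfinance or statsmodels.

```python
def ema(values, span):
    """Recursive exponential moving average, alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def macd(closes, fast=12, slow=26, signal=9):
    """Return the MACD line, its signal line, and their difference (histogram)."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


# Made-up closing prices: a steady uptrend of 0.5 per day
closes = [100 + 0.5 * t for t in range(60)]
macd_line, signal_line, histogram = macd(closes)
print(round(macd_line[-1], 3), round(signal_line[-1], 3))
```

With a pandas Series, the same moving average should correspond to `series.ewm(span=span, adjust=False).mean()`, and the three outputs can be appended as extra feature columns for the regressors.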


Online resources: Use Alpha Vantage
https://www.alphavantage.co/support/#:~:text=Are%20there%20usage%2Ffrequency%20limits,volume%2C%20please%20visit%20premium%20membership.

Binance API Python package https://algotrading101.com/learn/binance-python-api-guide/#:~:text=The%20Binance%20API%20is%20a,to%20send%20and%20receive%20data.

BackTrader : https://www.backtrader.com/home/features/


<p style='color:Orange; text-align:center; text-transform:uppercase; font-size:25px'>First test of yfinance library</p>


In [2]:
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import yfinance as yf
import seaborn as sns
import pandas as pd
import numpy as np
import river

In [3]:
#from andresberejnoi/PublicNotebooks

def print_info(yf_tickers):
    if isinstance(yf_tickers,yf.Ticker):
        print(f"\n{'='*80}")
        print(f"{' '*33}{yf_tickers.info['symbol']}\n")
        for key in yf_tickers.info:
            print(f"--> {key:>29} : {yf_tickers.info[key]}")
            
    elif isinstance(yf_tickers,yf.Tickers):
        for ticker in yf_tickers.tickers:
            print(ticker)
            print(f"\n{'='*80}")
            print(f"{' '*33}{ticker.info['symbol']}\n")
            for key in ticker.info.keys():
                print(f"--> {key:>29} : {ticker.info[key]}")
                
def print_table(yf_tickers):
    if isinstance(yf_tickers,yf.Ticker):
        ticker = yf_tickers
        print(f"| {ticker.info.get('symbol','NONE'):<5} | {ticker.info.get('sector','NONE'):>25} | " + \
              f"{ticker.info.get('currency','NONE'):>4} | {ticker.info.get('quoteType','NONE'):>6} | " + \
              f"{ticker.info.get('shortName','NONE'):<35} |")

    elif isinstance(yf_tickers,yf.Tickers):
        for ticker in yf_tickers.tickers:
            print(f"| {ticker.info.get('symbol','NONE'):<5} | {ticker.info.get('sector','NONE'):>25} | " + \
                  f"{ticker.info.get('currency','NONE'):>4} | {ticker.info.get('quoteType','NONE'):>6} | " + \
                  f"{ticker.info.get('shortName','NONE'):<35} |")

In [9]:
# One Ticker
aapl = yf.Ticker('aapl')

# Multiple Tickers
list_tickers=['goog','aapl','baba','bnp.pa','ing', 'nvs']
tickers = yf.Tickers(list_tickers)

df = tickers.download(period='2y', group_by='ticker')


# get historical market data
hist = aapl.history(period="max")

[*********************100%***********************]  6 of 6 completed


In [7]:
# show the cash-flow statement; the commented lines show actions (dividends, splits) and share counts
aapl.cashflow
#aapl.dividends
#aapl.splits
#aapl.shares

Date,Open,High,Low,Close,Adj Close,Volume


**Bollinger Bands** (/ˈbɒlɪndʒər/) are a type of statistical chart characterizing the prices and volatility over time of a financial instrument or commodity.

Bollinger Bands consist of an N-period moving average (MA), an upper band at K times an N-period standard deviation above the moving average (MA + Kσ), and a lower band at K times an N-period standard deviation below the moving average (MA − Kσ). The chart thus expresses arbitrary choices or assumptions of the user, and is not strictly about the price data alone.

Here K is set to 1.96 (the conventional default is N = 20 and K = 2), the 97.5th percentile point of the standard normal distribution, since $P(|X| < 1.96) \approx 0.95$ for $X \sim N(0, 1)$: about 95% of the area under a normal curve lies within 1.96 standard deviations of the mean, as for the 95% confidence intervals derived from the central limit theorem (CLT).
<details>
<summary>Financial use</summary>
The use of Bollinger Bands varies widely among traders. Some traders buy when price touches the lower Bollinger Band and exit when price touches the moving average in the center of the bands. Other traders buy when price breaks above the upper Bollinger Band or sell when price falls below the lower Bollinger Band.[4] Moreover, the use of Bollinger Bands is not confined to stock traders; options traders, most notably implied volatility traders, often sell options when Bollinger Bands are historically far apart or buy options when the Bollinger Bands are historically close together, in both instances, expecting volatility to revert towards the average historical volatility level for the stock.
</details>
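
The band definition and the first trading rule described above (buy when the price touches the lower band, exit when it is back at the moving average) can be sketched without any dependency. This is an illustrative sketch only: the prices are hypothetical, `bollinger` and `touch_signals` are made-up helper names, and the sample standard deviation is used, which should match the default of pandas `rolling(...).std()`.

```python
from statistics import NormalDist, mean, stdev

# 97.5th percentile of the standard normal distribution (about 1.96)
K = NormalDist().inv_cdf(0.975)


def bollinger(closes, window=20, k=K):
    """Return (middle, upper, lower); entries are None until a full window exists."""
    middle, upper, lower = [], [], []
    for i in range(len(closes)):
        if i + 1 < window:
            middle.append(None)
            upper.append(None)
            lower.append(None)
            continue
        chunk = closes[i + 1 - window:i + 1]
        m, s = mean(chunk), stdev(chunk)  # sample standard deviation
        middle.append(m)
        upper.append(m + k * s)
        lower.append(m - k * s)
    return middle, upper, lower


def touch_signals(closes, middle, lower):
    """'buy' when the close reaches the lower band, 'exit' when it regains the middle band."""
    signals, in_position = [], False
    for c, m, lo in zip(closes, middle, lower):
        if m is None:
            signals.append(None)
        elif not in_position and c <= lo:
            signals.append('buy')
            in_position = True
        elif in_position and c >= m:
            signals.append('exit')
            in_position = False
        else:
            signals.append(None)
    return signals


# Hypothetical prices: a flat market, one sharp dip, then a recovery
closes = [50 + 0.2 * (-1) ** t for t in range(25)] + [47.0] + [50.0] * 10
middle, upper, lower = bollinger(closes)
signals = touch_signals(closes, middle, lower)
print([(t, s) for t, s in enumerate(signals) if s])
```

The rule is deliberately naive (no transaction costs, no stop-loss); it only shows how the bands translate into discrete signals.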

In [51]:
# list_tickers=['dow','aapl','msft','bnp.pa','voo']
# tickers = yf.Tickers(list_tickers)
# df = tickers.download(period='2y', group_by='ticker')
df_bnp_pa = df['BNP.PA'].copy()

df_bnp_pa['Middle Band'] = df_bnp_pa['Close'].rolling(window=20).mean()
df_bnp_pa['Upper Band'] = df_bnp_pa['Middle Band'] + 1.96*df_bnp_pa['Close'].rolling(window=20).std()
df_bnp_pa['Lower Band'] = df_bnp_pa['Middle Band'] - 1.96*df_bnp_pa['Close'].rolling(window=20).std()

fig = go.Figure()

# Set up traces
fig.add_trace(go.Scatter(x=df_bnp_pa.index, y=df_bnp_pa['Middle Band'], line=dict(color='blue', width=.7), name='Middle Band'))
fig.add_trace(go.Scatter(x=df_bnp_pa.index, y=df_bnp_pa['Upper Band'], line=dict(color='red', width=1.5), name='Upper Band (sell)'))
fig.add_trace(go.Scatter(x=df_bnp_pa.index, y=df_bnp_pa['Lower Band'], line=dict(color='green', width=1.5), name='Lower Band (buy)'))

fig.add_trace(go.Candlestick(x=df_bnp_pa.index,
                            open=df_bnp_pa['Open'],
                            high=df_bnp_pa['High'],
                            low=df_bnp_pa['Low'],
                            close=df_bnp_pa['Close'], name='market data'))

# Title
fig.update_layout(
    title='Bollinger Band Strategy',
    yaxis_title='BNP Paribas Stock Price (EUR per share)'
)
fig.show()


In [20]:
df_bnp_pa

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Middle Band,Upper Band,Lower Band
2020-12-16,40.005756,40.258901,39.033146,39.237438,4048651.0,0.0,0.0,,,
2020-12-17,39.397319,39.606053,38.815531,39.121967,3015518.0,0.0,0.0,,,
2020-12-18,38.922117,39.708200,38.686737,38.695618,7713150.0,0.0,0.0,,,
2020-12-21,36.985781,37.625306,36.266319,37.052399,5402160.0,0.0,0.0,,,
2020-12-22,37.216720,37.971712,37.216720,37.736332,2705483.0,0.0,0.0,,,
...,...,...,...,...,...,...,...,...,...,...
2022-12-12,52.790001,53.040001,52.450001,52.680000,1756729.0,0.0,0.0,52.7135,53.624943,51.802057
2022-12-13,53.029999,54.080002,52.700001,53.630001,3030839.0,0.0,0.0,52.7715,53.758906,51.784094
2022-12-14,53.770000,54.090000,53.439999,53.509998,2177111.0,0.0,0.0,52.8425,53.827868,51.857132
2022-12-15,53.279999,53.450001,51.590000,51.669998,4148176.0,0.0,0.0,52.8005,53.904776,51.696224


<p style='color:Orange; text-align:center; text-transform:uppercase; font-size:25px'>Batch Regression for time series forecasting</p>


In [12]:
list_tickers=['goog','aapl','baba','bnp.pa','ing', 'nvs']
tickers = yf.Tickers(list_tickers)
df = tickers.download(period='2y', group_by='ticker')
df.head()

[*********************100%***********************]  6 of 6 completed


Ticker,BABA,BABA,BABA,BABA,BABA,BABA,BABA,NVS,NVS,NVS,...,BNP.PA,BNP.PA,BNP.PA,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL,AAPL
Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Open,High,Low,...,Volume,Dividends,Stock Splits,Open,High,Low,Close,Volume,Dividends,Stock Splits
2020-12-29,231.759995,239.190002,229.600006,236.259995,69715900.0,0.0,0.0,85.689972,86.446657,85.588465,...,1623551.0,0.0,0.0,136.412443,137.143656,132.746445,133.270157,121047300.0,0.0,0.0
2020-12-30,243.348007,243.389999,234.645004,238.389999,44812300.0,0.0,0.0,86.861906,87.701645,86.714264,...,1374724.0,0.0,0.0,133.971772,134.376913,131.817623,132.133835,96452100.0,0.0,0.0
2020-12-31,237.460007,238.919998,231.026993,232.729996,23451800.0,0.0,0.0,86.908051,87.231025,86.151367,...,827264.0,0.0,0.0,132.48957,133.141745,130.157563,131.116058,99116600.0,0.0,0.0
2021-01-04,226.5,230.380005,225.039993,227.850006,24897900.0,0.0,0.0,87.673965,87.766242,86.492799,...,3025708.0,0.0,0.0,131.936187,132.025115,125.256372,127.874939,143301900.0,0.0,0.0
2021-01-05,229.050003,240.759995,228.119995,240.399994,35823800.0,0.0,0.0,87.286388,87.563227,86.474338,...,2852830.0,0.0,0.0,127.361115,130.177315,126.906565,129.455963,97664900.0,0.0,0.0


<p style='color:yellow; text-align:center; text-transform:uppercase; font-size:20px'>Neural network : LSTM</p>


In [60]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import Dataset, DataLoader


In [59]:
def Scale(train, test):
    """
    Scale Data
    """
    scaler = MinMaxScaler()
    train = np.array(train)
    test = np.array(test)
    train = scaler.fit_transform(train.reshape(train.shape[0], -1))
    test = scaler.transform(test.reshape(test.shape[0], -1))
    return scaler, train, test

In [None]:
class TimeDataset(Dataset):
    """Sliding-window dataset: `window` consecutive rows are used to predict the next row."""
    def __init__(self, data, transform=None, window=50):
        # float32 tensor of shape (n_samples, n_features)
        self.data = torch.as_tensor(np.asarray(data), dtype=torch.float32)
        if self.data.dim() == 1:
            self.data = self.data.unsqueeze(1)
        self.window = window
        self.transform = transform
        # No zero-padding: padded rows would become fake targets for the last windows

    def __getitem__(self, index):
        x = self.data[index:index+self.window]
        y = self.data[index+self.window]

        if self.transform:
            x = self.transform(x)
        # Return a tuple so that `inputs, labels = data` works in the training loop
        return x, y

    def __len__(self):
        # max(..., 0) guards against series shorter than the window
        return max(len(self.data) - self.window, 0)

In [62]:
# Create datasets 67 136 nan
from sklearn.model_selection import train_test_split
def load_data(comp, selected_price='Close'):
    df_comp = df[comp].copy()
    # Fill missing values from the nearest rows
    imputer = KNNImputer(n_neighbors=4)
    #df_bnp_pa['Close'].rolling(window=20).mean()
    df_comp = pd.DataFrame(imputer.fit_transform(df_comp), columns=df_comp.columns)
    # Chronological split (shuffle=False): the test set must come after the training set
    X_train, X_test = train_test_split(df_comp[selected_price], test_size=0.2, shuffle=False)

    # The scaler is fitted on the training part only, to avoid look-ahead leakage
    _, X_train, X_test = Scale(X_train, X_test)
    return X_train, X_test

X_train, X_test = load_data('BNP.PA')

In [None]:
batch_size = 10
comp = 'BNP.PA'

train, val = load_data(comp)
training_set = TimeDataset(train)
validation_set = TimeDataset(val)
training_loader = DataLoader(training_set, batch_size=batch_size, shuffle=False, num_workers=0)
validation_loader = DataLoader(validation_set, batch_size=batch_size, shuffle=False, num_workers=0)

In [16]:
# Model LSTM

input_dim = 1
hidden_dim = 32
num_layers = 2
out_dim = 1

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, out_dim, dropout=0.2):
        super(LSTM, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first = True)
        self.fc = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        # h0 and c0 size : num_layers, batch, hidden dim (plain zeros, no gradient needed)
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim, device=x.device)
        out, (hn, cn) = self.lstm(x, (h0, c0))

        # keep the output of the last time step, apply dropout, then project to out_dim
        out = self.fc(self.dropout(out[:, -1, :]))

        return out


In [23]:
model = LSTM(input_dim=input_dim, hidden_dim=hidden_dim, num_layers=num_layers, out_dim=out_dim, dropout=0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # name matches the one used in train_one_epoch
loss_fn = torch.nn.MSELoss()
print(model)

LSTM(
  (dropout): Dropout(p=0.2, inplace=False)
  (lstm): LSTM(1, 32, num_layers=2, batch_first=True)
  (fc): Linear(in_features=32, out_features=1, bias=True)
)


In [None]:
### TRAINING ###

def train_one_epoch(epoch_index, tb_writer):
    running_loss = 0.
    last_loss = 0.

    # Here, we use enumerate(training_loader) instead of
    # iter(training_loader) so that we can track the batch
    # index and do some intra-epoch reporting
    for i, data in enumerate(training_loader):
        # Every data instance is an input + label pair
        inputs, labels = data

        # Zero your gradients for every batch!
        optimizer.zero_grad()

        # Make predictions for this batch
        outputs = model(inputs)

        # Compute the loss and its gradients
        loss = loss_fn(outputs, labels)
        loss.backward()

        # Adjust learning weights
        optimizer.step()

        # Gather data for the end-of-epoch report
        running_loss += loss.item()

    # This dataset yields far fewer than 1000 batches per epoch, so report
    # the mean loss per batch once at the end of the epoch
    last_loss = running_loss / max(len(training_loader), 1)
    print('  epoch {} loss: {}'.format(epoch_index + 1, last_loss))
    tb_writer.add_scalar('Loss/train', last_loss, epoch_index + 1)

    return last_loss

In [None]:
from datetime import datetime
from torch.utils.tensorboard import SummaryWriter

# Initializing in a separate cell so we can easily add more epochs to the same run
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
writer = SummaryWriter('runs/lstm_trainer_{}'.format(timestamp))
epoch_number = 0

# Create data loaders for our datasets; shuffle for training, not for validation
training_loader = torch.utils.data.DataLoader(training_set, batch_size=4, shuffle=True, num_workers=2)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=4, shuffle=False, num_workers=2)

EPOCHS = 5

best_vloss = 1_000_000.

for epoch in range(EPOCHS):
    print('EPOCH {}:'.format(epoch_number + 1))

    # Make sure gradient tracking is on, and do a pass over the data
    model.train(True)
    avg_loss = train_one_epoch(epoch_number, writer)

    # We don't need gradients on to do reporting
    model.train(False)

    running_vloss = 0.0
    with torch.no_grad():
        for i, vdata in enumerate(validation_loader):
            vinputs, vlabels = vdata
            voutputs = model(vinputs)
            vloss = loss_fn(voutputs, vlabels)
            running_vloss += vloss.item()

    avg_vloss = running_vloss / (i + 1)
    print('LOSS train {} valid {}'.format(avg_loss, avg_vloss))

    # Log the running loss averaged per batch
    # for both training and validation
    writer.add_scalars('Training vs. Validation Loss',
                    { 'Training' : avg_loss, 'Validation' : avg_vloss },
                    epoch_number + 1)
    writer.flush()

    # Track best performance, and save the model's state
    if avg_vloss < best_vloss:
        best_vloss = avg_vloss
        model_path = 'model_{}_{}'.format(timestamp, epoch_number)
        torch.save(model.state_dict(), model_path)

    epoch_number += 1

<p style='color:yellow; text-align:center; text-transform:uppercase; font-size:20px'>Autoregressive model : ARIMA</p>


In [None]:
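from statsmodels.tsa.stattools import acf, pacf
from scipy.ndimage import shift  # the scipy.ndimage.interpolation namespace is deprecated
from statsmodels.tsa.arima.model import ARIMA  # statsmodels.tsa.arima_model was removed in recent releases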