# Experiments with Machine Learning

**APPROACH:** Predict the price using Machine Learning models, then decide to go long or short.

First, import the necessary libraries:

In [1]:
import pandas as pd 
import yfinance as yf
import math
import numpy as np
# from tensorflow.keras import Sequential
# from tensorflow.keras.layers import LSTM, Dense
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score, accuracy_score
from backtesting import Backtest
from business_logic.decision_making.data_prepration import get_OHLC_df, label_OHLC_df, split_train_test, prepare_data_train_model
from business_logic.decision_making.strategies.BinaryClassificationStrategy import BinaryClassificationStrategy
from business_logic.models.portfolio import Portfolio
from business_logic.models.stock import Stock
from enums import Position
from business_logic.decision_making.strategies.strategy_testing import test_strategy

## Prepare data
Fetch three years of AAPL data (2018-04-02 to 2021-03-30). Rows up to 2020-03-31 are used for training, and the remaining year is used for testing.

In [2]:
aapl = yf.Ticker('AAPL')
orig_data = aapl.history(start='2018-04-02', end='2021-03-31') 
orig_data.shape

(755, 7)

In [3]:
orig_data.index

DatetimeIndex(['2018-04-02', '2018-04-03', '2018-04-04', '2018-04-05',
               '2018-04-06', '2018-04-09', '2018-04-10', '2018-04-11',
               '2018-04-12', '2018-04-13',
               ...
               '2021-03-17', '2021-03-18', '2021-03-19', '2021-03-22',
               '2021-03-23', '2021-03-24', '2021-03-25', '2021-03-26',
               '2021-03-29', '2021-03-30'],
              dtype='datetime64[ns]', name='Date', length=755, freq=None)

As shown above, the data fetched from Yahoo Finance is a DataFrame indexed and sorted by date, which is very convenient. The next step is to label the data and then split it into train and test sets:

In [4]:
data = get_OHLC_df(orig_data)
data = label_OHLC_df(data, 2)
data

Date,Open,High,Low,Close,Volume
2018-04-04,39.768909,41.488656,39.742377,1.0,138422000
2018-04-05,41.626139,42.024116,41.505539,1.0,107732800
2018-04-06,41.237811,41.602020,40.569688,-1.0,140021200
2018-04-09,40.974904,41.749151,40.967668,-1.0,116070800
2018-04-10,41.727450,41.968650,41.372887,1.0,113634400
...,...,...,...,...,...
2021-03-24,122.820000,122.900002,120.070000,-1.0,88530500
2021-03-25,119.540001,121.660004,119.000000,-1.0,98844700
2021-03-26,120.349998,121.480003,118.919998,1.0,93958900
2021-03-29,121.650002,122.580002,120.730003,1.0,80819200


In [5]:
data.Close.unique()

array([ 1., -1.])
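
The labelled frame replaces `Close` with a direction label. The rows shown are consistent with labelling each day by the sign of the change in `Close` relative to `period` rows earlier and dropping the first `period` rows, but `label_OHLC_df` itself is not shown here, so the sketch below is only a guess at that behaviour. The name `label_by_past_change` and the toy prices are hypothetical.

```python
import numpy as np
import pandas as pd


def label_by_past_change(df, period, small_change_threshold=None):
    """Hypothetical stand-in for label_OHLC_df (an assumption, not the project's code).

    Replaces Close with +1.0 / -1.0 for a rise / fall versus `period` rows earlier.
    With a threshold, relative changes smaller than it are labelled 0.0.
    """
    out = df.copy()
    change = out['Close'] / out['Close'].shift(period) - 1
    labels = np.sign(change)
    if small_change_threshold is not None:
        labels = labels.where(change.abs() >= small_change_threshold, 0.0)
    out['Close'] = labels
    # The first `period` rows have no earlier price to compare against.
    return out.iloc[period:]


# Toy prices: the last close is only about 0.05% above the close two rows earlier.
prices = pd.DataFrame(
    {'Close': [40.20, 40.62, 41.39, 41.68, 40.61, 41.70]},
    index=pd.date_range('2018-04-02', periods=6, freq='B'),
)
two_class = label_by_past_change(prices, 2)
three_class = label_by_past_change(prices, 2, small_change_threshold=0.004)
print(two_class['Close'].tolist())
print(three_class['Close'].tolist())
```

In this toy series the last row is labelled `1.0` without a threshold and `0.0` with a 0.4% threshold, which is the distinction between the 2-class and 3-class experiments in this notebook.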

In [6]:
split_date = np.datetime64('2020-03-31')
X_train, X_test, y_train, y_test = split_train_test(data, split_date)
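
For a time series, the split has to be chronological rather than random so that the model never trains on rows that come after the test period. The sketch below shows one way such a split could look. `chronological_split` is a hypothetical name, and both the choice of features (every column except the label) and the inclusion of the split date in the training set are assumptions, since `split_train_test` is not shown here.

```python
import numpy as np
import pandas as pd


def chronological_split(df, split_date, target='Close'):
    """Hypothetical sketch: rows up to split_date train, later rows test."""
    train = df[df.index <= split_date]
    test = df[df.index > split_date]
    features = [c for c in df.columns if c != target]
    return train[features], test[features], train[target], test[target]


idx = pd.date_range('2020-03-27', periods=6, freq='B')
frame = pd.DataFrame(
    {'Open': np.arange(6.0), 'Close': [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]},
    index=idx,
)
X_tr, X_te, y_tr, y_te = chronological_split(frame, pd.Timestamp('2020-03-31'))
print(X_tr.index.max(), X_te.index.min())
```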

## Build models and test their performance

First, I will create a Random Forest Classifier to predict whether the price will go up or down. My strategy will then decide to go long or short accordingly. For this first experiment, I decided to create a classifier with default hyperparameters.

In [7]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
f1_score(y_test, y_pred)

0.6149068322981368

In [8]:
accuracy_score(y_test, y_pred)

0.5079365079365079

In [9]:
class TwoClassRandomForestStrategy(BinaryClassificationStrategy):
    price_delta = .004

    def init(self):
        self.clf = rfc

    def next(self):
        row = self.data.df.iloc[-1:]
        X = row[['Open', 'High', 'Low', 'Volume']]
        pred = self.clf.predict(X)[0]

        self.decide_trade(pred)

        # If a position has been held for more than 2 days, make its stop-loss more aggressive
        current_time = self.data.index[-1]
        high, low = self.data.High, self.data.Low
        for trade in self.trades:
            if current_time - trade.entry_time > pd.Timedelta('2 days'):
                if trade.is_long:
                    trade.sl = max(trade.sl, low)
                else:
                    trade.sl = min(trade.sl, high)

In [10]:
test_data = orig_data[orig_data.index > split_date]
bt = Backtest(test_data, TwoClassRandomForestStrategy, commission=.0002, margin=.05)
bt.run()

Start                     2020-04-01 00:00:00
End                       2021-03-30 00:00:00
Duration                    363 days 00:00:00
Exposure Time [%]                   99.203187
Equity Final [$]                   352.022231
Equity Peak [$]                       10000.0
Return [%]                         -96.479778
Buy & Hold Return [%]              100.621644
Return (Ann.) [%]                  -96.526402
Volatility (Ann.) [%]                1.466514
Sharpe Ratio                              0.0
Sortino Ratio                             0.0
Calmar Ratio                              0.0
Max. Drawdown [%]                  -96.479778
Avg. Drawdown [%]                  -96.479778
Max. Drawdown Duration      362 days 00:00:00
Avg. Drawdown Duration      362 days 00:00:00
# Trades                                  250
Win Rate [%]                              6.4
Best Trade [%]                        0.58093
Worst Trade [%]                     -3.204015
Avg. Trade [%]                    

In this first attempt, the model lost almost all of our money: a return of about -96% against a buy-and-hold return of about +101%. This is understandable, because the classifier uses only default hyperparameters and reaches an accuracy of only about 51%. It will need a lot of fine-tuning.

Also, the current strategy is very sensitive to price changes, because even the slightest change is classified as either up or down. If we are long and the price has a small hiccup while the upward trend remains, our bot would sell all the shares because of that hiccup.
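
The sensitivity problem can be pictured with a toy decision rule. The sketch below is an assumption for illustration, not the project's `decide_trade`: the hypothetical `next_position` opens a position in the predicted direction, closes it on an opposing prediction, and treats a label of 0 as "no clear move, keep the current position".

```python
def next_position(current, pred):
    """Hypothetical decision rule (an assumption, not the project's decide_trade).

    current: +1 long, -1 short, 0 flat.  pred: +1 up, -1 down, 0 no clear move.
    """
    if pred == 0:
        return current  # a "no clear move" prediction leaves the position alone
    if current == 0:
        return pred     # open a position in the predicted direction
    if pred != current:
        return 0        # an opposing prediction closes the position
    return current


def count_changes(preds):
    """Count how many times the position changes over a run of predictions."""
    position, changes = 0, 0
    for pred in preds:
        new = next_position(position, pred)
        changes += int(new != position)
        position = new
    return changes


# An uptrend with one small dip: labelled as "down" versus labelled as 0.
two_class = [1, 1, -1, 1, 1]
three_class = [1, 1, 0, 1, 1]
print(count_changes(two_class), count_changes(three_class))  # prints "3 1"
```

With only up and down labels the dip forces an exit and a re-entry, and each of those costs commission. With a neutral label the position is simply held through the dip.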

## Save to MongoDB

In [11]:
import sys
import os
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(os.getcwd()), '..')))
from database.mongo_client import get_mongo_db_conn
from business_logic.model_crud import save_model_to_mongo
import pymongo
import time
import pickle

save_model_to_mongo(rfc, "RandomForestDefault")
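
`save_model_to_mongo` is a project helper whose body is not shown here. As a rough idea of what storing a fitted model in a document database can involve, the sketch below pickles a small model into a dictionary and restores it. The helper names and document fields are hypothetical, and nothing is written to a database.

```python
import pickle
import time

from sklearn.ensemble import RandomForestClassifier


def model_to_document(model, name):
    """Hypothetical helper: wrap a pickled model in a dict a document store could hold."""
    return {'name': name, 'model': pickle.dumps(model), 'created_time': time.time()}


def model_from_document(doc):
    """Inverse of model_to_document. Only unpickle data from a trusted source."""
    return pickle.loads(doc['model'])


model = RandomForestClassifier(n_estimators=5, random_state=0)
model.fit([[0.0], [1.0], [2.0], [3.0]], [-1, -1, 1, 1])

doc = model_to_document(model, 'RandomForestDefault')
restored = model_from_document(doc)
# With pymongo, a document like this could be stored via collection.insert_one(doc).
print(type(doc['model']).__name__, restored.predict([[0.0], [3.0]]))
```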

## 3-class labelled data

In [12]:
data = get_OHLC_df(orig_data)
data = label_OHLC_df(data, 2, small_change_threshold=0.004)
data

Date,Open,High,Low,Close,Volume
2018-04-04,39.768909,41.488656,39.742377,1.0,138422000
2018-04-05,41.626139,42.024116,41.505539,1.0,107732800
2018-04-06,41.237811,41.602020,40.569688,-1.0,140021200
2018-04-09,40.974904,41.749151,40.967668,-1.0,116070800
2018-04-10,41.727450,41.968650,41.372887,1.0,113634400
...,...,...,...,...,...
2021-03-24,122.820000,122.900002,120.070000,-1.0,88530500
2021-03-25,119.540001,121.660004,119.000000,-1.0,98844700
2021-03-26,120.349998,121.480003,118.919998,1.0,93958900
2021-03-29,121.650002,122.580002,120.730003,1.0,80819200


In [13]:
data.Close.value_counts()

 1.0    437
-1.0    257
 0.0     59
Name: Close, dtype: int64

The dataset is imbalanced (437 up, 257 down, 59 flat), so accuracy_score alone should not be taken too seriously.
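
To put a number on how misleading accuracy can be here, the sketch below scores a trivial predictor that always answers "up" against the label counts shown above. The counts cover the whole labelled dataset rather than the test split, and macro averaging of the F1 score is an assumption made for this illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Label counts copied from the value_counts() output above.
y_true = np.array([1.0] * 437 + [-1.0] * 257 + [0.0] * 59)
y_majority = np.ones_like(y_true)  # always predict "up"

baseline_accuracy = accuracy_score(y_true, y_majority)
# zero_division=0 scores the two never-predicted classes as 0 without a warning.
baseline_macro_f1 = f1_score(y_true, y_majority, average='macro', zero_division=0)
print(f'accuracy: {baseline_accuracy:.3f}, macro F1: {baseline_macro_f1:.3f}')
```

The constant predictor reaches about 58% accuracy without learning anything, while its macro F1 is only about 0.24 because two of the three classes are never predicted.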

In [14]:
X_train, X_test, y_train, y_test = split_train_test(data, split_date)

In [15]:
class ThreeClassRandomForestStrategy(BinaryClassificationStrategy):
    def init(self):
        self.clf = RandomForestClassifier()
        self.prepare_model(self.clf)

    def next(self):
        print(self.position)
        if self.data.df.index[-1] < self.split_date:
            return

        row = self.data.df.iloc[-1:]
        X = row[['Open', 'High', 'Low', 'Volume']]
        pred = self.clf.predict(X)[0]
        # print(f'Date: {str(row.index[0])} -- Pred: {pred} -- Actual: {row.Close.values[0]}')
        self.decide_trade(pred)

bt = Backtest(orig_data, ThreeClassRandomForestStrategy, commission=.0002, margin=.05)
stats = bt.run()

f1 score: 0.30321261296919316
<Position: 0 (0 trades)>
...

In [16]:
stats

Start                     2018-04-02 00:00:00
End                       2021-03-30 00:00:00
Duration                   1093 days 00:00:00
Exposure Time [%]                   33.112583
Equity Final [$]                    347.55818
Equity Peak [$]                       10000.0
Return [%]                         -96.524418
Buy & Hold Return [%]              198.235923
Return (Ann.) [%]                  -67.413951
Volatility (Ann.) [%]                8.379155
Sharpe Ratio                              0.0
Sortino Ratio                             0.0
Calmar Ratio                              0.0
Max. Drawdown [%]                  -96.524418
Avg. Drawdown [%]                  -96.524418
Max. Drawdown Duration      364 days 00:00:00
Avg. Drawdown Duration      364 days 00:00:00
# Trades                                  251
Win Rate [%]                         5.976096
Best Trade [%]                        0.58093
Worst Trade [%]                     -3.204015
Avg. Trade [%]                    

In [17]:
stat_df = stats['_trades']
stat_df

Unnamed: 0,Size,EntryBar,ExitBar,EntryPrice,ExitPrice,PnL,ReturnPct,EntryTime,ExitTime,Duration
0,653,504,504,61.163220,61.150990,-7.986319,-0.000200,2020-04-01,2020-04-01,0 days
1,670,505,505,59.634753,59.525183,-73.411891,-0.001837,2020-04-02,2020-04-02,0 days
2,658,506,506,60.245149,60.233102,-7.926676,-0.000200,2020-04-03,2020-04-03,0 days
3,-637,507,507,62.230069,62.242517,-7.929697,-0.000200,2020-04-06,2020-04-06,0 days
4,-589,508,508,67.165826,67.179262,-7.913717,-0.000200,2020-04-07,2020-04-07,0 days
...,...,...,...,...,...,...,...,...,...,...
246,-12,751,751,119.516093,120.570356,-12.651161,-0.008821,2021-03-25,2021-03-25,0 days
247,11,752,752,120.374068,120.107636,-2.930753,-0.002213,2021-03-26,2021-03-26,0 days
248,-11,753,753,121.625672,121.694839,-0.760843,-0.000569,2021-03-29,2021-03-29,0 days
249,-11,754,754,120.085979,120.110001,-0.264242,-0.000200,2021-03-30,2021-03-30,0 days


In [18]:
def test_print_results(clf):
    portfolio = Portfolio()
    symbol = 'AAPL'
    split_date = np.datetime64('2020-03-31')
    test_data = orig_data[orig_data.index > split_date]
    score = prepare_data_train_model(clf, orig_data, split_date, 2, 0.004)
    print('Model f1 score: ', score)
    print('\nTrading history: ')
    test_strategy(clf, portfolio, symbol, test_data)
    print('-'*20)
    print('Final balance: ', portfolio.balance)

In [19]:
clf = RandomForestClassifier()
test_print_results(clf)

Model f1 score:  0.336771291839785

Trading history: 
Initial balance:  4000
Pred: [1.] -- Added stock: AAPL, 59.76424026489258, 13 -- Total balance: 3223.0648765563965
Pred: [-1.] -- Dropped stock: AAPL, 59.76424026489258, 13 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 65.1127700805664, 12 -- Total balance: 3218.646759033203
Pred: [-1.] -- Dropped stock: AAPL, 65.1127700805664, 12 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 66.4821548461914, 12 -- Total balance: 3202.214141845703
Pred: [-1.] -- Dropped stock: AAPL, 66.4821548461914, 12 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 70.56053924560547, 11 -- Total balance: 3223.83406829834
Pred: [-1.] -- Dropped stock: AAPL, 70.56053924560547, 11 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 68.6999740600586, 11 -- Total balance: 3244.3002853393555
Pred: [-1.] -- Dropped stock: AAPL, 68.6999740600586, 11 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 72.72626495361328, 11 -- 

In [20]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

search_params = [
    {'n_estimators': [5, 10, 30, 100], 'criterion': ['gini', 'entropy']}, 
]
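# `scoring` must be a scorer name or scorer callable, not a raw metric function;
# macro averaging is an assumption that suits the 3-class labels.
g_search = GridSearchCV(clf, search_params, scoring='f1_macro', return_train_score=True, n_jobs=-1)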
g_search.fit(X_train, y_train)
optimized = g_search.best_estimator_
test_print_results(optimized)

Model f1 score:  0.2794882639800164

Trading history: 
Initial balance:  4000
Pred: [1.] -- Added stock: AAPL, 59.76424026489258, 13 -- Total balance: 3223.0648765563965
Pred: [-1.] -- Dropped stock: AAPL, 59.76424026489258, 13 -- Total balance: 4000.0
Pred: [-1.] -- Added stock: AAPL, 65.1127700805664, -12 -- Total balance: 4781.353240966797
Pred: [1.] -- Dropped stock: AAPL, 65.1127700805664, -12 -- Total balance: 4000.0
Pred: [-1.] -- Added stock: AAPL, 71.21049499511719, -11 -- Total balance: 4783.315444946289
Pred: [1.] -- Dropped stock: AAPL, 71.21049499511719, -11 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 71.12120056152344, 11 -- Total balance: 3217.666793823242
Pred: [-1.] -- Dropped stock: AAPL, 71.12120056152344, 11 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 68.6999740600586, 11 -- Total balance: 3244.3002853393555
Pred: [-1.] -- Dropped stock: AAPL, 68.6999740600586, 11 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 68.49406433105469
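
A grid search over a classifier like this one can be given an explicit scorer and a cross-validation splitter that respects time order. The sketch below runs on synthetic data. Both `make_scorer(f1_score, average='macro')` and `TimeSeriesSplit` are assumptions of this sketch rather than what the notebook ran.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))               # synthetic stand-in for the OHLV features
y = rng.choice([-1.0, 0.0, 1.0], size=150)  # synthetic 3-class labels

# GridSearchCV takes a scorer name or a scorer object, so the metric is wrapped.
macro_f1 = make_scorer(f1_score, average='macro', zero_division=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [5, 10], 'criterion': ['gini', 'entropy']},
    scoring=macro_f1,
    cv=TimeSeriesSplit(n_splits=3),         # each fold trains on earlier rows only
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`TimeSeriesSplit` keeps every validation fold strictly after its training rows, which avoids scoring the model on dates that precede the data it was fitted on.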

Even after optimization, the model's performance did not improve much. This can be put down to the fact that the dataset is a time series: the order of the data points matters, but the Random Forest treats each row independently of its neighbours. Let's enrich the dataset with some lagging indicators, so that each row carries some information about the past, and see how the model performs:

In [21]:
from ta.volatility import BollingerBands
from ta.trend import SMAIndicator

def add_lagging_indicators(df):
    bands = BollingerBands(df.Close, fillna=True)
    sma10 = SMAIndicator(df.Close, 10, fillna=True)
    sma15 = SMAIndicator(df.Close, 15, fillna=True)
    sma20 = SMAIndicator(df.Close, 20, fillna=True)
    df['BBandsHigh'] = bands.bollinger_hband()
    df['BBandsLow'] = bands.bollinger_lband()
    df['SMA_10'] = sma10.sma_indicator()
    df['SMA_15'] = sma15.sma_indicator()
    # NOTE: this column is named SMA_25 but holds the 20-day SMA computed above.
    df['SMA_25'] = sma20.sma_indicator()

add_lagging_indicators(orig_data)
orig_data

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,BBandsHigh,BBandsLow,SMA_10,SMA_15,SMA_25
2018-04-02,40.193425,40.748184,39.670024,40.203072,150347200,0.0,0.0,40.203072,40.203072,40.203072,40.203072,40.203072
2018-04-03,40.434621,40.702352,39.768913,40.615520,121112000,0.0,0.0,40.821745,39.996847,40.409296,40.409296,40.409296
2018-04-04,39.768909,41.488656,39.742377,41.392178,138422000,0.0,0.0,41.722888,39.750959,40.736923,40.736923,40.736923
2018-04-05,41.626139,42.024116,41.505539,41.679203,107732800,0.0,0.0,42.153600,39.791387,40.972493,40.972493,40.972493
2018-04-06,41.237811,41.602020,40.569688,40.613106,140021200,0.0,0.0,41.995455,39.805777,40.900616,40.900616,40.900616
...,...,...,...,...,...,...,...,...,...,...,...,...
2021-03-24,122.820000,122.900002,120.070000,120.089996,88530500,0.0,0.0,126.956670,117.048329,122.384999,121.521999,122.002499
2021-03-25,119.540001,121.660004,119.000000,120.589996,98844700,0.0,0.0,126.956050,117.008949,122.247999,121.552666,121.982499
2021-03-26,120.349998,121.480003,118.919998,121.209999,93958900,0.0,0.0,126.955050,117.004948,122.265999,121.538666,121.979999
2021-03-29,121.650002,122.580002,120.730003,121.389999,80819200,0.0,0.0,125.862371,117.457627,122.005999,121.873999,121.659999
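
In the first rows of the table, every moving-average column equals the running mean of the closing prices seen so far, which is what a rolling window that accepts partial windows produces. The sketch below reproduces those rows with plain pandas. `rolling_indicators` is a hypothetical name, and the population standard deviation and two-standard-deviation band width are assumptions that happen to match the first rows shown, not a statement about how the `ta` library is implemented.

```python
import pandas as pd


def rolling_indicators(close, window=20, n_std=2):
    """Plain-pandas sketch of an SMA and Bollinger bands with partial windows."""
    # min_periods=1 lets the first rows use however many prices exist so far.
    roll = close.rolling(window, min_periods=1)
    sma = roll.mean()
    std = roll.std(ddof=0)  # population standard deviation
    return pd.DataFrame({
        'SMA': sma,
        'BBandsHigh': sma + n_std * std,
        'BBandsLow': sma - n_std * std,
    })


# First three closing prices from the table above.
close = pd.Series([40.203072, 40.615520, 41.392178])
indicators = rolling_indicators(close)
print(indicators.round(6))
```

Because the early rows are built from only one or two prices, the bands there are very narrow and the three SMA columns are identical, so these rows add little information for the model.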


In [22]:
data = orig_data.drop(['Dividends', 'Stock Splits'], axis=1)
data = label_OHLC_df(data, period=2)
data

Date,Open,High,Low,Close,Volume,BBandsHigh,BBandsLow,SMA_10,SMA_15,SMA_25
2018-04-04,39.768909,41.488656,39.742377,1.0,138422000,41.722888,39.750959,40.736923,40.736923,40.736923
2018-04-05,41.626139,42.024116,41.505539,1.0,107732800,42.153600,39.791387,40.972493,40.972493,40.972493
2018-04-06,41.237811,41.602020,40.569688,-1.0,140021200,41.995455,39.805777,40.900616,40.900616,40.900616
2018-04-09,40.974904,41.749151,40.967668,-1.0,116070800,41.922965,39.916697,40.919831,40.919831,40.919831
2018-04-10,41.727450,41.968650,41.372887,1.0,113634400,42.153540,39.934099,41.043819,41.043819,41.043819
...,...,...,...,...,...,...,...,...,...,...
2021-03-24,122.820000,122.900002,120.070000,-1.0,88530500,126.956670,117.048329,122.384999,121.521999,122.002499
2021-03-25,119.540001,121.660004,119.000000,-1.0,98844700,126.956050,117.008949,122.247999,121.552666,121.982499
2021-03-26,120.349998,121.480003,118.919998,1.0,93958900,126.955050,117.004948,122.265999,121.538666,121.979999
2021-03-29,121.650002,122.580002,120.730003,1.0,80819200,125.862371,117.457627,122.005999,121.873999,121.659999


In [23]:
rfc = RandomForestClassifier()
search_params = [
    {'n_estimators': [5, 10, 30, 100], 'criterion': ['gini', 'entropy']}, 
]
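# Re-split so that the search is fitted on the enriched features from In [22]
# rather than on the X_train left over from In [14].
X_train, X_test, y_train, y_test = split_train_test(data, split_date)
# `scoring` must be a scorer name or scorer callable, not a raw metric function.
g_search = GridSearchCV(rfc, search_params, scoring='f1_macro', return_train_score=True, n_jobs=-1)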
g_search.fit(X_train, y_train)
optimized = g_search.best_estimator_
test_print_results(optimized)

Model f1 score:  0.3064082686242362

Trading history: 
Initial balance:  4000
Pred: [1.] -- Added stock: AAPL, 59.76424026489258, 13 -- Total balance: 3223.0648765563965
Pred: [-1.] -- Dropped stock: AAPL, 59.76424026489258, 13 -- Total balance: 4000.0
Pred: [-1.] -- Added stock: AAPL, 65.1127700805664, -12 -- Total balance: 4781.353240966797
Pred: [1.] -- Dropped stock: AAPL, 65.1127700805664, -12 -- Total balance: 4000.0
Pred: [-1.] -- Added stock: AAPL, 71.21049499511719, -11 -- Total balance: 4783.315444946289
Pred: [1.] -- Dropped stock: AAPL, 71.21049499511719, -11 -- Total balance: 4000.0
Pred: [-1.] -- Added stock: AAPL, 71.12120056152344, -11 -- Total balance: 4782.333206176758
Pred: [1.] -- Dropped stock: AAPL, 71.12120056152344, -11 -- Total balance: 4000.0
Pred: [-1.] -- Added stock: AAPL, 66.57642364501953, -12 -- Total balance: 4798.917083740234
Pred: [1.] -- Dropped stock: AAPL, 66.57642364501953, -12 -- Total balance: 4000.0
Pred: [1.] -- Added stock: AAPL, 68.228622436

Next, let's experiment with time-series classifiers.