---
# Machine Learning Process
---

![Process Diagram](Data_ML_models_training/Images/ML_process.png)



# Index
---

## Data Fetch

> Jupyter Notebook - [yfinance_data_for_training.ipynb](Data_ML_models_training/yfinance_data_for_training.ipynb)
>
> Jupyter Notebook - [CCXT_data_for_testing.ipynb](Data_ML_models_training/CCXT_data_for_testing.ipynb)
---

## Clean, Prepare & Manipulate Data

> Functions for calculating and adding Technical Indicators - [CronJobs/Utility_Functions/Functions.py](Data_ML_models_training/CronJobs/Utility_Functions/Functions.py)
>
> Computing Target Values, Feature Selection and Pre-processing - [Feature_sel_and_ML_training.ipynb](Data_ML_models_training/Feature_sel_and_ML_training.ipynb)

# Step 1 - Get Data

### Training Data from yfinance

[Jupyter Notebook](Data_ML_models_training/yfinance_data_for_training.ipynb)

In [4]:
# Import libraries and dependencies
import ccxt
import os
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import talib as ta
import yfinance as yf 
import datetime as dt 

from CronJobs.Utility_Functions import Functions

In [5]:
currs_list = [ 'BTC-AUD', 'ETH-AUD', 'XRP-AUD' , 'LTC-AUD', 'ADA-AUD', 'XLM-AUD', 'BCH-AUD']
start_date = '2019-06-01'
end_date = '2021-09-01'
interval = '1h'

df_data = yf.download(currs_list, start= start_date, end= end_date, interval= interval, group_by= 'ticker')

[*********************100%***********************]  7 of 7 completed
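
With `group_by='ticker'`, the downloaded frame has two-level columns of the form (ticker, field). The sketch below shows one possible way to reshape that layout into one row per ticker and timestamp. The helper name `to_long_format` is hypothetical, the `Currency` column name is borrowed from the backtest code further down, and a small synthetic frame stands in for the real download so the example runs offline.

```python
import numpy as np
import pandas as pd

def to_long_format(df_wide):
    """Turn a (ticker, field) column layout into one row per ticker and timestamp."""
    frames = []
    for ticker in df_wide.columns.get_level_values(0).unique():
        # Selecting the first level leaves a plain OHLCV frame for that ticker
        df_ticker = df_wide[ticker].dropna(how='all').copy()
        df_ticker['Currency'] = ticker
        frames.append(df_ticker)
    return pd.concat(frames).sort_index()

# Synthetic stand-in for the yf.download(...) result (assumption, not real prices)
idx = pd.date_range('2021-01-01', periods=3, freq=pd.Timedelta(hours=1))
cols = pd.MultiIndex.from_product(
    [['BTC-AUD', 'ETH-AUD'], ['Open', 'High', 'Low', 'Close', 'Volume']]
)
df_wide = pd.DataFrame(np.arange(30, dtype=float).reshape(3, 10), index=idx, columns=cols)

df_long = to_long_format(df_wide)
print(df_long[['Close', 'Currency']])
```

A long layout like this makes it straightforward to filter one currency at a time, which is how the backtest loop selects its data.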


### Testing Data using CCXT with Kraken as Exchange

[Jupyter Notebook](Data_ML_models_training/CCXT_data_for_testing.ipynb)

In [8]:
currs_list = ['ETH/AUD', 'XRP/AUD', 'LTC/AUD', 'ADA/AUD', 'XLM/AUD', 'BCH/AUD']

dict_ohlcv = {}

for curr in currs_list:
    
    # Call data fetch
    ohlcv = exchange.fetchOHLCV(curr, '1h')

    # Store the values in a dataframe
    df_ohlcv = pd.DataFrame(ohlcv, columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume']).set_index('Date')
    df_ohlcv.index = pd.to_datetime(df_ohlcv.index, unit='ms')

    df_ohlcv.dropna(inplace=True)

    # Store the symbol name and history data in a dict 
    dict_ohlcv[curr] = df_ohlcv 

    start_date = df_ohlcv.index[0].date().isoformat()
    end_date = df_ohlcv.index[-1].date().isoformat()
    num_records = (len(df_ohlcv))
    start_price = df_ohlcv.iloc[0]['Close']
    end_price = df_ohlcv.iloc[-1]['Close']

    print(f'Data summary for {curr}')
    print(f'    Start Date: {start_date}; End Date: {end_date}; Number of records: {num_records}')
    print(f'    Start Price: {start_price}; End Price: {end_price}')
    print(f'Data for {curr} fetched and appended into the dictionary\n')

Data summary for ETH/AUD
    Start Date: 2021-08-06; End Date: 2021-09-05; Number of records: 720
    Start Price: 3796.86; End Price: 5267.66
Data for ETH/AUD fetched and appended into the dictionary

Data summary for XRP/AUD
    Start Date: 2021-08-06; End Date: 2021-09-05; Number of records: 720
    Start Price: 0.99639; End Price: 1.69548
Data for XRP/AUD fetched and appended into the dictionary

Data summary for LTC/AUD
    Start Date: 2021-08-06; End Date: 2021-09-05; Number of records: 720
    Start Price: 189.52; End Price: 292.43
Data for LTC/AUD fetched and appended into the dictionary

Data summary for ADA/AUD
    Start Date: 2021-08-06; End Date: 2021-09-05; Number of records: 720
    Start Price: 1.87313; End Price: 3.87422
Data for ADA/AUD fetched and appended into the dictionary

Data summary for XLM/AUD
    Start Date: 2021-08-06; End Date: 2021-09-05; Number of records: 720
    Start Price: 0.37792; End Price: 0.50057
Data for XLM/AUD fetched and appended into the dict
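
The cell above relies on an `exchange` object created elsewhere in the notebook (for Kraken that would typically be something like `ccxt.kraken()`, which is an assumption here). Later steps read a `Returns` column and a combined frame with a `Currency` column, neither of which is built in the cell shown. A minimal sketch of how the per-symbol dictionary could be merged is given below. The helper name `combine_ohlcv` is hypothetical, and the two hand-made rows only mimic the `[timestamp_ms, open, high, low, close, volume]` layout returned by `fetchOHLCV`.

```python
import pandas as pd

def combine_ohlcv(dict_ohlcv):
    """Merge per-symbol OHLCV frames into one frame with Returns and Currency columns."""
    frames = []
    for symbol, df in dict_ohlcv.items():
        df = df.sort_index().copy()
        df['Returns'] = df['Close'].pct_change()   # simple hourly return, per symbol
        df['Currency'] = symbol
        frames.append(df)
    return pd.concat(frames)

# Hand-made rows (not real market data) in the fetchOHLCV layout
raw = [[1628208000000, 1.0, 1.2, 0.9, 1.0, 10.0],
       [1628211600000, 1.0, 1.3, 1.0, 1.25, 12.0]]
df_demo = pd.DataFrame(
    raw, columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
).set_index('Date')
df_demo.index = pd.to_datetime(df_demo.index, unit='ms')

df_all_data = combine_ohlcv({'XRP/AUD': df_demo})
print(df_all_data[['Close', 'Returns', 'Currency']])
```

Computing the returns inside the loop keeps the first bar of each symbol from being compared with the last bar of the previous one.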

# Step 2 - Clean, Prepare, Manipulate Data and Feature Selection

## Function for calculating and adding Technical Indicators

Function .py file - [CronJobs/Utility_Functions/Functions.py](Data_ML_models_training/CronJobs/Utility_Functions/Functions.py)

Library - TA-lib

Indicators Used:

- Momentum Indicators - SMA, RSI, CCI, MACD
- Trend Strength Indicators - ADX
- Volatility Indicators - ATR, Bollinger Bands
- Volume Indicators - SMA(Volume)

In [9]:
def add_tech_indicators(df, fast, slow):
    
#---------------------------------------------------------------------
# Momentum Indicators
#---------------------------------------------------------------------
# SMA indicators fast and slow
    sma_fast = ta.SMA(df['Close'], timeperiod=fast )
    sma_slow = ta.SMA(df['Close'], timeperiod=slow )
    df['SMA_agg'] = sma_fast / sma_slow 

# RSI Ratio
    rsi_fast = ta.RSI(df['Close'], fast)
    rsi_slow = ta.RSI(df['Close'], slow)
    df['RSI_ratio'] = rsi_fast / rsi_slow

# CCI
    df['CCI'] = ta.CCI(df['High'], df['Low'], df['Close'], fast)

# MACD
# We use MACD_ratio, which is MACD / Signal, multiplied by -1 when MACD is less than 0.
# Assuming MACD and Signal share the same sign, the ratio ranges as follows:
    # ratio < -1, when MACD is below 0 and below the Signal line
    # -1 < ratio < 0, when MACD is above the Signal line but below 0
    # 0 < ratio < 1, when MACD is below the Signal line but above 0
    # 1 < ratio, when MACD is above 0 and above the Signal line
    df['MACD'], df['Signal'], hist = ta.MACD(df['Close'], fastperiod=fast, slowperiod=slow, signalperiod=8) 
    df['MACD_ratio'] =  df['MACD'] / df['Signal']
    df['MACD_ratio'] = df['MACD_ratio'] * df['MACD'] / abs(df['MACD'])          

    df.drop(columns= ['MACD', 'Signal'], inplace = True)

#---------------------------------------------------------------------
# Trend Strength Indicators
#---------------------------------------------------------------------
# ADX
    df['ADX'] = ta.ADX(df['High'], df['Low'], df['Close'], timeperiod= fast)
    df['plus_DI'] = ta.PLUS_DI(df['High'], df['Low'], df['Close'], timeperiod= fast)
    df['minus_DI'] = ta.MINUS_DI(df['High'], df['Low'], df['Close'], timeperiod= fast)
    df['ADX_dirn'] = np.where(df['plus_DI'] > df['minus_DI'], 1.0, 0.0)

    df.drop(columns=['plus_DI', 'minus_DI'], inplace=True)

#---------------------------------------------------------------------
# Volatility Indicators
#---------------------------------------------------------------------
# ATR Ratio: fast / slow. if value is less than 1, the price volatility is slowing
    atr_fast = ta.ATR(df['High'], df['Low'], df['Close'], timeperiod= fast)
    atr_slow = ta.ATR(df['High'], df['Low'], df['Close'], timeperiod= slow)
    df['ATR_ratio'] = atr_fast / atr_slow

# Bollinger Bands: periods = fast; Std.Dev = 1
    df['BBands_high'], middle, df['BBands_low']  = ta.BBANDS(df['Close'], timeperiod= fast, nbdevup= 1, nbdevdn= 1)

    df['BBands_high'] = df['BBands_high'] / df['Close']             # A value less than 1 means price has crossed above the upper band
    df['BBands_low'] = df['Close'] / df['BBands_low']               # A value less than 1 means price has crossed below the lower band

#---------------------------------------------------------------------
# Volume Indicators
#---------------------------------------------------------------------
# SMA of volume, fast and slow
    sma_vol_fast = ta.SMA(df['Volume'], timeperiod=fast )
    sma_vol_slow = ta.SMA(df['Volume'], timeperiod=slow )
    df['SMA_vol_agg'] = sma_vol_fast / sma_vol_slow 

    return df
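
The sign adjustment applied to `MACD_ratio` is easiest to check with numbers. The sketch below reproduces only that arithmetic with NumPy (no TA-Lib call), using one hand-picked MACD and Signal pair per regime. The function name `signed_macd_ratio` is hypothetical, and `np.sign` is used in place of `MACD / abs(MACD)`, which gives the same result whenever MACD is non-zero.

```python
import numpy as np

def signed_macd_ratio(macd, signal):
    """MACD / Signal, negated wherever MACD is below zero."""
    macd = np.asarray(macd, dtype=float)
    signal = np.asarray(signal, dtype=float)
    return (macd / signal) * np.sign(macd)

# One example per regime, with MACD and Signal sharing the same sign
macd = [-2.0, -1.0, 1.0, 2.0]
signal = [-1.0, -2.0, 2.0, 1.0]

ratios = signed_macd_ratio(macd, signal)
print(ratios)   # one value each for: below -1, between -1 and 0, between 0 and 1, above 1
```

When MACD and Signal have opposite signs (around a zero-line crossover) the ratio falls outside these four bands, so the interpretation holds only away from such crossovers.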

## Data manipulation - Computing Target Values

In [None]:
df_data['Target_returns'] = df_data.Returns.shift(-1)
df_data.dropna(inplace=True)
df_data['Buy_or_sell'] = df_data.Target_returns.apply(lambda x: 1 if x > 0 else 0)
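
A small worked example may help show what the shift does: each row's target is the return of the *next* bar, so the label answers "would buying now have paid off one hour later?". The four prices below are made up, and `Returns` is assumed to be the simple percentage change of `Close`.

```python
import pandas as pd

df_demo = pd.DataFrame({'Close': [100.0, 110.0, 99.0, 108.9]})
df_demo['Returns'] = df_demo['Close'].pct_change()          # NaN, +10%, -10%, +10%

# Next bar's return becomes this bar's target
df_demo['Target_returns'] = df_demo['Returns'].shift(-1)

# The first row has no Returns and the last row has no target, so both are dropped
df_demo = df_demo.dropna().copy()
df_demo['Buy_or_sell'] = (df_demo['Target_returns'] > 0).astype(int)
print(df_demo)
```

Only the two middle rows survive: the bar that is followed by a fall is labelled 0 and the bar that is followed by a rise is labelled 1.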

## Feature Selection 

[Jupyter Notebook](Data_ML_models_training/Feature_sel_and_ML_training.ipynb)

Library used - sklearn.feature_selection

Techniques used - SelectKBest(f_classif), VarianceThreshold(threshold of 0.8)

Process - Ran the SelectKBest method and discarded the 3 least relevant features
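
The selection step could look roughly like the sketch below, which keeps 8 of the 11 indicator columns using `SelectKBest` with the `f_classif` score. The data is random and the label is synthetic, so the columns that end up selected here say nothing about the real notebook's result. Only the mechanics are illustrated.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

all_inds = ['SMA_agg', 'RSI_ratio', 'CCI', 'MACD_ratio', 'ADX', 'ADX_dirn',
            'ATR_ratio', 'BBands_high', 'BBands_low', 'SMA_vol_agg', 'Returns']

# Synthetic features and label (assumption): the label leans on two columns
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, len(all_inds))), columns=all_inds)
y = (X['RSI_ratio'] + 0.5 * X['CCI'] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Score every column against the label and keep the 8 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=8).fit(X, y)
best_inds = list(X.columns[selector.get_support()])
dropped = [col for col in all_inds if col not in best_inds]

print('Kept:', best_inds)
print('Dropped:', dropped)
```

A list such as `best_inds` is what the column transformer in the pre-processing step expects.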

## Resampling 

[Jupyter Notebook](Data_ML_models_training/Feature_sel_and_ML_training.ipynb)

Library used - imblearn.combine

Technique used - SMOTEENN

# Step 3 - ML Model Selection, Hypertuning and Training

[Jupyter Notebook](Data_ML_models_training/Feature_sel_and_ML_training.ipynb)

Libraries used - sklearn.model_selection, sklearn.preprocessing, sklearn.compose, sklearn.decomposition, sklearn.pipeline

Models used for evaluation

- SVC
- Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
- AdaBoost Classifier

Steps in Pipeline

- StandardScaler
- Model

Steps in Model Selection

- sklearn.model_selection -> cross_val_score using 'roc_auc' and 'accuracy'
- sklearn.model_selection -> GridSearchCV using 'roc_auc'

## Pre-processing steps

1. Column Transformation using `make_column_transformer` method of sklearn.compose

2. Column Transformer to run StandardScaler on the 8 best indicators determined in Feature Selection

3. Defined pipeline with column transformation as first step

In [None]:
# Column Transformer with the 8 best indicators selected from Feature Selection step
col_transform = make_column_transformer(
    (StandardScaler(), best_inds ),
    remainder='drop'
)
col_transform.fit_transform(X);

In [None]:
# Defining pipeline with the columntransformer and model selected for training
if model_for_testing == 'svc': model = ('svc', SVC() )
elif model_for_testing == 'logreg': model = ('logreg', LogisticRegression( ))
elif model_for_testing == 'dec_tree': model = ('dec_tree', DecisionTreeClassifier())
elif model_for_testing == 'forest': model = ('forest', RandomForestClassifier( ))
elif model_for_testing == 'grad_boost': model = ('grad_boost', GradientBoostingClassifier())
elif model_for_testing == 'ada_boost': model = ('ada_boost', AdaBoostClassifier())

pipe = Pipeline(steps= [('col_transform', col_transform), 
                    # ('pca', pca),
                    model
                    ])

## Model Selection

Steps involved:

1. Ran `cross_val_score` on the models with default parameters and 10-fold cross-validation to get the benchmark scores

2. Ran GridSearchCV on a range of parameters to get the optimal model configuration

In [None]:
# cross-validate the entire process
# thus, preprocessing occurs within each fold of cross-validation
cross_val_roc_auc = cross_val_score(pipe, X, y, cv=10, scoring='roc_auc', n_jobs=20).mean()

cross_val_accuracy = cross_val_score(pipe, X, y, cv=10, scoring='accuracy', n_jobs=20).mean()
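
To make the "preprocessing occurs within each fold" remark concrete, here is a self-contained sketch on synthetic data. It uses 5 folds and a logistic regression purely to keep the run short, and `best_inds` is a hypothetical two-column subset. Because the scaler lives inside the pipeline, it is re-fitted on the training part of every fold and never sees that fold's held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the label depends on one column plus noise
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['SMA_agg', 'RSI_ratio', 'CCI'])
y = (X['SMA_agg'] + rng.normal(scale=0.5, size=300) > 0).astype(int)

best_inds = ['SMA_agg', 'RSI_ratio']   # hypothetical selection; 'CCI' gets dropped
col_transform = make_column_transformer((StandardScaler(), best_inds), remainder='drop')
pipe = Pipeline(steps=[('col_transform', col_transform),
                       ('logreg', LogisticRegression())])

# One ROC AUC value per fold; the mean is the benchmark score
fold_scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(fold_scores.round(3), round(float(fold_scores.mean()), 3))
```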

In [None]:
grid = GridSearchCV(pipe, params, cv=10, scoring='roc_auc', n_jobs=20)
grid.fit(X,y);

print(f'Score: {grid.best_score_}')
print(f'Best params: {grid.best_params_}')
estimator = grid.best_estimator_[model_for_testing]

grid_best_params = str(grid.best_params_)
grid_best_params
gridcv_best_score = grid.best_score_
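
The `params` grid is not shown in the cell above. When the estimator sits inside a `Pipeline`, grid keys take the form `<step name>__<parameter name>`, so the step name chosen for the model matters. The sketch below uses a decision tree with a made-up grid and synthetic data. The actual ranges searched in the notebook are not known here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption)
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list('abcd'))
y = (X['a'] > 0).astype(int)

model_for_testing = 'dec_tree'
pipe = Pipeline(steps=[('scale', StandardScaler()),
                       (model_for_testing, DecisionTreeClassifier(random_state=0))])

# Hypothetical grid: the 'dec_tree__' prefix routes each value to the model step
params = {'dec_tree__max_depth': [1, 2, 3],
          'dec_tree__min_samples_leaf': [1, 5]}

grid = GridSearchCV(pipe, params, cv=3, scoring='roc_auc')
grid.fit(X, y)

# Indexing the fitted pipeline by step name returns the tuned model on its own
estimator = grid.best_estimator_[model_for_testing]
print(grid.best_params_, round(grid.best_score_, 3))
```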

## Fitting the pipeline with the tuned model and saving it with joblib

In [None]:
# Fitting the pipeline, with the tuned model
pipeline = make_pipeline(col_transform, 
            # pca, 
            estimator)
pipeline.fit(X, y)

In [None]:
from joblib import dump, load
from pathlib import Path

filename = Path('Joblibs/' + dt.date.today().isoformat() + '_' + model_for_testing + '_Feat_sel.joblib')
dump(pipeline, filename)
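
Reading the file back is the mirror image of `dump`. The round trip below is a sketch under a few assumptions: a temporary folder stands in for `Joblibs/`, the file name is invented, and a tiny pipeline trained on random data replaces the tuned one. Note that `dump` does not create missing folders, so the target directory is created first.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on random data (assumption)
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Temporary folder in place of 'Joblibs/'; create it before writing
out_dir = Path(tempfile.mkdtemp()) / 'Joblibs'
out_dir.mkdir(parents=True, exist_ok=True)
filename = out_dir / 'demo_logreg_Feat_sel.joblib'

dump(pipeline, filename)
restored = load(filename)

# The restored pipeline should predict exactly like the original
same_predictions = bool((restored.predict(X) == pipeline.predict(X)).all())
print(same_predictions)
```

Since the scaler is saved together with the model, the restored object can be applied to raw indicator columns without repeating the pre-processing by hand.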

# Step 4 - Testing

[Jupyter Notebook](Data_ML_models_training/Backtest_Mach_learn.ipynb)

## Set Control parameters for the backtest and run the predictions

In [None]:
model_for_testing = '2021-09-01_ada_boost_Feat_sel.joblib'

curr_list = ['BTC/AUD', 'ETH/AUD', 'XRP/AUD', 'LTC/AUD', 'ADA/AUD', 'XLM/AUD', 'BCH/AUD']

all_inds = ['SMA_agg', 'RSI_ratio', 'CCI', 'MACD_ratio', 'ADX', 'ADX_dirn', 'ATR_ratio', 'BBands_high', 'BBands_low', 'SMA_vol_agg', 'Returns']

df_cml_rets = pd.DataFrame()
df_hourly_rets = pd.DataFrame()

for curr_tested in curr_list:

    df_testing_subset = df_all_data.loc[ df_all_data.Currency == curr_tested].copy()
    df_testing_subset.sort_index(inplace=True)

    X_test = df_testing_subset.loc[: , all_inds].reset_index(drop=True)   
    y_test = df_testing_subset.loc[:, ['Target_returns', 'Buy_or_sell']].copy()

    # Run the predictions
    df_pred = y_test
    df_pred['Pred_buy_or_sell'] = pipeline.predict(X_test)

    print(f'\nClassification report for {curr_tested}')
    print(classification_report(y_test.Buy_or_sell, df_pred.Pred_buy_or_sell))

    hourly_returns = df_pred['Target_returns'] * df_pred['Pred_buy_or_sell']
    cum_rets = (1 + hourly_returns).cumprod()
    total_returns = round((cum_rets.iloc[-1] - cum_rets.iloc[0]) * 100, 2)

    col_name = 'hourly_rets_'  + (curr_tested.replace('/', '-'))
    df_hourly_rets[col_name] = hourly_returns

    col_name = 'cum_rets_' + (curr_tested.replace('/', '-'))
    df_cml_rets[col_name] = cum_rets
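
The return arithmetic in the loop can be traced by hand on a few numbers. The metrics code reads `hourly_rets_agg` and `cum_rets_agg` from a `df_portfolio_returns` frame whose construction is not shown. The sketch below assumes an equal-weighted average across pairs, which may differ from what the notebook does. The pair names and values are invented.

```python
import pandas as pd

# Made-up next-hour returns and model signals for two hypothetical pairs
target = pd.DataFrame({'AAA-AUD': [0.10, -0.10, 0.05],
                       'BBB-AUD': [0.02, 0.04, -0.03]})
signal = pd.DataFrame({'AAA-AUD': [1, 0, 1],
                       'BBB-AUD': [0, 1, 1]})

# Long when the signal is 1, flat (zero return) when it is 0
df_hourly_rets = target * signal
df_cml_rets = (1 + df_hourly_rets).cumprod()

# Assumed aggregation: equal weight per pair, rebalanced every hour
df_portfolio_returns = pd.DataFrame({'hourly_rets_agg': df_hourly_rets.mean(axis=1)})
df_portfolio_returns['cum_rets_agg'] = (1 + df_portfolio_returns['hourly_rets_agg']).cumprod()

print(df_cml_rets)
print(df_portfolio_returns)
```

For the first pair the strategy skips the losing hour, so the growth factor is 1.10 × 1.00 × 1.05 = 1.155. The strategy is long-only: a 0 prediction means staying out of the market rather than shorting.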

## Evaluate Performance Metrics

In [None]:
metrics = [ 'Annual Return', 'Cumulative Returns', 'Annual Volatility', 'Sharpe Ratio', 'Sortino Ratio']

columns = ['Backtest']

# Initialize the DataFrame with index set to evaluation metrics and column as `Backtest` (just like PyFolio)
portfolio_evaluation_df = pd.DataFrame(index=metrics, columns=columns)

# Calculate cumulative return
portfolio_evaluation_df.loc['Cumulative Returns'] = df_portfolio_returns['cum_rets_agg'].iloc[-1]

# Calculate annualized return
portfolio_evaluation_df.loc['Annual Return'] = ( df_portfolio_returns['hourly_rets_agg'].mean() * 24 * 365 )

# Calculate annual volatility
portfolio_evaluation_df.loc['Annual Volatility'] = ( df_portfolio_returns['hourly_rets_agg'].std() * np.sqrt(24 * 365) )

# Calculate Sharpe Ratio
portfolio_evaluation_df.loc['Sharpe Ratio'] = (
    ( df_portfolio_returns['hourly_rets_agg'].mean() * 24 * 365 ) /
    ( df_portfolio_returns['hourly_rets_agg'].std() * np.sqrt(24 * 365) )
)

# Calculate Downside Return
sortino_ratio_df = df_portfolio_returns[['hourly_rets_agg']].copy()
sortino_ratio_df.loc[:,'Downside Returns'] = 0

target = 0
mask = sortino_ratio_df['hourly_rets_agg'] < target
sortino_ratio_df.loc[mask, 'Downside Returns'] = sortino_ratio_df['hourly_rets_agg']**2

# Calculate Sortino Ratio
down_stdev = np.sqrt(sortino_ratio_df['Downside Returns'].mean()) * np.sqrt(24 * 365)
expected_return = sortino_ratio_df['hourly_rets_agg'].mean() * 24 * 365
sortino_ratio = expected_return/down_stdev

portfolio_evaluation_df.loc['Sortino Ratio'] = sortino_ratio
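
The same calculations can be wrapped in one function, which makes them easy to check on a short series. This is a sketch rather than the notebook's code: the function name `evaluate` is hypothetical, the risk-free rate is taken as zero (as in the cell above), and the four returns are invented. Annualisation uses 24 × 365 periods because crypto markets trade around the clock.

```python
import numpy as np
import pandas as pd

PERIODS_PER_YEAR = 24 * 365   # hourly bars, every day of the year

def evaluate(hourly_rets, target=0.0):
    """Annualised return, volatility, Sharpe and Sortino ratios for hourly returns."""
    ann_return = hourly_rets.mean() * PERIODS_PER_YEAR
    ann_vol = hourly_rets.std() * np.sqrt(PERIODS_PER_YEAR)
    # Downside deviation: squared returns below the target, zeros elsewhere
    downside_sq = hourly_rets.where(hourly_rets < target, 0.0) ** 2
    down_stdev = np.sqrt(downside_sq.mean()) * np.sqrt(PERIODS_PER_YEAR)
    return pd.Series({'Annual Return': ann_return,
                      'Annual Volatility': ann_vol,
                      'Sharpe Ratio': ann_return / ann_vol,
                      'Sortino Ratio': ann_return / down_stdev})

demo = pd.Series([0.01, -0.01, 0.02, 0.0])   # made-up hourly returns
metrics_demo = evaluate(demo)
print(metrics_demo.round(3))
```

The Sortino ratio differs from the Sharpe ratio only in its denominator: it penalises returns below the target and ignores upside swings, so a strategy with large gains and small losses scores higher on Sortino than on Sharpe.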

## Backtest Results

Strategy Performance on individual currency pairs

|![BTC-AUD](Data_ML_models_training/Backtest_results/Results_BTC-AUD.png) | ![ETH-AUD](Data_ML_models_training/Backtest_results/Results_ETH-AUD.png) | ![BCH-AUD](Data_ML_models_training/Backtest_results/Results_BCH-AUD.png)
|---|---|---
|![ADA-AUD](Data_ML_models_training/Backtest_results/Results_ADA-AUD.png) | ![XRP-AUD](Data_ML_models_training/Backtest_results/Results_XRP-AUD.png) | ![XLM-AUD](Data_ML_models_training/Backtest_results/Results_XLM-AUD.png)
|![LTC-AUD](Data_ML_models_training/Backtest_results/Results_LTC-AUD.png)||

Strategy Performance for the portfolio of the 7 currency pairs

Cumulative Returns

> ![Portfolio Returns](Data_ML_models_training/Backtest_results/Portfolio_Results-2021-08-04_to_2021-09-03.png)

Metrics

> ![Portfolio metrics](Data_ML_models_training/Backtest_results/Portfolio_metrics.png)