# Cryptolytic Arbitrage Model Evaluation and Selection

This notebook contains the code and analysis to select models with the best performance for the Cryptolytic project. You can find more information on data processing in this [notebook](https://github.com/Cryptolytic-app/cryptolytic/blob/master/modeling/arbitrage_data_processing.ipynb) and modeling in this [notebook](https://github.com/Cryptolytic-app/cryptolytic/blob/master/modeling/arbitrage_modeling.ipynb).

#### Background on Arbitrage Models
Arbitrage models were created to predict arbitrage 10 minutes before it happens in an active crypto market. To build them, we take every combination of two exchanges that support the same trading pair, engineer technical analysis features for each exchange, merge the two datasets on 'closing_time', engineer further features on the merged data, and create a target that signals an arbitrage opportunity. Each predicted signal carries a direction indicating which exchange the arbitrage runs from and to. An arbitrage signal is only valid when the opportunity lasts at least 30 minutes, because it takes time to move coins from one exchange to the other and complete both trades.

The models predict whether there will be an arbitrage opportunity that starts 10 minutes after the prediction time and lasts for at least 30 minutes, giving a user enough time to execute trades.
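
As a rough illustration of the target described above, the sketch below labels each 5-minute candle from two closing-price series. The function name `label_arbitrage`, the `min_spread_pct` threshold, and the reading of "1" as "buy on exchange 1, sell on exchange 2" are assumptions made for this example. The project's actual feature-engineering code lives in the data processing notebook and may differ.

```python
def label_arbitrage(close_1, close_2, min_spread_pct=0.5, lead=2, hold=6):
    """Label each 5-minute candle with 0, 1 or -1.

    lead=2 candles looks 10 minutes ahead and hold=6 candles requires the
    price gap to persist for 30 minutes. All defaults are assumptions.
    """
    # direction of the price gap at every candle
    direction = []
    for c1, c2 in zip(close_1, close_2):
        spread_pct = (c2 - c1) / c1 * 100
        if spread_pct > min_spread_pct:
            direction.append(1)    # exchange 2 is higher: buy on 1, sell on 2
        elif spread_pct < -min_spread_pct:
            direction.append(-1)   # exchange 1 is higher: buy on 2, sell on 1
        else:
            direction.append(0)

    # a candle is positive only if the gap is open `lead` candles later
    # and keeps the same direction for `hold` consecutive candles
    target = [0] * len(direction)
    for t in range(len(direction) - lead - hold + 1):
        window = direction[t + lead:t + lead + hold]
        if window[0] != 0 and all(d == window[0] for d in window):
            target[t] = window[0]
    return target


close_1 = [100.0] * 10
close_2 = [100.0, 100.0, 101.0, 101.0, 101.0, 101.0, 101.0, 101.0, 100.0, 100.0]
print(label_arbitrage(close_1, close_2))
```

Only the first candle is labeled here, because it is the only one with a full 30-minute gap starting 10 minutes later.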

More than 6,000 model iterations were generated in this notebook, and the best ones were selected for each possible arbitrage combination based on the model selection criteria outlined later in this section. The models were Random Forest classifiers, and the best model parameters varied for each dataset. The data was obtained from the respective exchanges via their APIs, and we did a 70/30 train/test split on 5-minute candlestick data that fell anywhere in the range from June 2015 to October 2019. A 2-week gap was left between the train and test sets to prevent data leakage. The models return 0 (no arbitrage), 1 (arbitrage from exchange 1 to exchange 2) and -1 (arbitrage from exchange 2 to exchange 1).
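
The split with a gap can be sketched as follows. This is a minimal illustration on plain `(timestamp, value)` tuples, not the `ttsplit` helper from `utils`, whose implementation is not shown here. The function name `split_with_gap` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def split_with_gap(rows, train_frac=0.7, gap=timedelta(weeks=2)):
    """Split time-ordered (timestamp, value) rows into train and test.

    The first `train_frac` of the rows form the training set. Test rows
    are kept only if they fall at least `gap` after the last training row.
    """
    rows = sorted(rows, key=lambda r: r[0])
    cutoff = int(len(rows) * train_frac)
    train = rows[:cutoff]
    if not train:
        return [], rows
    test_start = train[-1][0] + gap
    test = [r for r in rows[cutoff:] if r[0] >= test_start]
    return train, test


# one row per day for 100 days
start = datetime(2019, 1, 1)
rows = [(start + timedelta(days=d), d) for d in range(100)]
train, test = split_with_gap(rows)
print(len(train), len(test), test[0][0] - train[-1][0])
```

Rows that fall inside the gap are discarded, so features built from rolling windows near the end of the training period cannot overlap with the test period.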

The profit calculation accounts for exchange fees, as a real trade would. We used mean percent profit as the profitability metric: the average percent profit per arbitrage trade if one were to act on every trade predicted by the model in the testing period, whether the prediction was correct or not.
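
To make the metric concrete, here is a small sketch of a fee-adjusted percent profit and its mean over all predicted trades. The 0.25% fee on each side and the function names are assumptions for illustration. The project's own `profit` and `get_profit` helpers in `utils` may use different fees and different price columns.

```python
def pct_profit(buy_price, sell_price, buy_fee=0.0025, sell_fee=0.0025):
    """Percent profit of buying on one exchange and selling on the other."""
    cost = buy_price * (1 + buy_fee)         # what we pay, fee included
    proceeds = sell_price * (1 - sell_fee)   # what we receive after the fee
    return (proceeds - cost) / cost * 100


def mean_pct_profit(preds, close_1, close_2):
    """Average percent profit over every non-zero prediction."""
    profits = []
    for pred, c1, c2 in zip(preds, close_1, close_2):
        if pred == 1:       # assumed: buy on exchange 1, sell on exchange 2
            profits.append(pct_profit(c1, c2))
        elif pred == -1:    # assumed: buy on exchange 2, sell on exchange 1
            profits.append(pct_profit(c2, c1))
    return sum(profits) / len(profits) if profits else float("nan")


print(round(pct_profit(100.0, 102.0), 4))
print(round(mean_pct_profit([0, 1, -1], [100.0, 100.0, 102.0], [100.0, 102.0, 100.0]), 4))
```

Because every predicted trade is counted, a wrong prediction with no real price gap still pays both fees and pulls the mean down.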

#### Model Evaluation Criteria
- ROC AUC score
- Precision
- Recall
- F1 Score
- Status
- Profit
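
For a three-class target (-1, 0, 1), precision, recall and F1 have to be averaged across classes. The sketch below computes support-weighted averages by hand so the arithmetic is visible. The notebook does not state which averaging mode was used, so treating it as weighted is an assumption, although the recall column matching the accuracy column in the performance tables is consistent with it. scikit-learn's `precision_recall_fscore_support` with `average='weighted'` is the library equivalent.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 over all classes."""
    support = Counter(y_true)      # true rows per class
    predicted = Counter(y_pred)    # predicted rows per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


y_true = [0, 0, 0, 0, 1, 1, -1, -1]
y_pred = [0, 0, 0, 1, 1, 0, -1, -1]
print(weighted_scores(y_true, y_pred))
```

With weighted averaging, recall always equals plain accuracy, so on data dominated by the 0 class these scores say little about the arbitrage classes. That is why the selection step also looks at correct arbitrage counts and profit.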



#### Model Selection
From the 6000+ iterations of models trained, the best models were narrowed down based on the following criteria:
- How often the models predicted arbitrage when it didn't exist (False positives)
- How many times the models predicted arbitrage correctly (True positives)
- How profitable the model was in the real world over the period of the test set.

#### Results and Discussion

For each of the models, show a dataframe of the LR scores, default RF scores, and hyperparameter-tuned RF scores.


There were 21 models that met the thresholds for the model selection criteria (details of these models can be found at the end of this notebook). The final models were all profitable, with gains ranging from 0.2% to 2.3% within the varied testing time periods (Note: the model with >9% mean percent profit was an outlier). Visualizations of how these models performed can be viewed in [this notebook](https://github.com/Lambda-School-Labs/cryptolytic-ds/blob/master/finalized_notebooks/visualization/arb_performance_visualization.ipynb).


#### Directory Structure

```
├── cryptolytic/                        <-- The top-level directory for all arbitrage work
│   ├── modeling/                       <-- Directory for modeling work
│   │      ├──data/                     <-- Directory with subdirectories containing 5 min candle data
│   │      │   ├─ arb_data/             <-- Directory for csv files of arbitrage model training data
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ csv_data/             <-- Directory for csv files after combining datasets and FE pt.2
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ ta_data/              <-- Directory for csv files after FE pt.1 
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ *.zip                 <-- ZIP files of all of the data
│   │      │   
│   │      ├──final_models/             <-- Directory for final models after model selection
│   │      │      └── *.pkl
│   │      │
│   │      ├──model_perf/               <-- Directory for performance csvs after training models
│   │      │      └── *.json
│   │      │
│   │      ├──models/                   <-- Directory for all pickle models
│   │      │      └── *.pkl
│   │      │
│   │      ├─arbitrage_data_processing.ipynb      <-- Notebook for data processing and creating csvs
│   │      │
│   │      ├─arbitrage_modeling.ipynb             <-- Notebook for baseline models and hyperparam tuning
│   │      │
│   │      ├─arbitrage_model_selection.ipynb      <-- Notebook for model selection
│   │      │
│   │      ├─arbitrage_model_evaluation.ipynb     <-- Notebook for final model evaluation
│   │      │
│   │      ├─environment.yml                      <-- yml file to create conda environment
│   │      │
│   │      ├─trade_recommender_models.ipynb       <-- Notebook for trade recommender models

```

## Imports

In [58]:
import ast
import glob
import itertools
import json
import os
import pickle
import shutil
import warnings
from pathlib import Path
from zipfile import ZipFile

import datetime as dt
import numpy as np
import pandas as pd

from ta import add_all_ta_features

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import ParameterGrid

from utils import tbl_stats_headings, tbl_stats_row, print_model_name
from utils import get_higher_closing_price, get_close_shift, get_profit, profit
from utils import create_pg
from utils import confusion_feat
from utils import model_names
from utils import ttsplit
from utils import model_eval
from utils import create_models
# from utils import ALL_FEATURES

warnings.filterwarnings("ignore")

## Data and Models

All the arbitrage datasets that will be used in modeling

In [59]:
arb_data_paths = glob.glob('data/arb_data/*.csv')
print(len(arb_data_paths))

13


In [60]:
pd.read_csv(arb_data_paths[1], index_col=0).head()

Unnamed: 0,open_exchange_1,high_exchange_1,low_exchange_1,close_exchange_1,base_volume_exchange_1,nan_ohlcv_exchange_1,volume_adi_exchange_1,volume_obv_exchange_1,volume_cmf_exchange_1,volume_fi_exchange_1,...,year,month,day,higher_closing_price,pct_higher,arbitrage_opportunity,window_length,arbitrage_opportunity_shift,window_length_shift,target
0,54.82,54.89,54.8,54.82,277.751023,0.0,-202.884436,-3721827.0,0.018414,-4.214039,...,2019,10,7,2,0.072966,0,10,0.0,45.0,0
1,54.82,54.95,54.81,54.94,354.164537,0.0,149.26348,-3721472.0,0.029671,7.992564,...,2019,10,7,1,0.145826,0,15,0.0,50.0,0
2,54.95,54.96,54.78,54.78,500.2785,0.0,-196.708897,-3721973.0,-0.04678,-8.901099,...,2019,10,7,2,0.237313,0,20,0.0,55.0,0
3,54.79,54.79,54.65,54.69,562.410097,0.0,-741.311399,-3722535.0,-0.045211,-52.06139,...,2019,10,7,2,0.402267,0,25,0.0,60.0,0
4,54.65,54.78,54.65,54.76,97.540523,0.0,-173.504844,-3722438.0,-0.031154,8.05476,...,2019,10,7,2,0.127831,0,30,0.0,65.0,0


In [61]:
pkls = glob.glob('models/*.pkl')
len(pkls)

5

## Modeling Functions

#### Functions for calculating profit

#### Function for train/test split

#### Function for performance metrics

In [3]:
def performance_metrics(pkls, features):
    """
    Evaluate each pickled model on its test split.

    Returns a performance DataFrame with one row per model, plus the
    predictions and true labels of the last model evaluated. `features`
    is overwritten for every model by feat_n_params. precision, recall,
    f1_score and support are written as 0 placeholders.
    """
    columns = ['filename', 'model_id', 'parameters',
                'accuracy_score', 'mean_pct_profit',
                'precision', 'recall', 'f1_score',
                'support', 'correct_arb_preds']
    rows = []
    
    for pkl in pkls:
        
        # naming 
        file = '_'.join(pkl.split('/')[1].split('_')[:4])
        filepath = f'data/arb_data/{file}.csv'
        model_id = pkl.split('/')[1].split('.')[0]
        print('model_id:', model_id)
        
        # get features and parameters
        df, features, params = feat_n_params(pkl, filepath)
        
        # train/test split and predict
        X_train, X_test, y_train, y_test = tts(df, features)
        y_preds = predictions(pkl, X_test, y_test)
        
        # calculate stats
        pct_prof_mean, pct_prof_median = profit(X_test, y_preds)
        correct_arb_preds = confusion_feat(y_test, y_preds)
        cl_report = classification_report(y_test, y_preds, output_dict=True)
        print(classification_report(y_test, y_preds))

        # collect one row per model
        rows.append({
            'filename': file,
            'model_id': model_id,
            'parameters': params,
            'accuracy_score': cl_report['accuracy'],
            'mean_pct_profit': pct_prof_mean,
            'precision': 0,
            'recall': 0,
            'f1_score': 0,
            'support': 0,
            'correct_arb_preds': correct_arb_preds
        })
        
    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    perf_df = pd.DataFrame(rows, columns=columns)
        
    return perf_df, y_preds, y_test

arb_data_paths = glob.glob('data/arb_data/*.csv')
pkls = glob.glob('models/*.pkl')
perf_df, y_preds, y_test = performance_metrics(pkls, features)  

NameError: name 'features' is not defined

In [17]:
perf_df

Unnamed: 0,filename,model_id,parameters,accuracy_score,mean_pct_profit,precision,recall,f1_score,support,correct_arb_preds
0,kraken_bitfinex_ltc_btc,kraken_bitfinex_ltc_btc_lr,{},0.99679,-0.52,0,0,0,0,0
1,cbpro_hitbtc_bch_btc,cbpro_hitbtc_bch_btc_lr,{},0.996246,,0,0,0,0,0
2,kraken_hitbtc_ltc_btc,kraken_hitbtc_ltc_btc_auto_15_50,"{'max_features': 'auto', 'max_depth': '15', 'n...",0.998395,,0,0,0,0,0
3,kraken_hitbtc_ltc_btc,kraken_hitbtc_ltc_btc_rf,{},0.998395,,0,0,0,0,0
4,cbpro_bitfinex_bch_btc,cbpro_bitfinex_bch_btc_lr,{},0.989438,-0.01,0,0,0,0,2
5,cbpro_bitfinex_ltc_usd,cbpro_bitfinex_ltc_usd_lr,{},0.109485,0.06,0,0,0,0,6665
6,cbpro_hitbtc_dash_btc,cbpro_hitbtc_dash_btc_lr,{},0.000402,-0.4,0,0,0,0,1
7,cbpro_bitfinex_bch_btc,cbpro_bitfinex_bch_btc_rf,{},0.998538,9.4,0,0,0,0,67
8,cbpro_bitfinex_ltc_usd,cbpro_bitfinex_ltc_usd_rf,{},0.996847,3.09,0,0,0,0,6902
9,cbpro_hitbtc_dash_btc,cbpro_hitbtc_dash_btc_rf,{},0.002012,-0.4,0,0,0,0,1


In [3]:
features = ['close_exchange_1','base_volume_exchange_1', 
                    'nan_ohlcv_exchange_1','volume_adi_exchange_1', 'volume_obv_exchange_1',
                    'volume_cmf_exchange_1', 'volume_fi_exchange_1','volume_em_exchange_1', 
                    'volume_vpt_exchange_1','volume_nvi_exchange_1', 'volatility_atr_exchange_1',
                    'volatility_bbhi_exchange_1','volatility_bbli_exchange_1', 
                    'volatility_kchi_exchange_1', 'volatility_kcli_exchange_1',
                    'volatility_dchi_exchange_1','volatility_dcli_exchange_1',
                    'trend_macd_signal_exchange_1', 'trend_macd_diff_exchange_1', 
                    'trend_adx_exchange_1', 'trend_adx_pos_exchange_1', 
                    'trend_adx_neg_exchange_1', 'trend_vortex_ind_pos_exchange_1', 
                    'trend_vortex_ind_neg_exchange_1', 'trend_vortex_diff_exchange_1', 
                    'trend_trix_exchange_1', 'trend_mass_index_exchange_1', 
                    'trend_cci_exchange_1', 'trend_dpo_exchange_1', 'trend_kst_sig_exchange_1',
                    'trend_kst_diff_exchange_1', 'trend_aroon_up_exchange_1',
                    'trend_aroon_down_exchange_1', 'trend_aroon_ind_exchange_1',
                    'momentum_rsi_exchange_1', 'momentum_mfi_exchange_1',
                    'momentum_tsi_exchange_1', 'momentum_uo_exchange_1',
                    'momentum_stoch_signal_exchange_1', 'momentum_wr_exchange_1', 
                    'momentum_ao_exchange_1', 'others_dr_exchange_1', 'close_exchange_2',
                    'base_volume_exchange_2', 'nan_ohlcv_exchange_2',
                    'volume_adi_exchange_2', 'volume_obv_exchange_2',
                    'volume_cmf_exchange_2', 'volume_fi_exchange_2',
                    'volume_em_exchange_2', 'volume_vpt_exchange_2',
                    'volume_nvi_exchange_2', 'volatility_atr_exchange_2',
                    'volatility_bbhi_exchange_2', 'volatility_bbli_exchange_2',
                    'volatility_kchi_exchange_2', 'volatility_kcli_exchange_2',
                    'volatility_dchi_exchange_2', 'volatility_dcli_exchange_2',
                    'trend_macd_signal_exchange_2',
                    'trend_macd_diff_exchange_2', 'trend_adx_exchange_2',
                    'trend_adx_pos_exchange_2', 'trend_adx_neg_exchange_2',
                    'trend_vortex_ind_pos_exchange_2',
                    'trend_vortex_ind_neg_exchange_2',
                    'trend_vortex_diff_exchange_2', 'trend_trix_exchange_2',
                    'trend_mass_index_exchange_2', 'trend_cci_exchange_2',
                    'trend_dpo_exchange_2', 'trend_kst_sig_exchange_2',
                    'trend_kst_diff_exchange_2', 'trend_aroon_up_exchange_2',
                    'trend_aroon_down_exchange_2',
                    'trend_aroon_ind_exchange_2',
                    'momentum_rsi_exchange_2', 'momentum_mfi_exchange_2',
                    'momentum_tsi_exchange_2', 'momentum_uo_exchange_2',
                    'momentum_stoch_signal_exchange_2',
                    'momentum_wr_exchange_2', 'momentum_ao_exchange_2',
                    'others_dr_exchange_2', 'year', 'month', 'day',
                    'higher_closing_price', 'pct_higher', 
                    'arbitrage_opportunity', 'window_length']

## Model Selection

In [63]:
# note 1 
# the modeling function should have parameters called
# export_preds and export_model that can be set to true or 
# false (default false) so that we can use that later in the 
# evaluation notebook to actually export the preds and models

# note 2
# the modeling function should have a parameter called filename 
# that takes a filename for performance csv otherwise it'll overwrite
# the original when we retrain after model evaluation

# with open ('top_features.txt', 'rb') as fp:
#     top_features = pickle.load(fp)

In [64]:
perf_df = pd.read_csv('model_performance4.csv')
# perf_df = perf_df.rename(columns={'pct_profitMean': 'pct_profit_mean'})
print(perf_df.shape)

# drop_duplicates returns a new frame, so assign it back
perf_df = perf_df.drop_duplicates()
print(len(perf_df))
perf_df.sort_values(by=['csv_name'])

(30, 17)
30


Unnamed: 0,model_id,csv_name,model_label,params,accuracy,precision,recall,f1,pct_profit_mean,pct_profit_median,pct_wrong_0,pct_wrong_1,pct_wrong_neg1,correct_arb,correct_arb_neg1,correct_arb_1,correct_arb_0
25,bitfinex_cbpro_btc_usd_rf_hyper_auto_15_50,bitfinex_cbpro_btc_usd,rf_hyper,"{'max_depth': 15, 'max_features': 'auto', 'n_e...",0.996296,0.9963168,0.996296,0.9962452,2.15,2.0,0.00556,0.0,3.7e-05,27311,26937,374,53661
5,bitfinex_cbpro_btc_usd_rf,bitfinex_cbpro_btc_usd,rf,{},0.997478,0.9974859,0.997478,0.9974299,2.14,1.99,0.003677,0.0,0.000259,27413,27037,376,53655
15,bitfinex_cbpro_btc_usd_lr,bitfinex_cbpro_btc_usd,lr,{},0.334108,0.1116285,0.334108,0.1673454,0.51,-0.06,,,0.665892,27154,27154,0,0
26,cbpro_bitfinex_bch_btc_rf_hyper_auto_15_50,cbpro_bitfinex_bch_btc,rf_hyper,"{'max_depth': 15, 'max_features': 'auto', 'n_e...",0.998863,0.9985389,0.998863,0.9986626,9.24,10.99,0.001142,,0.0,73,73,0,18369
6,cbpro_bitfinex_bch_btc_rf,cbpro_bitfinex_bch_btc,rf,{},0.998538,0.9982148,0.998538,0.9982986,9.4,10.99,0.001468,,0.0,67,67,0,18369
16,cbpro_bitfinex_bch_btc_lr,cbpro_bitfinex_bch_btc,lr,{},0.989438,0.9900136,0.989438,0.9897251,-0.01,-0.41,0.005011,,0.980952,2,2,0,18266
3,cbpro_bitfinex_eth_usd_rf,cbpro_bitfinex_eth_usd,rf,{},0.983494,0.9837253,0.983494,0.9807899,2.5,1.46,0.018655,0.005254,0.0,10451,38,10413,54077
23,cbpro_bitfinex_eth_usd_rf_hyper_auto_15_50,cbpro_bitfinex_eth_usd,rf_hyper,"{'max_depth': 15, 'max_features': 'auto', 'n_e...",0.997028,0.9970376,0.997028,0.9968697,2.36,1.41,0.003516,0.000365,0.0,11288,322,10966,54128
13,cbpro_bitfinex_eth_usd_lr,cbpro_bitfinex_eth_usd,lr,{},0.168127,0.8533067,0.168127,0.04841973,0.13,-0.22,0.0,0.831885,,11030,0,11030,1
0,cbpro_bitfinex_ltc_usd_rf,cbpro_bitfinex_ltc_usd,rf,{},0.996847,0.9968548,0.996847,0.9965437,3.09,3.8,0.002966,0.004782,0.0,6902,242,6660,53791


In [65]:
# Get the best performing models:
#   - filter the performance df on false positives, number of correct
#     arbitrage predictions, and profit
#   - keep every model trained on a dataset that passes the filters
#   - sort by profit

def model_selection(perf_df):
    """Return all rows for datasets that have at least one model passing the filters."""
    # models that predict arb when it's not happening < 30% of the time
    temp_df = perf_df[perf_df['pct_wrong_0'] < 0.30]
    # models that predict > 100 correct arb
    temp_df = temp_df[temp_df['correct_arb'] > 100]
    # models that make > 0.20% mean profit
    temp_df = temp_df[temp_df['pct_profit_mean'] > 0.2]
    
    # keep default lr/rf for each good model
    top_models = temp_df['csv_name'].to_list()
    top_models_df = perf_df[perf_df['csv_name'].isin(top_models)]
    
    # sort
    top_models_df = top_models_df.sort_values(by='pct_profit_mean', ascending=False)
    
    return top_models_df

In [66]:
top_models_df = model_selection(perf_df)
top_models_df = top_models_df.sort_values(by=['csv_name'])
top_models_df


Unnamed: 0,model_id,csv_name,model_label,params,accuracy,precision,recall,f1,pct_profit_mean,pct_profit_median,pct_wrong_0,pct_wrong_1,pct_wrong_neg1,correct_arb,correct_arb_neg1,correct_arb_1,correct_arb_0
25,bitfinex_cbpro_btc_usd_rf_hyper_auto_15_50,bitfinex_cbpro_btc_usd,rf_hyper,"{'max_depth': 15, 'max_features': 'auto', 'n_e...",0.996296,0.996317,0.996296,0.996245,2.15,2.0,0.00556,0.0,3.7e-05,27311,26937,374,53661
5,bitfinex_cbpro_btc_usd_rf,bitfinex_cbpro_btc_usd,rf,{},0.997478,0.997486,0.997478,0.99743,2.14,1.99,0.003677,0.0,0.000259,27413,27037,376,53655
15,bitfinex_cbpro_btc_usd_lr,bitfinex_cbpro_btc_usd,lr,{},0.334108,0.111628,0.334108,0.167345,0.51,-0.06,,,0.665892,27154,27154,0,0
3,cbpro_bitfinex_eth_usd_rf,cbpro_bitfinex_eth_usd,rf,{},0.983494,0.983725,0.983494,0.98079,2.5,1.46,0.018655,0.005254,0.0,10451,38,10413,54077
23,cbpro_bitfinex_eth_usd_rf_hyper_auto_15_50,cbpro_bitfinex_eth_usd,rf_hyper,"{'max_depth': 15, 'max_features': 'auto', 'n_e...",0.997028,0.997038,0.997028,0.99687,2.36,1.41,0.003516,0.000365,0.0,11288,322,10966,54128
13,cbpro_bitfinex_eth_usd_lr,cbpro_bitfinex_eth_usd,lr,{},0.168127,0.853307,0.168127,0.04842,0.13,-0.22,0.0,0.831885,,11030,0,11030,1
0,cbpro_bitfinex_ltc_usd_rf,cbpro_bitfinex_ltc_usd,rf,{},0.996847,0.996855,0.996847,0.996544,3.09,3.8,0.002966,0.004782,0.0,6902,242,6660,53791
20,cbpro_bitfinex_ltc_usd_rf_hyper_auto_15_50,cbpro_bitfinex_ltc_usd,rf_hyper,"{'max_depth': 15, 'max_features': 'auto', 'n_e...",0.997996,0.998006,0.997996,0.99796,3.04,3.71,0.001096,0.009362,0.0,7003,337,6666,53760
10,cbpro_bitfinex_ltc_usd_lr,cbpro_bitfinex_ltc_usd,lr,{},0.109485,0.306656,0.109485,0.021639,0.06,-0.27,0.666667,0.890524,1.0,6665,0,6665,1
1,gemini_cbpro_ltc_btc_rf,gemini_cbpro_ltc_btc,rf,{},0.987882,0.988049,0.987882,0.987624,0.57,0.54,0.013765,0.0,0.0,978,444,534,7093


## Retrain and Export Best Models

In [67]:
# F = open('all_features.txt','r') 
# F.read()

In [79]:
# 2
# function to retrain best models and export preds csv 
#     - uses filename, model_label, and params from filtered perf df
#     - for each row in df:
#         - pass that info into the original function to retrain
#             this part will happen in the modeling function by 
#                 setting export_preds and export_model to true:
#              - merge X_test, y_test, and y_preds into a df
#              - export preds csv into a new folder data/arb_preds_test_data/
#                  - this needs to have some kind of naming convention 
#                      w/ model type bc we need to train models for all 3 sets
#              - export model into models/

# 3
# function to duplicate arb csv's of best models into a new folder
#      - for row in df, move csv to arb_top_data/



def train_best_models(df):
    top_models = df['model_id'].to_list()
    print(top_models)
    for model_id in top_models:
        print(model_id)
        row = df[df['model_id'] == model_id].iloc[0]
        filename = row['csv_name']
        filepath = f'data/arb_data/{filename}.csv'
        model_label = row['model_label']
        params = ast.literal_eval(row['params'])

        # rows without stored params (the default lr and rf models) are
        # skipped, so only tuned models are retrained here
        if not params:
            continue

        # wrap scalar values in lists so params can be used as a param grid
        params = {
            key: value if isinstance(value, list) else [value]
            for key, value in params.items()
        }
        print(params)

        # duplicate csv data to new folder for easy export from sagemaker
        os.makedirs('data/arb_top_data', exist_ok=True)
        shutil.copyfile(filepath, f'data/arb_top_data/{filename}.csv')

        # model type and features
        if model_label == 'lr':
            model = LogisticRegression()
            features = []
        else:
            if model_label == 'rf':
                features = []
            else:
                with open('top_features.txt', 'rb') as fp:
                    features = pickle.load(fp)
            model = RandomForestClassifier(
                n_jobs=-1,
                random_state=42
            )

        # train and export models
        create_models(
            arb_data_paths=[filepath],
            model_type=model,
            features=features,
            param_grid=params,
            filename='top_model_performance.csv',
            export_preds=True,
            export_model=True
        )
            
        
        


In [80]:
train_best_models(top_models_df[11:])

['gemini_cbpro_ltc_btc_lr']
gemini_cbpro_ltc_btc_lr
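
The `params` column is stored as a string, so it has to be parsed and turned back into a grid before retraining. The sketch below shows that round trip with the standard library only. The helper name `expand_params` and the example strings are made up for illustration. In the notebook itself, scikit-learn's `ParameterGrid` (already imported) plays the role of the `itertools.product` step.

```python
import ast
import itertools


def expand_params(params_str):
    """Turn a stored params string into a list of parameter combinations."""
    params = ast.literal_eval(params_str)
    # wrap scalars so every value is a list of candidates
    grid = {k: v if isinstance(v, list) else [v] for k, v in params.items()}
    if not grid:
        return [{}]   # no stored params: one run with model defaults
    keys = sorted(grid)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(grid[k] for k in keys))]


# illustrative strings in the same shape as the params column
print(expand_params("{'max_depth': 15, 'max_features': 'auto', 'n_estimators': 50}"))
print(expand_params("{'max_depth': [10, 15], 'n_estimators': 50}"))
print(expand_params("{}"))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for strings read back from a CSV.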


In [None]:
# 4
# download from sagemaker:
#     - all models
#     - all good arb csv
#     - all arb preds csv
#     - performance csv

## Visualizations

In [10]:
# 5
# function to create visualization (for only one model set, 1 viz):
#         - takes the base csv_name for that model set and finds the 
#             3 matching csvs in arb_preds_test_data
#         - creates visualization that has 4 lines (trading 10K):
#             - cumulative value if holding bitcoin in that time period
#             - cumulative value if trading on arbitrage preds from best model
#             - cumulative value if trading on arbitrage preds from rf default
#             - cumulative value if trading on arbitrage preds from lr default
#         - display the visualization
#         - export the visualization into assets/visualizations/
#         - doesnt need to return anything

        
# 6       
# function to create the viz for all model sets:
#         - iterate through each row in performance df 
#             - define base model
#         - call visualization function for that base model


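import matplotlib.pyplot as plt


def arb_viz(csv_name):
    
    # data
    preds_lr = f'data/arb_preds_test_data/{csv_name}_lr.csv'
    preds_rf = f'data/arb_preds_test_data/{csv_name}_rf.csv'
    # read_csv does not expand wildcards, so resolve the tuned model's file with glob
    hyper_matches = glob.glob(f'data/arb_preds_test_data/{csv_name}_*_*_*.csv')
    if not hyper_matches:
        raise FileNotFoundError(f'no tuned-model predictions found for {csv_name}')
    preds_rf_hyper = hyper_matches[0]
    # dfs 
    preds_lr_df = pd.read_csv(preds_lr)
    preds_rf_df = pd.read_csv(preds_rf)
    preds_rf_hyper_df = pd.read_csv(preds_rf_hyper)

    # code for visualization using the 3 csv
    
    # display viz
    
    # export viz
    plt.savefig(f'{csv_name}_viz.png')
    

def create_arb_viz(df):
    # to_list is a method, so it has to be called
    csv_list = set(df['csv_name'].to_list())
    
    for csv_name in csv_list:
        arb_viz(csv_name)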
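
The four lines planned for the visualization all come down to a cumulative value series starting from $10K. A minimal sketch of the two underlying calculations is below. The function names are hypothetical, and the assumption that the full balance is reinvested in every trade is a simplification, not something the notebook specifies.

```python
def cumulative_value(pct_returns, start_value=10_000.0):
    """Compound a sequence of per-trade percent returns from a starting balance."""
    values = []
    value = start_value
    for pct in pct_returns:
        value *= 1 + pct / 100   # assumed: whole balance goes into each trade
        values.append(value)
    return values


def hold_value(prices, start_value=10_000.0):
    """Value over time of buying at the first price and holding."""
    units = start_value / prices[0]
    return [units * p for p in prices]


print(cumulative_value([2.0, -0.5, 1.0]))
print(hold_value([100.0, 110.0, 90.0]))
```

Each list can be passed straight to `plt.plot` to draw one line per strategy on a shared axis.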
    

In [23]:
ALL_FEATURES = ['close_exchange_1','base_volume_exchange_1', 
                    'nan_ohlcv_exchange_1','volume_adi_exchange_1', 'volume_obv_exchange_1',
                    'volume_cmf_exchange_1', 'volume_fi_exchange_1','volume_em_exchange_1', 
                    'volume_vpt_exchange_1','volume_nvi_exchange_1', 'volatility_atr_exchange_1',
                    'volatility_bbhi_exchange_1','volatility_bbli_exchange_1', 
                    'volatility_kchi_exchange_1', 'volatility_kcli_exchange_1',
                    'volatility_dchi_exchange_1','volatility_dcli_exchange_1',
                    'trend_macd_signal_exchange_1', 'trend_macd_diff_exchange_1', 
                    'trend_adx_exchange_1', 'trend_adx_pos_exchange_1', 
                    'trend_adx_neg_exchange_1', 'trend_vortex_ind_pos_exchange_1', 
                    'trend_vortex_ind_neg_exchange_1', 'trend_vortex_diff_exchange_1', 
                    'trend_trix_exchange_1', 'trend_mass_index_exchange_1', 
                    'trend_cci_exchange_1', 'trend_dpo_exchange_1', 'trend_kst_sig_exchange_1',
                    'trend_kst_diff_exchange_1', 'trend_aroon_up_exchange_1',
                    'trend_aroon_down_exchange_1', 'trend_aroon_ind_exchange_1',
                    'momentum_rsi_exchange_1', 'momentum_mfi_exchange_1',
                    'momentum_tsi_exchange_1', 'momentum_uo_exchange_1',
                    'momentum_stoch_signal_exchange_1', 'momentum_wr_exchange_1', 
                    'momentum_ao_exchange_1', 'others_dr_exchange_1', 'close_exchange_2',
                    'base_volume_exchange_2', 'nan_ohlcv_exchange_2',
                    'volume_adi_exchange_2', 'volume_obv_exchange_2',
                    'volume_cmf_exchange_2', 'volume_fi_exchange_2',
                    'volume_em_exchange_2', 'volume_vpt_exchange_2',
                    'volume_nvi_exchange_2', 'volatility_atr_exchange_2',
                    'volatility_bbhi_exchange_2', 'volatility_bbli_exchange_2',
                    'volatility_kchi_exchange_2', 'volatility_kcli_exchange_2',
                    'volatility_dchi_exchange_2', 'volatility_dcli_exchange_2',
                    'trend_macd_signal_exchange_2',
                    'trend_macd_diff_exchange_2', 'trend_adx_exchange_2',
                    'trend_adx_pos_exchange_2', 'trend_adx_neg_exchange_2',
                    'trend_vortex_ind_pos_exchange_2',
                    'trend_vortex_ind_neg_exchange_2',
                    'trend_vortex_diff_exchange_2', 'trend_trix_exchange_2',
                    'trend_mass_index_exchange_2', 'trend_cci_exchange_2',
                    'trend_dpo_exchange_2', 'trend_kst_sig_exchange_2',
                    'trend_kst_diff_exchange_2', 'trend_aroon_up_exchange_2',
                    'trend_aroon_down_exchange_2',
                    'trend_aroon_ind_exchange_2',
                    'momentum_rsi_exchange_2', 'momentum_mfi_exchange_2',
                    'momentum_tsi_exchange_2', 'momentum_uo_exchange_2',
                    'momentum_stoch_signal_exchange_2',
                    'momentum_wr_exchange_2', 'momentum_ao_exchange_2',
                    'others_dr_exchange_2', 'year', 'month', 'day',
                    'higher_closing_price', 'pct_higher', 
                    'arbitrage_opportunity', 'window_length']

In [24]:
print(ALL_FEATURES)

['close_exchange_1', 'base_volume_exchange_1', 'nan_ohlcv_exchange_1', 'volume_adi_exchange_1', 'volume_obv_exchange_1', 'volume_cmf_exchange_1', 'volume_fi_exchange_1', 'volume_em_exchange_1', 'volume_vpt_exchange_1', 'volume_nvi_exchange_1', 'volatility_atr_exchange_1', 'volatility_bbhi_exchange_1', 'volatility_bbli_exchange_1', 'volatility_kchi_exchange_1', 'volatility_kcli_exchange_1', 'volatility_dchi_exchange_1', 'volatility_dcli_exchange_1', 'trend_macd_signal_exchange_1', 'trend_macd_diff_exchange_1', 'trend_adx_exchange_1', 'trend_adx_pos_exchange_1', 'trend_adx_neg_exchange_1', 'trend_vortex_ind_pos_exchange_1', 'trend_vortex_ind_neg_exchange_1', 'trend_vortex_diff_exchange_1', 'trend_trix_exchange_1', 'trend_mass_index_exchange_1', 'trend_cci_exchange_1', 'trend_dpo_exchange_1', 'trend_kst_sig_exchange_1', 'trend_kst_diff_exchange_1', 'trend_aroon_up_exchange_1', 'trend_aroon_down_exchange_1', 'trend_aroon_ind_exchange_1', 'momentum_rsi_exchange_1', 'momentum_mfi_exchange_1'