# Cryptolytic Arbitrage Modeling

This notebook contains the code to create the arbitrage models used in the Cryptolytic project. You can find more information on data processing in this [notebook](link) and model evaluation in this [notebook](link).

#### Background on Arbitrage Models
Arbitrage models were created with the goal of predicting arbitrage 10 min before it happens in an active crypto market. The models are generated by getting all of the combinations of 2 exchanges that support the same trading pair, engineering technical analysis features, merging that data on 'closing_time', engineering more features, and creating a target that signals an arbitrage opportunity. Arbitrage signals predicted by the model have a direction indicating which direction the arbitrage occurs in. A valid arbitrage signal is when the arbitrage lasts >30 mins because it takes time to move coins from one exchange to the other in order to successfully complete the arbitrage trades.

The models predict whether there will be an arbitrage opportunity that starts 10 mins after the prediction time and lasts for at least 30 mins, giving a user enough times to execute trades.

#### Baseline Logistic Regression

#### Baseline Random Forest with default parameters

#### Feature Selection

#### Random Forest with hyperparameter tuning

More than 6000+ iterations of models were generated in this notebook and the best ones were selected from each possible arbitrage combination based on model selection criteria outlined later in this section. The models were Random Forest Classifier and the best model parameters varied for each dataset. The data was obtained from the respective exchanges via their api, and we did a 70/30 train/test split on 5 min candlestick data that fell anywhere in the range from Jun 2015 - Oct 2019. There was a 2 week gap left between the train and test sets to prevent data leakage. The models return 0 (no arbitrage), 1 (arbitrage from exchange 1 to exchange 2) and -1 (arbitrage from exchange 2 to exchange 1). 

The profit calculation incorporated fees like in the real world. We used mean percent profit as the profitability metric which represented the average percent profit per arbitrage trade if one were to act on all trades predicted by the model in the testing period, whether those predictions were correct or not.

From the 6000+ iterations of models trained, the best models were narrowed down based on the following criteria:
- How often the models predicted arbitrage when it didn't exist (False positives)
- How many times the models predicted arbitrage correctly (True positives)
- How profitable the model was in the real world over the period of the test set. 

There were 21 models that met the thresholds for model selection critera (details of these models can be found at the end of this nb). The final models were all profitable with gains anywhere from 0.2% - 2.3% within the varied testing time periods (Note: the model with >9% mean percent profit was an outlier). Visualizations for how these models performed can be viewed at https://github.com/Lambda-School-Labs/cryptolytic-ds/blob/master/finalized_notebooks/visualization/arb_performance_visualization.ipynb

\* It is HIGHLY recommended to run this on sagemaker and split the training work onto 4 notebooks. These functions will take over a day to run if not split up. There are 95 total options for models, 75 of those options have enough data to train models, and with different options for parameters around ~6K models will be trained. After selecting for the best models, there were 21 that met the criteria to be included in this project.

\*** There has been some feature selection done in this process where we removed highly correlated features, but not enough. There should be more exploration into whether removing features improves accuracy. 

\**** We haven't tried normalizing the dataset to see if it will improve accuracy, but that should be a top priority to anyone continuing this project

#### Directory Structure

```
├── cryptolytic/                        <-- The top-level directory for all arbitrage work
│   ├── modeling/                       <-- Directory for modeling work
│   │      ├──data/                     <-- Directory with subdirectories containing 5 min candle data
│   │      │   ├─ arb_data/             <-- Directory for csv files of arbitrage model training data
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ csv_data/             <-- Directory for csv files after combining datasets and FE pt.2
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ ta_data/              <-- Directory for csv files after FE pt.1 
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ *.zip                 <-- ZIP files of all of the data
│   │      │   
│   │      ├──final_models/             <-- Directory for final models after model selection
│   │      │      └── *.pkl
│   │      │
│   │      ├──model_perf/               <-- Directory for performance csvs after training models
│   │      │      └── *.json
│   │      │
│   │      ├──models/                   <-- Directory for all pickle models
│   │      │      └── *.pkl
│   │      │
│   │      ├─arbitrage_data_processing.ipynb      <-- Notebook for data processing and creating csvs
│   │      │
│   │      ├─arbitrage_modeling.ipynb             <-- Notebook for baseline models and hyperparam tuning
│   │      │
│   │      ├─arbitrage_model_selection.ipynb      <-- Notebook for model selection
│   │      │
│   │      ├─arbitrage_model_evaluation.ipynb     <-- Notebook for final model evaluation
│   │      │
│   │      ├─environment.yml                      <-- yml file to create conda environment
│   │      │
│   │      ├─trade_recommender_models.ipynb       <-- Notebook for trade recommender models

```

## Imports

In [3]:
import glob
import os
import pickle
import json
import itertools
from zipfile import ZipFile
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import datetime as dt

from ta import add_all_ta_features

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import precision_score, recall_score, classification_report, roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

## Data

All the arbitrage datasets that will be used in modeling

In [5]:
arb_data_paths = glob.glob('data/arb_data/*.csv')
print(len(arb_data_paths))

0


In [6]:
pd.read_csv(arb_data_paths[0], index_col=0).head()

IndexError: list index out of range

## Modeling Functions

#### Features

Note: closing_time feature is being removed before modeling

In [200]:
features = ['close_exchange_1','base_volume_exchange_1', 
            'nan_ohlcv_exchange_1','volume_adi_exchange_1', 'volume_obv_exchange_1',
            'volume_cmf_exchange_1', 'volume_fi_exchange_1','volume_em_exchange_1', 
            'volume_vpt_exchange_1','volume_nvi_exchange_1', 'volatility_atr_exchange_1',
            'volatility_bbhi_exchange_1','volatility_bbli_exchange_1', 
            'volatility_kchi_exchange_1', 'volatility_kcli_exchange_1',
            'volatility_dchi_exchange_1','volatility_dcli_exchange_1',
            'trend_macd_signal_exchange_1', 'trend_macd_diff_exchange_1', 
            'trend_adx_exchange_1', 'trend_adx_pos_exchange_1', 
            'trend_adx_neg_exchange_1', 'trend_vortex_ind_pos_exchange_1', 
            'trend_vortex_ind_neg_exchange_1', 'trend_vortex_diff_exchange_1', 
            'trend_trix_exchange_1', 'trend_mass_index_exchange_1', 
            'trend_cci_exchange_1', 'trend_dpo_exchange_1', 'trend_kst_sig_exchange_1',
            'trend_kst_diff_exchange_1', 'trend_aroon_up_exchange_1',
            'trend_aroon_down_exchange_1', 'trend_aroon_ind_exchange_1',
            'momentum_rsi_exchange_1', 'momentum_mfi_exchange_1',
            'momentum_tsi_exchange_1', 'momentum_uo_exchange_1',
            'momentum_stoch_signal_exchange_1', 'momentum_wr_exchange_1', 
            'momentum_ao_exchange_1', 'others_dr_exchange_1', 'close_exchange_2',
            'base_volume_exchange_2', 'nan_ohlcv_exchange_2',
            'volume_adi_exchange_2', 'volume_obv_exchange_2',
            'volume_cmf_exchange_2', 'volume_fi_exchange_2',
            'volume_em_exchange_2', 'volume_vpt_exchange_2',
            'volume_nvi_exchange_2', 'volatility_atr_exchange_2',
            'volatility_bbhi_exchange_2', 'volatility_bbli_exchange_2',
            'volatility_kchi_exchange_2', 'volatility_kcli_exchange_2',
            'volatility_dchi_exchange_2', 'volatility_dcli_exchange_2',
            'trend_macd_signal_exchange_2',
            'trend_macd_diff_exchange_2', 'trend_adx_exchange_2',
            'trend_adx_pos_exchange_2', 'trend_adx_neg_exchange_2',
            'trend_vortex_ind_pos_exchange_2',
            'trend_vortex_ind_neg_exchange_2',
            'trend_vortex_diff_exchange_2', 'trend_trix_exchange_2',
            'trend_mass_index_exchange_2', 'trend_cci_exchange_2',
            'trend_dpo_exchange_2', 'trend_kst_sig_exchange_2',
            'trend_kst_diff_exchange_2', 'trend_aroon_up_exchange_2',
            'trend_aroon_down_exchange_2',
            'trend_aroon_ind_exchange_2',
            'momentum_rsi_exchange_2', 'momentum_mfi_exchange_2',
            'momentum_tsi_exchange_2', 'momentum_uo_exchange_2',
            'momentum_stoch_signal_exchange_2',
            'momentum_wr_exchange_2', 'momentum_ao_exchange_2',
            'others_dr_exchange_2', 'year', 'month', 'day',
            'higher_closing_price', 'pct_higher', 
            'arbitrage_opportunity', 'window_length']

#### Functions for print statements

In [9]:
line = '-------------'
sp = '      '

def tbl_stats_headings():
    """Prints the headings for the stats table"""
    print(sp*2, line*9, '\n', 
          sp*3, 'Accuracy Score', 
#           sp, 'True Positive Rate',
          sp, 'False Postitive Rate', 
          sp, 'Precision',
          sp, 'Recall',
          sp, 'F1', '\n',
          sp*2, line*9, '\n', 
    )
    
def tbl_stats_row(test_accuracy, fpr, precision, recall, f1):
    """Prints the row of model stats after each param set fold"""
    print(
        sp*4, f'{test_accuracy:.4f}',     # accuracy
#         sp*3, f'{tpr:.4f}',           # roc auc
        sp*3, f'{fpr:.4f}',      # p/r auc
        sp*2, f'{precision:.4f}',      # p/r auc
        sp*1, f'{recall:.4f}',      # p/r auc
        sp*1, f'{f1:.4f}',     # p/r auc
        sp*2, line*9
    )

def print_model_name(name, i, arb_data_paths):
    print(
    line*9, '\n\n', 
    f'Model {i+1}/{len(arb_data_paths)}: {name}', '\n', 
    line*9
    )

def print_model_params(i, params, pg_list):  
    print(
        line*9, '\n', 
        f'Model {i+1} / {len(pg_list)}', '\n',  
        f'params={params if params else None}', '\n', 
        line*9
    )

#### Function for modeling

In [312]:
def create_models(arb_data_paths, model_type, features, param_grid):
    """
    This function takes in a list of all the arbitrage data paths, 
    does train/test split, feature selection, trains models, 
    saves the pickle file, gets performance stats for the model, 
    and returns a dataframe of performance stats and a dictionary
    of confusion matrices for each model 

    Predictions
    ___________
    
    Models predict whether arbitrage is occuring at a given time:
    1: Active
    0: Inactive
    -1: 
    
    Evaluation
    __________
    
    - Cross Validation
    - Accuracy Score
    - ROC AUC
    - Precision/Recall AUC
    - ROC curves
    - Precision/Recall curves
    - Mean Percent Profit

    Parameters
    __________
    
    df: a dataframe with columns=['CANONICAL_SMILES', 'MOD', 
                                  'DOF_IC50_uM', 'Class', 'Source']
    param_grid: a dict of hyperparameters for RandomForestClassifier
    pc: number of principal components to use in training
    n_splits: number of cross validation folds (default=5)
    """
    
    counter = 0
    line = '---------------'
    
    base_model_name = str(model_type).split('(')[0]
    model_name_dict = {
        'LogisticRegression':'lr',
        'RandomForestClassifier':'rf'
    }
    
    # this is in case the function stops running you can pick up where you left off
    # get all model paths into a variable
    model_paths = glob.glob('models2/*.pkl')
    
    # pick target
    target = 'target'
    
    # iterate through the arbitrage csvs
    for i, file in enumerate(arb_data_paths[:3]):
        
        # define model name
        name = file.split('/')[2][:-8]
        
        # print status
        print_model_name(name, i, arb_data_paths)

        # read csv
        df = pd.read_csv(file, index_col=0)
        
        # convert str closing_time to datetime
        df['closing_time'] = pd.to_datetime(df['closing_time']) 
        
#         X = df[features]
#         y = df[target]
        
        # baseline
        if not param_grid:
            pg_list = [param_grid]
        # create parameter grid
        else:
            pg_list = list(ParameterGrid(param_grid))
                
        # cv = TimeSeriesSplit(n_splits)
        
        for i, params in enumerate(pg_list):    
            # define model 
            # need if else
            if param_grid:
                model_name = name + '_' + str(max_features) + '_' + str(max_depth) + '_' + str(n_estimators)
            else:
                model_name = name + '_' + model_name_dict[base_model_name]

            # define model filename to check if it exists
            model_path = f'models/{model_name}.pkl'

            # if the model does not exist
            if model_path not in model_paths:
                
                # print status
                print_model_params(i, params, pg_list)
                
                # 70/30 train/test split
                test_train_split_row = round(len(df)*.7)

                # get closing_time for t/t split
                test_train_split_time = df['closing_time'][test_train_split_row]

                # remove 2 weeks from train datasets to create a  
                # two week gap between the data - prevents data leakage
                train_cutoff_time = test_train_split_time - dt.timedelta(days=14)
                print('cutoff time:', train_cutoff_time)

                # train and test subsets
                train = df[df['closing_time'] < train_cutoff_time]
                test = df[df['closing_time'] > train_cutoff_time]
        

                # get closing_time for t/t split
                # remove 2 weeks from train datasets to create a 
                # two week gap between the data - prevents data leakage
#                 train_cutoff_time = X_train['closing_time'].iloc[-1] - dt.timedelta(days=14)
#                 print('cutoff time:', train_cutoff_time)

#                 # train and test subsets
#                 X_train = X_train[X_train['closing_time'] < train_cutoff_time]
#                 y_train = y_train[:X_train.shape[0]]

                # X, y matrix
                X_train = train[features]
                X_test = test[features]
                y_train = train[target]
                y_test = test[target]

#                 X_train = X_train.drop(columns='closing_time')
#                 X_test = X_test.drop(columns='closing_time')

                # printing shapes to track progress
                print('train and test shape: ', train.shape, test.shape)

                # filter out datasets that are too small
                if (X_train.shape[0] > 1000) and (X_test.shape[0] > 100):

                    # instantiate model
                    model = model_type.set_params(**params)

                    # there was a weird error caused by two of the datasets which
                    # is why this try/except is needed to keep the function running
#                         try:

                    # fit model
                    model = model.fit(X_train, y_train)
                    print('model fitted!')

                    # make predictions
                    y_preds = model.predict(X_test)
                    print('predictions made!')

                    # test accuracy
                    score = accuracy_score(y_test, y_preds)
                    print('test accuracy:', score)

#                     print(classification_report(y_test, y_preds))
                    # fpr
#                     print(confusion_matrix(y_test, y_preds))
                    fpr = confusion_matrix(y_test, y_preds)[0][2]
                    
                    
                    # precision, recall, f1 score, supp
                    precision, recall, f1, supp = precision_recall_fscore_support(y_test, y_preds, average='weighted')

                    print(roc_auc_score(y_test, y_preds))
                    
                    # save model
                    pickle.dump(model, open('models/{model_name}.pkl'.format(
                                model_name=model_name), 'wb'))
                    print('pickle saved!'.format(model_name=model_name))

#                         except:
#                             print(line*3 + '\n' + line + 'ERROR' + line + '\n' + line*3)
#                             break # break out of for loop if there is an error with modeling


                # dataset is too small
                else:
                    fpr, precision, recall, f1 = .00001, .00001, .00001, .00001
                    print(f'ERROR: dataset too small for {name}')
                
                # print status
                tbl_stats_headings()
                tbl_stats_row(score, fpr, precision, recall, f1)

            # if the model exists
            else:
                print(f'Model {i}/{len(arb_data_paths)} already exists.')
       
        
        # update count
        # TODO: make a better counter that is actually useful in 
        # showing how much is left 
        counter += 1
        print(counter, '\n')

In [313]:
create_models(arb_data_paths=arb_data_paths, model_type=LogisticRegression(), features=features, param_grid={})

--------------------------------------------------------------------------------------------------------------------- 

 Model 1/95: kraken_bitfinex_bch 
 ---------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------- 
 Model 1 / 1 
 params=None 
 ---------------------------------------------------------------------------------------------------------------------
cutoff time: 2019-10-11 07:00:00
train and test shape:  (2938, 141) (7018, 141)
model fitted!
predictions made!
test accuracy: 0.9955827871188373
0.37692472198460225
pickle saved!
             --------------------------------------------------------------------------------------------------------------------- 
                    Accuracy Score        False Postitive Rate        Precision        Recall        F1 
              ------------------------------

ValueError: multiclass format is not supported

In [236]:
def preformance_metric():
    ############## Performance metrics ###############
    # TODO: put this all in a function and just return the 
    # metrics we want

    performance_list = []
    confusion_dict = {}
    
    
    # labels for confusion matrix
    unique_y_test = y_test.unique().tolist()
    unique_y_preds = list(set(y_preds))
    labels = list(set(unique_y_test + unique_y_preds))
    labels.sort()
    columns = [f'Predicted {label}' for label in labels]
    index = [f'Actual {label}' for label in labels]

    # create confusion matrix
    confusion = pd.DataFrame(confusion_matrix(y_test, y_preds),
                             columns=columns, index=index)
    print(model_name + ' confusion matrix:')
    print(confusion, '\n')

    # append to confusion list
    confusion_dict[model_name] = confusion

    # creating dataframe from test set to calculate profitability
    test_with_preds = X_test.copy()

    # add column with higher closing price
    test_with_preds['higher_closing_price'] = test_with_preds.apply(
            get_higher_closing_price, axis=1)

    # add column with shifted closing price
    test_with_preds = get_close_shift(test_with_preds)

    # adding column with predictions
    test_with_preds['pred'] = y_preds

    # adding column with profitability of predictions
    test_with_preds['pct_profit'] = test_with_preds.apply(
            get_profit, axis=1).shift(-2)

    # filtering out rows where no arbitrage is predicted
    test_with_preds = test_with_preds[test_with_preds['pred'] != 0]

    # calculating mean profit where arbitrage predicted...
    pct_profit_mean = test_with_preds['pct_profit'].mean()

    # calculating median profit where arbitrage predicted...
    pct_profit_median = test_with_preds['pct_profit'].median()
    print('percent profit mean:', pct_profit_mean)
    print('percent profit median:', pct_profit_median, '\n\n')

    # save net performance to list
    performance_list.append([name, max_features, max_depth, n_estimators,
                             pct_profit_mean, pct_profit_median])
    ######################## END OF TODO ###########################
    
    
    

## Baseline

#### Logistic Regression Models

In [None]:
create_models(arb_data_paths=arb_data_paths, model_type=LogisticRegression(), features=features, param_grid={})

#### Random Forest Models w/ default parameters

In [None]:
create_models(arb_data_paths=arb_data_paths, model_type=LogisticRegression(), features=features, param_grid={})

#### Feature Importances

In [None]:
create_models(arb_data_paths=arb_data_paths, model_type=LogisticRegression(), features=features, param_grid={})

In [None]:

importances = pd.Series(xg.feature_importances_, X_train.columns)
n = 25
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='blue');

## Hyperparameter Tuning

In [11]:
param_grid = {
    'max_features': ['auto', 42, 44, 46],
    'n_estimators': [250, 300],
    'max_depth': [30, 35, 45, 50]
}

create_models(
    arb_data_paths=arb_data_paths, 
    model_type=LogisticRegression(), 
    features=features, param_grid
)