# Cryptolytic Data Processing

This notebook contains the code to generate the data that is used to create the arbitrage models in this [notebook](https://github.com/Cryptolytic-app/cryptolyticapp/blob/master/modeling/2_arbitrage_model_training.ipynb).

<img src="https://github.com/Cryptolytic-app/cryptolyticapp/blob/master/assets/cryptolytic_thumbnail.png?raw=true"
     alt="drawing"
     width="500"/>
     
#### What is arbitrage?
Arbitrage occurs when there is a price difference between the same asset in two different markets. So with crypto, it’s possible to have the same coin priced differently on separate exchanges. For example with bitcoin, you might have bitcoin priced at &#0036;8,000 on one exchange, and at the same time that bitcoin can be priced at &#0036;8,100 on another exchange. You can buy the bitcoin on the first exchange for &#0036;8,000, send it to the other exchange, and sell it for &#0036;8,100. Now you’ve made &#0036;100 in profit and you can repeat this process as long as that arbitrage opportunity lasts.
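
To make the arithmetic concrete, here is a minimal sketch of the example above. It subtracts the 0.55% total fee that this notebook assumes later when labeling arbitrage; the helper name `net_arbitrage_pct` is hypothetical and not part of the project code.

```python
def net_arbitrage_pct(buy_price, sell_price, fee_pct=0.55):
    """Percentage gain from buying at buy_price and selling at sell_price,
    after subtracting an assumed total fee (in percent)."""
    gross_pct = (sell_price / buy_price - 1) * 100
    return gross_pct - fee_pct


gross = (8100 / 8000 - 1) * 100        # price difference before fees, in percent
net = net_arbitrage_pct(8000, 8100)    # price difference after the assumed fee
print(f"gross: {gross:.2f}%, net: {net:.2f}%")
```

The gross gain only counts as an arbitrage opportunity if it still covers the fees, which is why the labeling step later compares the price difference against the fee threshold.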

#### Background on arbitrage models
There are many different combinations of arbitrage that could be occurring at any given moment among all the cryptocurrency exchanges. Our goal was to capture as many of these as possible in order to create an API that provides predictions for any arbitrage opportunities that will occur in the next 10 mins. This API could then serve as the backend for a web application that displays the predictions in a more user-friendly format.

The arbitrage models in this notebook predict arbitrage 10 min before it happens, lasting for at least 30 mins. It's important that the arbitrage window lasts long enough because it takes time to move coins from one exchange to the other in order to successfully complete the arbitrage trades. The datasets used for modeling are generated by getting all of the combinations of 2 exchanges that support the same trading pair, engineering technical analysis features, merging that data on 'closing_time', engineering more features, and creating a target that signals an arbitrage opportunity. Arbitrage signals predicted by the models have a direction indicating which direction the arbitrage occurs in.

#### What kind of machine learning problem is this?
Arbitrage can occur in two directions, from the first exchange to the second and vice versa. Together with the case where no arbitrage occurs, that gives 3 possible classes for the target, which makes this a multiclass classification problem.

#### Where does the data come from?
Data for this project was obtained through the APIs of each exchange.

<img src="https://github.com/Cryptolytic-app/cryptolyticapp/blob/master/modeling/assets/exchange_logos.png?raw=true"
     alt= "drawing"
     width="1000"/>
     
- [Bitfinex API OHLCV Data Documentation](https://docs.bitfinex.com/reference#rest-public-candles)
- [Coinbase Pro API OHLCV Data Documentation](https://docs.pro.coinbase.com/?r=1#get-historic-rates)
- [HitBTC OHLCV Data Documentation](https://api.hitbtc.com/#candles)
- [Kraken OHLCV Data Documentation](https://www.kraken.com/features/api)
- [Gemini OHLCV Data Documentation](https://docs.gemini.com/rest-api/)

The functions to collect the 5 min and 1 hour candlestick data can be found [here](https://github.com/Cryptolytic-app/cryptolyticapp/tree/master/data_collection_and_databasing).


#### What does the raw data look like?
<img src="https://github.com/Cryptolytic-app/cryptolyticapp/blob/master/modeling/assets/sample_df.png?raw=true"
     alt="drawing"
     width="400"/>
     
#### Data Dictionary
- **closing_time:** the closing time of the candlestick
- **open:** the price of the cryptocurrency at the opening of the candlestick
- **high:** the highest price of the cryptocurrency during that candlestick
- **low:** the lowest price of the cryptocurrency during that candlestick
- **close:** the price of the cryptocurrency at the end of the candlestick
- **volume:** the volume traded during that candlestick


#### Features Engineered
67 technical analysis features were engineered with the [Technical Analysis Library](https://github.com/bukosabino/ta). They fall into five categories:
- Momentum indicators
- Volume indicators
- Volatility indicators
- Trend indicators
- Other indicators

#### Merging Datasets
Since there are no extensive arbitrage datasets available covering all cryptocurrencies trading at major exchanges, we had to create our own by merging datasets that included the same trading pair at 2 exchanges and then create the target variable with more feature engineering. 

First, we generated all of the possible arbitrage combinations between the 80 datasets that we collected through the exchange APIs, which resulted in 95 possible combinations of arbitrage datasets that could be used in modeling.

For each possible arbitrage combination, we merged the two datasets on 'closing_time' and created new features that identified arbitrage opportunities. These features included:
- **higher_closing_price:** identifies the exchange that has the higher closing price (1 or 2, or 0 when the closing prices are equal)
- **pct_higher:** the percentage by which the higher closing price exceeds the lower one
- **arbitrage_opportunity:** identifies whether the price difference is at least 0.55% (the threshold used to account for fees)
    - 1: arbitrage from exchange 1 to exchange 2
    - 0: no arbitrage
    - -1: arbitrage from exchange 2 to exchange 1
- **close_exchange_1_shift:** shifts the exchange 1 close price to account for a 30 min trading interval + 10 min advance prediction
- **close_exchange_2_shift:** shifts the exchange 2 close price to account for a 30 min trading interval + 10 min advance prediction
- **window_length:** gets the window length of the arbitrage opportunity
- **window_length_shift:** shifts the window length by the 30 min trading interval plus one extra 5 min candle (7 rows in total), so the target can be predicted 10 mins in advance
- **arbitrage_opportunity_shift:** shifts the arbitrage opportunity by the same 7 rows
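
As a rough illustration of the first three features, the sketch below labels a few hypothetical closing prices. It uses vectorized pandas and `numpy.select` rather than the row-wise functions defined later in this notebook, and the toy prices and the `toy` dataframe name are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# hypothetical closing prices for one trading pair on two exchanges
toy = pd.DataFrame({
    'close_exchange_1': [100.0, 100.0, 101.0, 100.0],
    'close_exchange_2': [100.0, 100.3, 100.0, 101.0],
})

closes = toy[['close_exchange_1', 'close_exchange_2']]

# percentage by which the higher close exceeds the lower close
toy['pct_higher'] = (closes.max(axis=1) / closes.min(axis=1) - 1) * 100

# 1: buy on exchange 1 and sell on exchange 2, -1: the reverse, 0: no arbitrage
fee_pct = 0.55
covers_fees = toy['pct_higher'] >= fee_pct
toy['arbitrage_opportunity'] = np.select(
    [
        covers_fees & (toy['close_exchange_2'] > toy['close_exchange_1']),
        covers_fees & (toy['close_exchange_1'] > toy['close_exchange_2']),
    ],
    [1, -1],
    default=0,
)
print(toy)
```

The second row shows why the fee threshold matters: exchange 2 is priced higher, but the 0.3% difference would not cover the assumed fees, so the row is labeled 0.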

Finally, we created the **target** using those features:

- 1: arbitrage from exchange 1 to exchange 2 starting 10 mins after prediction time, lasting 30 mins
- 0: no arbitrage 10 mins after prediction time
- -1: arbitrage from exchange 2 to exchange 1 starting 10 mins after prediction time, lasting 30 mins
    
#### Class distribution
Since arbitrage is not always occurring, the target classes in these datasets are expected to be imbalanced. Only datasets that contain arbitrage in both directions are suitable for machine learning, so we filtered the 95 possible datasets down to the ones that contain at least 5% of each class. This resulted in 15 datasets that could be used for training models:

<img src="https://github.com/Cryptolytic-app/cryptolyticapp/blob/master/modeling/assets/class_dist.png?raw=true"
     alt="drawing"
     width="400"/>
     

#### Running this notebook 
Runtime approx. 12 hours

Data exported: 26 GB

- 80 csv files in `data/raw_data/` (641 MB)
- 80 csv files in `data/ta_data/` (12.943 GB)
- 95 csv files in `data/arb_data/` (12.861 GB)
- 1 csv file in `data/` (4 KB)
- 1 txt file in `data/` (4 KB)


Notes:
- It is HIGHLY recommended to use AWS SageMaker for feature engineering and to split the work across four notebooks. This cuts the runtime down to about 3 hours. You can do this by slicing the list of dataset filepaths that you pass into the functions (for example: `create_ta_csvs(csv_filepaths[:20])`).
- Since feature engineering takes a long time, we export data as csvs along each step to not have to re-engineer features in case the runtime restarts: one export after technical analysis features are added, and another export after datasets are merged and more arbitrage features are added.
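
One way to produce the slices for the four notebooks is sketched below. The helper `split_into_chunks` and the placeholder filepaths are hypothetical; each notebook would then pass one chunk to the processing functions.

```python
def split_into_chunks(items, n_chunks=4):
    """Split a list into n_chunks contiguous slices of near-equal size."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `remainder` chunks take one extra item
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# hypothetical stand-in for the 80 raw csv filepaths
example_filepaths = [f"data/raw_data/file_{i}.csv" for i in range(80)]
chunks = split_into_chunks(example_filepaths)
print([len(chunk) for chunk in chunks])
```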

#### Directory Structure
```
├── cryptolytic/                        <-- Root directory   
│   ├── modeling/                       <-- Directory for modeling work
│   │      │
│   │      ├──assets/                   <-- Directory with png assets used in notebooks
│   │      │
│   │      ├──data/                     <-- Directory containing all data for project
│   │      │   ├─ arb_data/             <-- Directory for train data after merging + FE pt.2
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ arb_preds_test_data/  <-- Directory for test data w/ predictions
│   │      │   │   └── *.csv 
│   │      │   │
│   │      │   ├─ arb_top_data/         <-- Directory for data from the best models
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ raw_data/             <-- Directory for raw training data
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ ta_data/              <-- Directory for csv files after FE pt.1 
│   │      │   │   └── *.csv
│   │      │   │
│   │      │   ├─ zip_raw_data/         <-- Directory containing zip files of raw data
│   │      │   │   └── *.zip
│   │      │   │
│   │      │   ├─ all_features.txt      <-- All features used in baseline models
│   │      │   │
│   │      │   ├─ top_features.txt      <-- Most important features for models
│   │      │   │
│   │      │   ├─ model_perf.csv        <-- Data from training baseline models and tuning
│   │      │   │
│   │      │   ├─ top_model_perf.csv    <-- Data from retraining and exporting best models
│   │      │
│   │      ├── models/                  <-- Directory for all pickle models
│   │      │      └── *.pkl
│   │      │
│   │      ├─ 1_arbitrage_data_processing.ipynb  <-- NB for data processing and creating csv
│   │      │
│   │      ├─ 2_arbitrage_model_training.ipynb   <-- NB for baseline models and tuning
│   │      │
│   │      ├─ 3_arbitrage_model_evaluation.ipynb <-- NB for model selection, eval, and viz
│   │      │
│   │      ├─ trade_recommender_models.ipynb     <-- NB for trade recommender models
│   │      │
│   │      ├─ environment.yml                    <-- Contains project dependencies
│   │      │
│   │      ├─ utils.py                           <-- All the functions used in modeling
│   │      │

```

# Imports

This project uses conda to manage environments.

In [3]:
# to update your conda env from a yml file from terminal
# conda env update --file modeling/environment.yml

# to export yml from terminal
# conda env export > modeling/environment.yml

In [3]:
import glob
import os
import pickle
import json
import itertools
from zipfile import ZipFile
import warnings

warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import datetime as dt
pd.set_option('display.max_rows', 100000)

from ta import add_all_ta_features

# Data

In this section we'll unzip the files that contain the 5 minute candle data, save them to a new directory `raw_data/`, and rename the filepaths that contain coinbase_pro to cbpro (the underscore causes problems with naming conventions later). You must start with the zip data in the `data/zip_raw_data/` directory, which can be found on [GitHub](https://github.com/Cryptolytic-app/cryptolyticapp/tree/master/modeling/data/zip_raw_data).

#### Unzipping

In [44]:
zip_filepaths = glob.glob('data/zip_raw_data/*300.zip')
len(zip_filepaths) #5

5

In [4]:
# UNCOMMENT THIS TO UNZIP

# for zip_filepath in zip_filepaths:
#     with ZipFile(zip_filepath, 'r') as zip_ref:
#         zip_ref.extractall('data/raw_data')

#### Renaming

In [5]:
csv_filepaths = glob.glob('data/raw_data/*.csv')
len(csv_filepaths) #80

80

In [None]:
# UNCOMMENT THIS TO RENAME

# for filepath in csv_filepaths:
#     new_filepath = filepath.replace('coinbase_pro', 'cbpro')
#     os.rename(filepath, new_filepath)

#### Raw datasets look like this

In [60]:
pd.read_csv('data/raw_data/bitfinex_ltc_usd_300.csv', index_col=0).head()

|   | closing_time | open | high | low | close | base_volume |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1417990200 | 3.74 | 3.74 | 3.738 | 3.738 | 26.698132 |
| 1 | 1417990500 | 3.7254 | 3.7254 | 3.725 | 3.725 | 6.469 |
| 2 | 1417990800 | 3.7221 | 3.7221 | 3.7221 | 3.7221 | 15.0 |
| 3 | 1417991100 | 3.7161 | 3.7161 | 3.7161 | 3.7161 | 1.061168 |
| 4 | 1417991700 | 3.722 | 3.722 | 3.722 | 3.722 | 4.650807 |


#### And we want to transform it into something that looks like this for modeling...

In [61]:
pd.read_csv('data/arb_data/cbpro_bitfinex_ltc_usd.csv', index_col=0).head()

Unnamed: 0,open_exchange_1,high_exchange_1,low_exchange_1,close_exchange_1,base_volume_exchange_1,nan_ohlcv_exchange_1,volume_adi_exchange_1,volume_obv_exchange_1,volume_cmf_exchange_1,volume_fi_exchange_1,...,year,month,day,higher_closing_price,pct_higher,arbitrage_opportunity,window_length,arbitrage_opportunity_shift,window_length_shift,target
0,3.7,3.7,3.7,3.7,5.0,0.0,19.605746,0.0,0.0,0.0,...,2016,8,17,1,3.064067,-1,5,-1.0,40.0,-1
1,3.7,3.7,3.7,3.7,5.0,0.0,0.0,0.0,0.0,0.0,...,2016,8,17,1,3.064067,-1,10,-1.0,45.0,-1
2,3.7,3.7,3.7,3.7,0.0,1.0,0.0,0.0,0.0,-0.0,...,2016,8,17,1,3.064067,-1,15,-1.0,50.0,-1
3,3.7,3.7,3.7,3.7,0.0,1.0,0.0,0.0,0.0,-0.0,...,2016,8,17,1,3.064067,-1,20,-1.0,55.0,-1
4,3.7,3.7,3.7,3.7,0.0,1.0,0.0,0.0,0.0,0.0,...,2016,8,17,1,3.064067,-1,25,-1.0,60.0,-1


# Functions

#### Arbitrage Combinations

In [96]:
# function to create pairs for arbitrage datasets
def get_file_pairs(filenames):
    """
    Takes in a list of csv filepaths and finds all possible
    combinations of 2 files from different exchanges that share
    the same trading pair. Returns a list of tuples, each holding
    one pair of filepaths.
    """
    # get combinations
    combos = list(itertools.combinations(filenames, 2))
    
    # remove unmatched trading pairs
    filtered_combos = []
    for combo in combos:
        tp1 = '_'.join(combo[0].split('/')[2].split('_')[1:3])
        tp2 = '_'.join(combo[1].split('/')[2].split('_')[1:3])
        if tp1 == tp2:
            filtered_combos.append(combo)
                    
    return filtered_combos

In [97]:
pairs = get_file_pairs(csv_filepaths)
print(len(pairs)) # 95

95
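
For a sense of what the pairing does, here is a small standalone sketch on hypothetical filepaths that follow the `exchange_base_quote_300.csv` naming used in `data/raw_data/`. The helper `trading_pair` and the `toy_` names are assumptions for the example, and it uses `os.path.basename` instead of splitting on `/`.

```python
import itertools
import os


def trading_pair(path):
    """Extract the trading pair, e.g. 'ltc_usd', from a path such as
    'data/raw_data/bitfinex_ltc_usd_300.csv'."""
    return '_'.join(os.path.basename(path).split('_')[1:3])


# hypothetical filepaths: three exchanges list ltc_usd, one lists eth_btc
toy_paths = [
    'data/raw_data/bitfinex_ltc_usd_300.csv',
    'data/raw_data/cbpro_ltc_usd_300.csv',
    'data/raw_data/kraken_ltc_usd_300.csv',
    'data/raw_data/gemini_eth_btc_300.csv',
]

# keep only combinations of two files with the same trading pair
toy_pairs = [
    combo for combo in itertools.combinations(toy_paths, 2)
    if trading_pair(combo[0]) == trading_pair(combo[1])
]

for first, second in toy_pairs:
    print(os.path.basename(first), '<->', os.path.basename(second))
```

Three exchanges sharing one trading pair yield three combinations, while the lone `eth_btc` file has no partner and drops out.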


#### OHLCV Data Resampling

In [98]:
def resample_ohlcv(df, period):
    """ 
    Changes the time period on cryptocurrency ohlcv data.
    Period is a string denoted by '{time_in_minutes}T'
    (ex: '1T', '5T', '60T').
    """

    # set date as index
    # needed for the function to run
    df = df.set_index(['date'])

    # aggregation function
    ohlc_dict = {
        'open':'first',                                                                                                    
        'high':'max',                                                                                                       
        'low':'min',                                                                                                        
        'close': 'last',                                                                                                    
        'base_volume': 'sum'
    }

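    # resample; the `how` argument is no longer available in newer
    # pandas versions, so the aggregation is applied with .agg instead
    df = df.resample(period, closed='left', label='left').agg(ohlc_dict)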
    
    return df
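
The sketch below shows the effect of resampling on a few hypothetical candles with one candle missing. It applies the aggregation with `.agg` and uses the `'5min'` alias, since recent pandas releases flag the `'T'` alias as deprecated; the data and variable names are made up for illustration.

```python
import pandas as pd

# hypothetical 5 minute candles with the 10:10 candle missing
candles = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01 10:00', '2019-01-01 10:05',
                            '2019-01-01 10:15']),
    'open': [1.0, 2.0, 4.0],
    'high': [1.5, 2.5, 4.5],
    'low': [0.5, 1.5, 3.5],
    'close': [1.2, 2.2, 4.2],
    'base_volume': [10.0, 20.0, 40.0],
}).set_index('date')

# resampling onto a regular 5 minute grid inserts a row for the missing candle
resampled = candles.resample('5min', closed='left', label='left').agg({
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'base_volume': 'sum',
})
print(resampled)
```

The inserted 10:10 row has NaN prices and a volume of 0, which is what makes gaps in the raw data visible to the later steps.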

#### Filling NaNs

The `resample_ohlcv` function creates NaNs in the dataframe wherever there were gaps in the data. The gaps could be caused by exchanges being down, errors from Cryptowatch, or errors from the exchanges themselves.

In [None]:
def fill_nan(df):
    """
    Iterates through a dataframe and fills NaNs with appropriate 
    open, high, low, close values.
    """
    # Forward fill close column
    df['close'] = df['close'].ffill()

    # Backward fill across columns so that the open, high, low values
    # in gap rows take the close value
    df = df.bfill(axis=1)

    return df
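
A small illustration of the two fill steps on hypothetical data, where the second row stands in for a gap created by resampling:

```python
import numpy as np
import pandas as pd

# hypothetical resampled candles; the second row is a gap
gappy = pd.DataFrame({
    'open': [1.0, np.nan, 3.0],
    'high': [1.5, np.nan, 3.5],
    'low': [0.5, np.nan, 2.5],
    'close': [1.2, np.nan, 3.2],
    'base_volume': [10.0, 0.0, 30.0],
})

# carry the last known close forward into the gap
gappy['close'] = gappy['close'].ffill()

# fill open, high, low from the next non-null column to their right (close)
filled = gappy.bfill(axis=1)
print(filled)
```

The gap row ends up as a flat candle at the previous close with zero volume. This relies on `close` sitting to the right of `open`, `high`, and `low` in the column order.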

#### Feature engineering - before merge

In [None]:
def engineer_features(df, period='5T'):
    """
    Takes a df of raw ohlcv data, resamples it to the given period
    (default '5T'), engineers ta features, and returns the new df
    """
    # convert unix closing_time to datetime
    df['date'] = pd.to_datetime(df['closing_time'], unit='s')
    
    # time resampling to fill gaps in data
    df = resample_ohlcv(df, period)
    
    # move date off the index
    df = df.reset_index()
    
    # create closing_time
    closing_time = df.date.values
    df.drop(columns='date', inplace=True)
    
    # create feature to indicate where rows were gaps in data
    df['nan_ohlcv'] = df['close'].apply(lambda x: 1 if pd.isnull(x) else 0)
    
    # fill gaps in data
    df = fill_nan(df)

    # adding all the technical analysis features...
    df = add_all_ta_features(df, 'open', 'high', 'low', 'close','base_volume', fillna=True)
    
    # add closing time column
    df['closing_time'] = closing_time
    
    return df

#### Feature Engineering - after merge

In [2]:
def get_higher_closing_price(df):
    """
    Returns the exchange with the higher closing price
    """
    
    # exchange 1 has higher closing price
    if (df['close_exchange_1'] - df['close_exchange_2']) > 0:
        return 1
    
    # exchange 2 has higher closing price
    elif (df['close_exchange_1'] - df['close_exchange_2']) < 0:
        return 2
    
    # closing prices are equivalent
    else:
        return 0

def get_pct_higher(df):
    """
    Returns the percentage of the difference between ex1/ex2 
    closing prices
    """
    
    # if exchange 1 has a higher closing price than exchange 2
    if df['higher_closing_price'] == 1:
        
        # % difference
        return ((df['close_exchange_1'] / 
                 df['close_exchange_2'])-1)*100
    
    # if exchange 2 has a higher closing price than exchange 1
    elif df['higher_closing_price'] == 2:
        
        # % difference
        return ((df['close_exchange_2'] / 
                 df['close_exchange_1'])-1)*100
    
    # if closing prices are equivalent
    else:
        return 0

def get_arb_opportunity(df):
    """
    Return available arbitrage opportunities
    
    1: arbitrage from exchange 1 to exchange 2
    0: no arbitrage
    -1: arbitrage from exchange 2 to exchange 1
    """
    
    # assuming the total fees are 0.55%, if the higher closing price 
    # is less than 0.55% higher than the lower closing price
    if df['pct_higher'] < .55:
        return 0 # no arbitrage
    
    # if exchange 1 closing price is more than 0.55% higher
    # than the exchange 2 closing price
    elif df['higher_closing_price'] == 1:
        return -1 # arbitrage from exchange 2 to exchange 1
    
    # if exchange 2 closing price is more than 0.55% higher
    # than the exchange 1 closing price
    elif df['higher_closing_price'] == 2:
        return 1 # arbitrage from exchange 1 to exchange 2

def get_window_length(df):
    """
    Creates a column 'window_length' to show how long an arbitrage 
    opportunity has lasted
    """
    
    # convert arbitrage_opportunity column to a list
    target_list = df['arbitrage_opportunity'].to_list()
    
    # set initial window length 
    window_length = 5 # time in minutes
    
    # list for window_lengths
    window_lengths = []
    
    # iterate through arbitrage_opportunity column
    for i in range(len(target_list)):
        
        # check if a value in the arbitrage_opportunity column is 
        # equal to the previous value in the arbitrage_opportunity 
        # column and increase window length; the first row has no
        # previous value (index i-1 would wrap around to the last row)
        if i > 0 and target_list[i] == target_list[i-1]:
            window_length += 5
            window_lengths.append(window_length)
            
        # if a value in the arbitrage_opportunity column is
        # not equal to the previous value in the arbitrage_opportunity 
        # column reset the window length to five minutes
        else:
            window_length = 5
            window_lengths.append(window_length)
            
    # create window length column showing how long an arbitrage 
    # opportunity has lasted
    df['window_length'] = window_lengths

    return df
        

def merge_dfs(df1, df2):
    """
    Merges two dataframes and adds final features for arbitrage data
    
    Returns a dataframe with new features:
    - year
    - month
    - day
    - higher_closing_price
    - pct_higher
    - arbitrage_opportunity
    - window_length
    """
    df = pd.merge(df1, df2, on='closing_time',
                  suffixes=('_exchange_1', '_exchange_2'))

    # convert closing_time to datetime
    df['closing_time'] = pd.to_datetime(df['closing_time']) 

    # Create additional date features.
    df['year'] = df['closing_time'].dt.year
    df['month'] = df['closing_time'].dt.month
    df['day'] = df['closing_time'].dt.day
    print('before get higher closing price', df.shape)
    
    # get higher_closing_price feature to create pct_higher feature
    df['higher_closing_price'] = df.apply(get_higher_closing_price, axis=1)
    
    # get pct_higher feature to create arbitrage_opportunity feature
    df['pct_higher'] = df.apply(get_pct_higher, axis=1)
    
    # create arbitrage_opportunity feature
    df['arbitrage_opportunity'] = df.apply(get_arb_opportunity, axis=1)
    
    # create window_length feature
    df = get_window_length(df)
    
    return df
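
The window length is a running count of how long the current signal has stayed the same. Here is a minimal standalone sketch of that idea on a plain list, assuming 5 minute candles; the helper name `window_lengths_for` is hypothetical.

```python
def window_lengths_for(signals, candle_minutes=5):
    """Minutes that each run of identical consecutive signals has lasted so far."""
    lengths = []
    for i, signal in enumerate(signals):
        if i > 0 and signal == signals[i - 1]:
            # same signal as the previous candle: extend the current run
            lengths.append(lengths[-1] + candle_minutes)
        else:
            # first candle or a changed signal: start a new run
            lengths.append(candle_minutes)
    return lengths


print(window_lengths_for([0, 1, 1, 1, -1, -1, 0]))
```

Runs of 0 (no arbitrage) are counted the same way as runs of 1 or -1, so the window length only becomes meaningful in combination with the signal itself.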

#### Creating the target

In [None]:
# specifying arbitrage window length to target, in minutes
interval = 30

def get_target_value(df, interval=30):
    """
    Checks for arbitrage opportunities and returns a target. 
    
    Classes:
    - 1: arbitrage from exchange 1 to exchange 2 starting 10 mins after prediction time, lasting 30 mins
    - 0: no arbitrage 10 mins after prediction time
    - -1: arbitrage from exchange 2 to exchange 1 starting 10 mins after prediction time, lasting 30 mins
    """
    
    # if the arbitrage window is as long as the targeted interval
    if df['window_length_shift'] >= interval:
        # if that window is for exchange 1 to 2
        if df['arbitrage_opportunity_shift'] == 1:
            return 1 # arbitrage from exchange 1 to 2
        
        # if that window is for exchange 2 to 1
        elif df['arbitrage_opportunity_shift'] == -1:
            return -1 # arbitrage from exchange 2 to 1
        
        # if no arbitrage opportunity
        elif df['arbitrage_opportunity_shift'] == 0:
            return 0 # no arbitrage opportunity
        
    # if the arbitrage window is less than our targeted interval
    else:
        return 0 # no arbitrage opportunity
    

def get_target(df, interval=interval):
    """
    Create new features and target.
    
    Returns a dataframe with new features:
    - arbitrage_opportunity_shift
    - window_length_shift
    - target
    """
    
    # used to shift rows
    # assumes candle length is five minutes, interval is 30 mins
    rows_to_shift = int(-1*(interval/5)) # -6
    
    # arbitrage_opportunity feature, shifted by length of targeted interval
    # minus one to predict ten minutes in advance rather than five
    df['arbitrage_opportunity_shift'] = df['arbitrage_opportunity'].shift(
        rows_to_shift - 1)
    
    # window_length feature, shifted by length of targeted interval minus one
    # to predict ten minutes
    df['window_length_shift'] = df['window_length'].shift(rows_to_shift - 1)
    
    # creating target column; this will indicate if an arbitrage opportunity
    # that lasts as long as the targeted interval is forthcoming
    # (interval is passed through so a non-default value is respected)
    df['target'] = df.apply(get_target_value, axis=1, interval=interval)
    
    # dropping rows where target could not be calculated due to shift
    df = df[:rows_to_shift - 1] # -7
    
    return df

def get_close_shift(df, interval=interval):
    """
    Shifts the closing prices by the selected interval +
    10 mins.
    
    Returns a df with new features:
    - close_exchange_1_shift
    - close_exchange_2_shift
    """
    
    rows_to_shift = int(-1*(interval/5))
    
    df['close_exchange_1_shift'] = df['close_exchange_1'].shift(
        rows_to_shift - 2)
    
    df['close_exchange_2_shift'] = df['close_exchange_2'].shift(
        rows_to_shift - 2)
    
    return df

def get_profit(df):
    """
    Calculates the profit of an arbitrage trade and returns df 
    with new profit feature.
    """
    
    # if exchange 1 has the higher closing price
    if df['higher_closing_price'] == 1:
        
        # return how much money you would make if you bought 
        # on exchange 2, sold on exchange 1, and took account 
        # of 0.55% fees
        return (((df['close_exchange_1_shift'] / 
                 df['close_exchange_2'])-1)*100)-.55
    
    # if exchange 2 has the higher closing price
    elif df['higher_closing_price'] == 2:
        
        # return how much money you would make if you bought 
        # on exchange 1, sold on exchange 2, and took account 
        # of 0.55% fees
        return (((df['close_exchange_2_shift'] / 
                 df['close_exchange_1'])-1)*100)-.55
    
    # if the closing prices are the same
    else:
        return 0 # no arbitrage
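
To see how the 7 row shift turns a run of arbitrage into a target, the sketch below works through a hypothetical signal. It assumes 5 minute candles and a 30 min interval, and it uses vectorized pandas operations rather than the row-wise functions above; all variable names are made up for the example.

```python
import pandas as pd

interval_minutes = 30
candle_minutes = 5
rows = interval_minutes // candle_minutes + 1   # 7 rows, i.e. 35 minutes

# hypothetical signal: arbitrage from exchange 1 to 2 during candles 2 to 9
signal = pd.Series([0, 0] + [1] * 8 + [0, 0])

# running length in minutes of each run of identical signals
run_id = (signal != signal.shift()).cumsum()
window_length = (signal.groupby(run_id).cumcount() + 1) * candle_minutes

# look 7 rows ahead from every candle
shifted_signal = signal.shift(-rows)
shifted_length = window_length.shift(-rows)

# keep the future signal only if it has lasted at least the full interval
toy_target = shifted_signal.where(shifted_length >= interval_minutes, 0)

# the last 7 rows have no future values to look at, so drop them
toy_target = toy_target.iloc[:-rows].astype(int)
print(toy_target.tolist())
```

For the first candle, the arbitrage run begins two candles (10 mins) later and has lasted 30 mins by the time the shifted row is reached, which matches the definition of class 1.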

#### Split names when in the format exchange_trading_pair

In [None]:
def get_exchange_trading_pair(ex_tp):
    """
    Splits exchange_trading_pair into separate variables.
    """

    ex_tp = ex_tp.split('/')[2]
    exchange = ex_tp.split('_')[0]
    trading_pair = '_'.join(ex_tp.split('_')[1:3])
        
    return exchange, trading_pair

### Generate all individual csv's with ta data (~2 hours)

Notes:
- create a `data/ta_data/` directory before running this function
- this function takes a really long time to run, so it's recommended to run it in SageMaker and divide the files across 4 notebooks so that each notebook processes about 20 files. It should take less than an hour if split across 4 notebooks.

In [None]:
def create_ta_csvs(csv_filepaths):
    """
    Takes a list of csv filepaths and, for each one, creates a 
    dataframe, engineers features, and saves it as a new csv 
    in data/ta_data/.
    """
    # counter
    n = 1
    
    for file in csv_filepaths:
        
        # create df
        df = pd.read_csv(file, index_col=0)
        
        # define period
        period = '5T'
        
        # engineer features
        df = engineer_features(df, period)
        print('features engineered')
        
        # generate new filename
        filename = 'data/ta_data/' + file.split('/')[2][:-4] + '_ta.csv'
        
        # export csv
        df.to_csv(filename)
        
        # print progress
        print(f'csv #{n} saved :)')
        
        # update counter
        n += 1

In [None]:
create_ta_csvs(csv_filepaths)

### Generate all arbitrage training data csv's (~9 hrs)

Notes:
- create a `data/arb_data/` directory before running this function
- this function takes a really long time to run, so it's recommended to run it in SageMaker and divide the pairs across 4 notebooks so that each notebook processes about 24 pairs. It should take ~2-3 hours if split across 4 notebooks.

In [None]:
def create_arb_csvs(pairs):
    """
    Takes a list of possible arbitrage combinations, finds the 
    appropriate datasets in /ta_data, loads the datasets, merges them,
    engineers more features, creates a target, and exports the new
    dataset as a csv.
    """
    
    # counter
    n = 1
    
    # iterate through arbitrage combinations
    for pair in pairs:
        
        # define paths for the csv
        csv_1 = 'data/ta_data/' + pair[0].split('/')[2][:-4] + '_ta.csv'
        csv_2 = 'data/ta_data/' + pair[1].split('/')[2][:-4] + '_ta.csv'
        print('csv1, csv2:', csv_1, csv_2)
        
        # define exchanges and trading_pairs
        ex_tp_1, ex_tp_2 = pair[0][:-8], pair[1][:-8]
        print(ex_tp_1)
        ex1, tp1 = get_exchange_trading_pair(ex_tp_1)
        ex2, tp2 = get_exchange_trading_pair(ex_tp_2)
        print(ex1, tp1,  ex2, tp2)
        
        # define model_name for the filename
        model_name = ex1 + '_' + ex_tp_2.split('/')[2]
        print(model_name)
          
        # create dfs from csv's that already include ta features
        df1, df2 = pd.read_csv(csv_1, index_col=0), pd.read_csv(csv_2, index_col=0)       
        print('df 1 shape: ', df1.shape, 'df 2 shape: ', df2.shape)

        # merge dfs
        df = merge_dfs(df1, df2)
        print('dfs merged')
        print('merged df shape:' , df.shape)
        

        # create target 
        df = get_target(df)
        print(model_name, ' ', df.shape)

        # export csv
        path = 'data/arb_data/'
        csv_filename = path + model_name + '.csv'
        df.to_csv(csv_filename)

        # print progress
        print(f'csv #{n} saved :)')

        # update counter
        n += 1


In [None]:
create_arb_csvs(pairs)

#### Check to make sure all csv's saved

In [7]:
arb_data_paths = glob.glob('data/arb_data/*.csv')
print(len(arb_data_paths)) # 95

95


## Multi-Class Distribution
The final arbitrage datasets that will be used during modeling contain 3 classes:
- 1: arbitrage from exchange 1 to exchange 2 starting 10 mins after prediction time, lasting 30 mins
- 0: no arbitrage 10 mins after prediction time
- -1: arbitrage from exchange 2 to exchange 1 starting 10 mins after prediction time, lasting 30 mins

We need to look at the class distribution to understand if these datasets contain enough of the 1 and -1 class to make arbitrage predictions.

In [8]:
def class_distribution(arb_data_paths):
    """
    Returns a df of the class distribution for all arbitrage
    datasets
    """
    rows = []
    
    for path in arb_data_paths:
        df = pd.read_csv(path, index_col=0)
        arbitrage_combination = path.split('/')[2][:-4]
        
        # share of each class in the target column; not every dataset
        # has all 3 classes, so a missing class defaults to 0
        shares = df['target'].value_counts(normalize=True)

        rows.append({
            'arbitrage_combination': arbitrage_combination,
            'ex1_to_ex2_arb': round(shares.get(1, 0), 2),
            'ex2_to_ex1_arb': round(shares.get(-1, 0), 2),
            'no_arb': round(shares.get(0, 0), 2)
        })
        
    # build the dataframe once at the end; DataFrame.append is not
    # available in newer pandas versions
    return pd.DataFrame(rows)
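
As a quick check of how class shares behave when a class is missing, here is a small sketch on a hypothetical target column. It uses `value_counts(normalize=True)` with `reindex` so that absent classes show up with a share of 0; the data is made up for illustration.

```python
import pandas as pd

# hypothetical target column containing only two of the three classes
toy_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

shares = (
    toy_target.value_counts(normalize=True)
    .reindex([-1, 0, 1], fill_value=0.0)   # missing classes get a share of 0
    .round(2)
)
print(shares.to_dict())
```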

In [18]:
dist_df = class_distribution(arb_data_paths)
dist_df.head()

|   | arbitrage_combination | ex1_to_ex2_arb | ex2_to_ex1_arb | no_arb |
| --- | --- | --- | --- | --- |
| 0 | kraken_bitfinex_bch_btc | 0.0 | 0.0 | 1.0 |
| 1 | kraken_gemini_eth_btc | 0.01 | 0.01 | 0.98 |
| 2 | bitfinex_gemini_btc_usd | 0.0 | 0.0 | 1.0 |
| 3 | cbpro_gemini_bch_btc | 0.15 | 0.16 | 0.7 |
| 4 | bitfinex_kraken_btc_usd | 0.0 | 0.0 | 1.0 |


To see the datasets that contain the most arbitrage, let's sort by no_arb...

In [63]:
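# sort the full dataframe and preview the top 10; keeping all rows
# matters because the filter below runs on dist_df
dist_df = dist_df.sort_values(by='no_arb')
dist_df.head(10)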

|   | arbitrage_combination | ex1_to_ex2_arb | ex2_to_ex1_arb | no_arb |
| --- | --- | --- | --- | --- |
| 71 | kraken_cbpro_etc_usd | 0.01 | 0.6 | 0.39 |
| 7 | bitfinex_kraken_etc_usd | 0.59 | 0.01 | 0.39 |
| 39 | bitfinex_cbpro_bch_usd | 0.03 | 0.43 | 0.54 |
| 37 | bitfinex_cbpro_zrx_usd | 0.01 | 0.45 | 0.54 |
| 46 | bitfinex_cbpro_etc_usd | 0.07 | 0.36 | 0.57 |
| 60 | cbpro_bitfinex_ltc_usd | 0.24 | 0.12 | 0.64 |
| 69 | hitbtc_cbpro_eth_usdc | 0.17 | 0.17 | 0.66 |
| 12 | cbpro_bitfinex_eth_usd | 0.21 | 0.12 | 0.67 |
| 56 | bitfinex_hitbtc_bch_usdt | 0.19 | 0.14 | 0.67 |
| 92 | gemini_bitfinex_bch_btc | 0.16 | 0.16 | 0.68 |


These are the datasets that will be most suitable for training arbitrage models, but we also need models that predict arbitrage accurately in both directions. While the first 4 datasets above contain the most arbitrage occurrences to train on, they contain a high share of arbitrage in only one direction. Those datasets aren't suitable for modeling, and neither are the ones at the end of the sorted list because they contain almost no arbitrage at all. We can filter this dataframe for the datasets that contain more than 5% arbitrage in each direction to get the datasets that will be used in modeling.

In [41]:
dist_df_filtered = dist_df[(dist_df['ex1_to_ex2_arb'] > 0.05) & (dist_df['ex2_to_ex1_arb'] > 0.05)]
dist_df_filtered

|   | arbitrage_combination | ex1_to_ex2_arb | ex2_to_ex1_arb | no_arb |
| --- | --- | --- | --- | --- |
| 46 | bitfinex_cbpro_etc_usd | 0.07 | 0.36 | 0.57 |
| 60 | cbpro_bitfinex_ltc_usd | 0.24 | 0.12 | 0.64 |
| 69 | hitbtc_cbpro_eth_usdc | 0.17 | 0.17 | 0.66 |
| 12 | cbpro_bitfinex_eth_usd | 0.21 | 0.12 | 0.67 |
| 56 | bitfinex_hitbtc_bch_usdt | 0.19 | 0.14 | 0.67 |
| 92 | gemini_bitfinex_bch_btc | 0.16 | 0.16 | 0.68 |
| 23 | gemini_hitbtc_bch_btc | 0.16 | 0.14 | 0.69 |
| 3 | cbpro_gemini_bch_btc | 0.15 | 0.16 | 0.7 |
| 47 | kraken_gemini_bch_btc | 0.14 | 0.14 | 0.71 |
| 73 | bitfinex_cbpro_btc_usd | 0.08 | 0.19 | 0.73 |


#### Export final arbitrage training dataset paths to a txt file

In [67]:
data = dist_df_filtered['arbitrage_combination'].to_list()
train_data_paths = [f'data/arb_data/{d}.csv' for d in data]

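# note: despite the .txt extension, the list is pickled (binary),
# so it must be read back with pickle.load rather than as plain text
with open('data/train_data_paths.txt', 'wb') as fp:
    pickle.dump(train_data_paths, fp)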

print(f'Exported data: \n\n {train_data_paths}')

Exported data: 

 ['data/arb_data/bitfinex_cbpro_etc_usd.csv', 'data/arb_data/cbpro_bitfinex_ltc_usd.csv', 'data/arb_data/hitbtc_cbpro_eth_usdc.csv', 'data/arb_data/cbpro_bitfinex_eth_usd.csv', 'data/arb_data/bitfinex_hitbtc_bch_usdt.csv', 'data/arb_data/gemini_bitfinex_bch_btc.csv', 'data/arb_data/gemini_hitbtc_bch_btc.csv', 'data/arb_data/cbpro_gemini_bch_btc.csv', 'data/arb_data/kraken_gemini_bch_btc.csv', 'data/arb_data/bitfinex_cbpro_btc_usd.csv', 'data/arb_data/gemini_kraken_ltc_btc.csv', 'data/arb_data/bitfinex_hitbtc_ltc_usdt.csv', 'data/arb_data/gemini_bitfinex_ltc_btc.csv', 'data/arb_data/gemini_hitbtc_ltc_btc.csv', 'data/arb_data/gemini_cbpro_ltc_btc.csv']
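
Because the file is written with `pickle.dump` in binary mode, reading it back takes `pickle.load` with `'rb'`. The round trip below is a minimal sketch that writes to a temporary directory so it does not touch the project data; the two example paths are placeholders.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for the exported list of dataset paths
example_paths = [
    'data/arb_data/cbpro_bitfinex_ltc_usd.csv',
    'data/arb_data/gemini_bitfinex_bch_btc.csv',
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train_data_paths.txt')

    # write in binary mode, as in the export cell
    with open(path, 'wb') as fp:
        pickle.dump(example_paths, fp)

    # read back with pickle.load, also in binary mode
    with open(path, 'rb') as fp:
        loaded_paths = pickle.load(fp)

print(loaded_paths == example_paths)
```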


## Continued...
Model training will be carried out in the following notebook:
- [View on Github](https://github.com/Cryptolytic-app/cryptolyticapp/blob/master/modeling/2_arbitrage_model_training.ipynb)

- [Jump to local copy](2_arbitrage_model_training.ipynb)