# Contents

1. Setup
2. Read in the variables from the YAML file
3. Read in the processed data
4. Ascertain the most important features in our data
5. Analyse feature importance

# --------------------------------------------------------------------------------------------------------
# 1. Setup
# --------------------------------------------------------------------------------------------------------

### Import relevant packages

##### General packages

In [12]:
import os
import pandas as pd
from tqdm.auto import tqdm
import numpy as np

##### Packages for the YAML file

In [2]:
import yaml

### Set the maximum number of columns and rows displayed in output

In [3]:
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

### Load the '.py' file containing the functions used in this notebook

In [4]:
%load_ext autoreload
%autoreload 1
%aimport feature_importance

import feature_importance as fi

# --------------------------------------------------------------------------------------------------------
# 2. Read in the variables from the YAML file
# --------------------------------------------------------------------------------------------------------

### Read in YAML configuration file

In [5]:
with open("../../Config_files/config.yaml", "r") as variables:
    config_variables = yaml.safe_load(variables)

### Set the needed variables from this config file

In [6]:
data_directory = config_variables["data_directory"]

In [7]:
train_intervals_list = config_variables["train_intervals_list"]

In [8]:
shift_periods_in_days_list = config_variables["shift_periods_in_days_list"]

In [9]:
random_seed = config_variables["random_seed"]
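
The loop in Section 4 unpacks each entry of train_intervals_list into a start date and an end date, so a quick sanity check on the loaded values can catch a malformed config early. The sketch below is illustrative only: `check_intervals` is a hypothetical helper that is not part of feature_importance.py, the example interval is the one printed later in this notebook, and the dates are assumed to be in `YYYY-MM-DD` form.

```python
from datetime import date


def check_intervals(intervals):
    """Return the intervals as (start, end) date pairs, raising on bad input.

    Hypothetical helper: not part of feature_importance.py.
    """
    parsed = []
    for interval in intervals:
        if len(interval) != 2:
            raise ValueError(f"Expected [start, end], got: {interval!r}")
        # str() copes with YAML loading unquoted dates as datetime.date objects
        start, end = (date.fromisoformat(str(value)) for value in interval)
        if start >= end:
            raise ValueError(f"Start {start} is not before end {end}")
        parsed.append((start, end))
    return parsed


# Example values only; the real ones come from config.yaml
example_intervals = [["2015-08-17", "2020-12-31"]]
checked = check_intervals(example_intervals)
print(checked)
```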

# --------------------------------------------------------------------------------------------------------
# 3. Read in the processed data
# --------------------------------------------------------------------------------------------------------

At this stage, all the data is collected, processed, and stored in a CSV.

The data contains one entry for each day between the start and end dates. The start date is specified in the *scrape_config.yaml* file and determined which dates were scraped during the data collection step.

Where a feature has no data before a certain date, its values before that date appear as *NaN*. These dates are easy to see in the data_processing.ipynb notebook, and they should guide the choice of date interval used in the modelling process. The chosen intervals are specified in the config.yaml file under the variable train_intervals_list.
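
As a rough illustration of how those dates could be found programmatically, the sketch below looks up the first non-NaN date of each feature in a toy frame. The column names and values are made up for the example, and it only considers leading gaps, not NaNs that occur later in a series.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for processed_data; column names are illustrative
toy = pd.DataFrame(
    {
        "date": ["2009-01-01", "2009-01-02", "2009-01-03", "2009-01-04", "2009-01-05"],
        "feature_a": [np.nan, np.nan, 1.0, 2.0, 3.0],
        "feature_b": [0.5, 0.6, 0.7, 0.8, 0.9],
    }
)

indexed = toy.set_index("date")

# first_valid_index() returns the index label of the first non-NaN entry in a column
first_valid_dates = indexed.apply(lambda col: col.first_valid_index())
print(first_valid_dates.sort_values(ascending=False))

# The latest of these dates is the earliest start at which no feature has leading NaNs
earliest_safe_start = first_valid_dates.max()
print("Earliest start without leading NaNs:", earliest_safe_start)
```

Because the dates are ISO-formatted strings, comparing them as text gives the same ordering as comparing them as dates.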

### Read in the data

In [13]:
processed_data = pd.read_csv(os.path.join(data_directory, 'Processed_data', 'full_processed_data.csv'))

### Manually inspect this data

In [14]:
processed_data.shape

(4715, 1429)

In [15]:
processed_data.head()

Unnamed: 0,date,trading_volume,trades_per_minute,volatility,bid_and_ask_spread,bid_and_ask_sum_asks,bid_and_ask_sum_bids,time_between_blocks,block_size_votes,rank_bit_x,rank_bitfinex,rank_bitstamp,rank_btce,rank_coinbase,rank_itbit,rank_kraken,rank_mtgox,rank_okcoin,rank_others,marketcap3sma,marketcap7sma,marketcap14sma,marketcap30sma,marketcap90sma,marketcap3ema,marketcap7ema,marketcap14ema,marketcap30ema,marketcap90ema,marketcap3wma,marketcap7wma,marketcap14wma,marketcap30wma,marketcap90wma,marketcap3trx,marketcap7trx,marketcap14trx,marketcap30trx,marketcap90trx,marketcap3mom,marketcap7mom,marketcap14mom,marketcap30mom,marketcap90mom,marketcap3std,marketcap7std,marketcap14std,marketcap30std,marketcap90std,marketcap3var,...,us_5_yr_5_yr_fwd_inflatn_expectation,us_ted_spread,us_prime_loan_rate,us_unemployment_rate,us_long_natural_unemployment_rate,us_short_natural_unemployment_rate,us_labour_force_employ_rate,us_pop_employ_rate,us_num_unemployed,us_nonfarm_num_employed,us_num_employed_in_manufactoring,us_num_filing_for_unemployment,us_median_house_income,us_total_real_disposable_income,us_tot_personal_consumption_spend,us_tot_personal_consumption_spend_dg,us_percent_personal_saving_rate,us_real_retail_and_food_sales,us_total_disposble_income,us_industry_production_index,us_capacity_utilisation,us_new_housing_devs_started,us_gross_private_domestic_investment,us_corporate_profit_aftr_tax,us_financial_stress_index,west_texas_crude_oil_price,us_leading_index,us_currency_trade_weighted_dollar_index,us_broad_trade_weighted_dollar_index,us_total_public_debt,us_public_debt_as_perc_of_gdp,us_bank_excess_capital_reserves,us_total_commercial_loans,us_10_year_yield,us_5_year_yield,us_3_year_yield,us_2_year_yield,musk_num_tweets,musk_num_pos_tweets,musk_num_neg_tweets,musk_num_neut_tweets,musk_percent_pos,musk_percent_neg,musk_percent_neut,cos_dotw,sin_dotw,cos_dotm,sin_dotm,shifted_1,binary_price_change_1
0,2009-01-01,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,3.25,7.8,4.875262,5.212292,65.7,60.6,12058.0,134055.0,12561.0,,60200.0,11718.0,9847.2,1023.0,6.2,158979.0,10921.5,88.583,69.819,490.0,2014.878,1082.708,,,-2.32,,,11126941.0,77.10496,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.841254,0.540641,,0
1,2009-01-02,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,0.59,1.33,3.25,7.8,4.875262,5.212292,65.7,60.6,12058.0,134055.0,12561.0,,60200.0,11718.0,9847.2,1023.0,6.2,158979.0,10921.5,88.583,69.819,490.0,2014.878,1082.708,3.246,46.17,-2.32,79.257,107.2518,11126941.0,77.10496,,,2.9618,1.768,1.0916,0.7202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.8660254,0.841254,0.540641,,0
2,2009-01-03,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,0.59,1.33,3.25,7.8,4.875262,5.212292,65.7,60.6,12058.0,134055.0,12561.0,503000.0,60200.0,11718.0,9847.2,1023.0,6.2,158979.0,10921.5,88.583,69.819,490.0,2014.878,1082.708,3.246,46.17,-2.32,79.257,107.2518,11126941.0,77.10496,,,2.9618,1.768,1.0916,0.7202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.5,0.8660254,0.841254,0.540641,,0
3,2009-01-04,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,0.59,1.33,3.25,7.8,4.875262,5.212292,65.7,60.6,12058.0,134055.0,12561.0,503000.0,60200.0,11718.0,9847.2,1023.0,6.2,158979.0,10921.5,88.583,69.819,490.0,2014.878,1082.708,3.246,46.17,-2.32,79.257,107.2518,11126941.0,77.10496,,,2.9618,1.768,1.0916,0.7202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,1.224647e-16,0.841254,0.540641,,0
4,2009-01-05,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,0.49,1.28,3.25,7.8,4.875262,5.212292,65.7,60.6,12058.0,134055.0,12561.0,503000.0,60200.0,11718.0,9847.2,1023.0,6.2,158979.0,10921.5,88.583,69.819,490.0,2014.878,1082.708,3.246,48.61,-2.32,80.0914,107.5888,11126941.0,77.10496,,,3.013,1.7816,1.0812,0.7011,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.5,-0.8660254,0.841254,0.540641,,0


In [16]:
processed_data.tail()

Unnamed: 0,date,trading_volume,trades_per_minute,volatility,bid_and_ask_spread,bid_and_ask_sum_asks,bid_and_ask_sum_bids,time_between_blocks,block_size_votes,rank_bit_x,rank_bitfinex,rank_bitstamp,rank_btce,rank_coinbase,rank_itbit,rank_kraken,rank_mtgox,rank_okcoin,rank_others,marketcap3sma,marketcap7sma,marketcap14sma,marketcap30sma,marketcap90sma,marketcap3ema,marketcap7ema,marketcap14ema,marketcap30ema,marketcap90ema,marketcap3wma,marketcap7wma,marketcap14wma,marketcap30wma,marketcap90wma,marketcap3trx,marketcap7trx,marketcap14trx,marketcap30trx,marketcap90trx,marketcap3mom,marketcap7mom,marketcap14mom,marketcap30mom,marketcap90mom,marketcap3std,marketcap7std,marketcap14std,marketcap30std,marketcap90std,marketcap3var,...,us_5_yr_5_yr_fwd_inflatn_expectation,us_ted_spread,us_prime_loan_rate,us_unemployment_rate,us_long_natural_unemployment_rate,us_short_natural_unemployment_rate,us_labour_force_employ_rate,us_pop_employ_rate,us_num_unemployed,us_nonfarm_num_employed,us_num_employed_in_manufactoring,us_num_filing_for_unemployment,us_median_house_income,us_total_real_disposable_income,us_tot_personal_consumption_spend,us_tot_personal_consumption_spend_dg,us_percent_personal_saving_rate,us_real_retail_and_food_sales,us_total_disposble_income,us_industry_production_index,us_capacity_utilisation,us_new_housing_devs_started,us_gross_private_domestic_investment,us_corporate_profit_aftr_tax,us_financial_stress_index,west_texas_crude_oil_price,us_leading_index,us_currency_trade_weighted_dollar_index,us_broad_trade_weighted_dollar_index,us_total_public_debt,us_public_debt_as_perc_of_gdp,us_bank_excess_capital_reserves,us_total_commercial_loans,us_10_year_yield,us_5_year_yield,us_3_year_yield,us_2_year_yield,musk_num_tweets,musk_num_pos_tweets,musk_num_neg_tweets,musk_num_neut_tweets,musk_percent_pos,musk_percent_neg,musk_percent_neut,cos_dotw,sin_dotw,cos_dotm,sin_dotm,shifted_1,binary_price_change_1
4710,2021-11-24,5661.80646,119.848889,112.522703,1.513739,32040000.0,100236800.0,8.141667,20.0,0.021572,2058.1255,366.272559,1198.635333,1140.771794,251.796533,594.514759,23.146947,1123.792028,161.251842,1076167000000.0,1092176000000.0,1143873000000.0,1162191000000.0,1023935000000.0,1078588000000.0,1098451000000.0,1125758000000.0,1135610000000.0,1043831000000.0,1074257000000.0,1086328000000.0,1114505000000.0,1152111000000.0,1090419000000.0,-0.862,-0.77,-0.064,0.429,0.218,-44460430000.0,-58211560000.0,-189142500000.0,-108948300000.0,181466300000.0,9849844000.0,36230020000.0,118335200000.0,106918900000.0,289326100000.0,2.425485e+19,...,2.19,0.11,3.25,4.6,4.450651,4.450651,61.6,58.8,7419.0,148319.0,12529.0,199000.0,67521.0,15425.2,16290.7,2076.5,7.3,230623.0,18108.3,101.611,76.3674,1520.0,4090.667,2742.138,-0.07,76.74,1.72,90.8221,128.0097,28529436.0,125.45397,2854690.0,2436.4397,1.755,1.0914,0.7936,0.5963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-2.449294e-16,1.0,-1.133108e-15,58323.714436,1
4711,2021-11-25,5057.482717,112.791806,108.65499,1.540022,36687290.0,100637200.0,8.141667,20.0,0.020861,1931.775241,369.658249,1198.635333,1351.680054,251.796533,467.78348,23.146947,1123.792028,158.377601,1081359000000.0,1089876000000.0,1134865000000.0,1159615000000.0,1026175000000.0,1088573000000.0,1098478000000.0,1122131000000.0,1133220000000.0,1045034000000.0,1085452000000.0,1087923000000.0,1108463000000.0,1148005000000.0,1092059000000.0,-0.407,-0.747,-0.113,0.408,0.22,15574220000.0,-16095080000.0,-126119000000.0,-77275720000.0,201645400000.0,24407430000.0,32031830000.0,111354600000.0,109179600000.0,288478600000.0,1.489306e+20,...,2.19,0.11,3.25,4.6,4.450651,4.450651,61.6,58.8,7419.0,148319.0,12529.0,199000.0,67521.0,15425.2,16290.7,2076.5,7.3,230623.0,18108.3,101.611,76.3674,1520.0,4090.667,2742.138,-0.07,76.74,1.72,90.8221,128.0097,28529436.0,125.45397,2854690.0,2436.4397,1.755,1.0914,0.7936,0.5963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,-1.133108e-15,55945.976834,0
4712,2021-11-26,12721.213799,213.395278,154.675175,1.920435,36687290.0,100637200.0,8.141667,20.0,0.02021,1705.968533,385.889296,1198.635333,1351.680054,251.796533,479.332281,23.146947,1123.792028,151.294237,1075359000000.0,1086558000000.0,1123807000000.0,1157380000000.0,1027699000000.0,1072286000000.0,1087858000000.0,1113314000000.0,1128238000000.0,1045275000000.0,1072772000000.0,1079454000000.0,1097947000000.0,1141320000000.0,1092714000000.0,-0.563,-0.746,-0.162,0.385,0.222,-17997980000.0,-23225660000.0,-154814800000.0,-67036330000.0,137123500000.0,35170330000.0,39660650000.0,109725500000.0,114687800000.0,287642900000.0,3.092381e+20,...,2.18,0.11,3.25,4.6,4.450651,4.450651,61.6,58.8,7419.0,148319.0,12529.0,199000.0,67521.0,15425.2,16290.7,2076.5,7.3,230623.0,18108.3,101.611,76.3674,1520.0,4090.667,2742.138,-0.07,76.74,1.72,90.8221,128.0097,28529436.0,125.45397,2854690.0,2436.4397,1.755,1.0914,0.7936,0.5963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.8660254,1.0,-1.133108e-15,54861.649127,0
4713,2021-11-27,3933.096895,106.621806,117.585485,1.974035,36687290.0,100637200.0,8.141667,20.0,0.019612,1824.702467,323.37992,1198.635333,1351.680054,251.796533,500.738833,23.146947,1123.792028,159.430844,1062275000000.0,1075901000000.0,1111228000000.0,1153951000000.0,1029009000000.0,1052278000000.0,1073961000000.0,1102508000000.0,1122046000000.0,1044989000000.0,1051227000000.0,1065882000000.0,1085742000000.0,1133249000000.0,1092815000000.0,-0.924,-0.778,-0.212,0.361,0.223,-39251500000.0,-74599060000.0,-176106900000.0,-102881500000.0,117914300000.0,54847610000.0,50667180000.0,108431400000.0,122993500000.0,286638400000.0,7.52065e+20,...,2.18,0.11,3.25,4.6,4.450651,4.450651,61.6,58.8,7419.0,148319.0,12529.0,199000.0,67521.0,15425.2,16290.7,2076.5,7.3,230623.0,18108.3,101.611,76.3674,1520.0,4090.667,2742.138,-0.07,76.74,1.72,90.8221,128.0097,28529436.0,125.45397,2854690.0,2436.4397,1.755,1.0914,0.7936,0.5963,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.5,0.8660254,1.0,-1.133108e-15,,0
4714,2021-11-28,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1.0,1.224647e-16,1.0,-1.133108e-15,1.0,0


# --------------------------------------------------------------------------------------------------------
# 4. Ascertain the most important features in our data
# --------------------------------------------------------------------------------------------------------

As mentioned above, we have to specify an interval to run this analysis on by selecting a start and end date in the config.yaml file. The intervals are written as a list so that several can be chosen, allowing us to repeat the analysis across the different intervals. After subsetting the data to the desired time interval, any feature that still contains NaN values is dropped from the data.
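
The two steps described here, subsetting by date and dropping columns with NaNs, can be sketched with plain pandas. This is not the code in feature_importance.py: the helper names below are hypothetical, and treating both interval endpoints as inclusive is an assumption.

```python
import numpy as np
import pandas as pd


def subset_by_date(df, start_date, end_date, date_col="date"):
    """Keep rows whose date falls in [start_date, end_date] (assumed inclusive)."""
    dates = pd.to_datetime(df[date_col])
    mask = (dates >= pd.Timestamp(start_date)) & (dates <= pd.Timestamp(end_date))
    return df.loc[mask].copy()


def drop_columns_with_nans(df):
    """Drop every column that still contains at least one NaN."""
    nan_cols = df.columns[df.isna().any()].tolist()
    print("Dropping the following NaN columns:\n", nan_cols)
    return df.drop(columns=nan_cols)


# Toy data: one complete feature, one that starts late, one with a gap
toy = pd.DataFrame(
    {
        "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
        "always_present": [1.0, 2.0, 3.0, 4.0],
        "starts_late": [np.nan, np.nan, 5.0, 6.0],
        "has_gap": [1.0, 2.0, np.nan, 4.0],
    }
)

subset = drop_columns_with_nans(subset_by_date(toy, "2020-01-03", "2020-01-04"))
print(subset.columns.tolist())
```

In this toy case the late-starting feature survives because the interval begins after its first valid date, while the feature with a gap inside the interval is dropped.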

Here we analyse each of the features in this processed data to get a measure of their importance.
We use a RandomForestClassifier for this task and output a table containing all features in our dataset alongside their importance in helping classify the direction of the Bitcoin price.
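
For readers unfamiliar with the technique, the sketch below shows one common way to rank features with scikit-learn's RandomForestClassifier, using its impurity-based `feature_importances_` attribute. The toy data, column names, and hyperparameters are assumptions for illustration, and the function in feature_importance.py may differ in its details.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 300

# Toy data: the target depends on "signal" only; the "noise_*" columns are irrelevant
X = pd.DataFrame(
    {
        "signal": rng.normal(size=n_rows),
        "noise_1": rng.normal(size=n_rows),
        "noise_2": rng.normal(size=n_rows),
    }
)
y = (X["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds one impurity-based score per column, normalised to sum to 1
importance_df = (
    pd.DataFrame({"feature": X.columns, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

Because these scores are normalised to sum to 1, a dataset with well over a thousand features will naturally show very small individual importances, so the ordering is more informative than the absolute values.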

### Specify classification or regression

Note: This is a legacy step and is only here because of the flexible way the underlying functions were set up. These functions can be run with the target being a price, or with the target being a binary increase or decrease. As such, they are suitable for both classification and regression tasks.
In this section we only use classification, so the regression flexibility here is redundant.
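
To make the distinction concrete, the sketch below shows how a target column might be chosen from this flag. The column names `binary_price_change_1` and `shifted_1` appear in the processed data above, but the mapping itself, the `"reg"` value, and the `pick_target_column` helper are assumptions rather than the actual logic in feature_importance.py.

```python
def pick_target_column(shift_period, reg_or_clas):
    """Return the assumed target column name for a shift period (hypothetical helper)."""
    if reg_or_clas == "clas":
        # Binary up/down label for classification
        return f"binary_price_change_{shift_period}"
    if reg_or_clas == "reg":
        # Shifted price for regression
        return f"shifted_{shift_period}"
    raise ValueError(f"reg_or_clas must be 'clas' or 'reg', got {reg_or_clas!r}")


print(pick_target_column(1, "clas"))  # binary_price_change_1
print(pick_target_column(1, "reg"))   # shifted_1
```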

In [17]:
reg_or_clas = "clas"

### Iterate through the intervals

In [19]:
for train_start_date, train_end_date in tqdm(train_intervals_list):
    print("-------------\nTime Interval:", train_start_date, "-->", train_end_date)
    time_interval_str = '{}_to_{}'.format(train_start_date, train_end_date)
    
    # define the train data for this date interval
    train_df = fi.get_subset_of_data(processed_data, train_start_date, train_end_date)

    # drop any features that contain NaNs
    subset_train_df = fi.drop_nan_cols(train_df)
    
    # iterate over the shift periods and select the features based on this period
    for shift_period in shift_periods_in_days_list:

        # get a dataframe of the data's features
        train_features, train_cols_to_drop, train_colname = fi.get_features(subset_train_df, shift_period, reg_or_clas)
        
        # get a numpy array of the feature we are trying to predict
        y = np.ravel(train_features[train_colname])
        
        # Use random forest to score feature importance
        feature_importance_df = fi.use_rand_forest_to_get_feature_importance(reg_or_clas, train_features.drop(columns=[train_colname]), y, random_seed)

        print(feature_importance_df.to_markdown())


-------------
Time Interval: 2015-08-17 --> 2020-12-31

Dropping the following 'Nan' columns:
 ['france_gov_debt_all', 'denmark_gov_debt_all', 'sweden_gov_debt_all'] 

|      | feature                                        |   importance |
|-----:|:-----------------------------------------------|-------------:|
|    0 | tweets3rsi                                     |  0.00257739  |
|    1 | marketcap3rsi                                  |  0.00239057  |
|    2 | sentbyaddress7mom                              |  0.0022682   |
|    3 | size3std                                       |  0.00226384  |
|    4 | hashrate7roc                                   |  0.00222628  |
|    5 | hashrate3rsi                                   |  0.0021889   |
|    6 | mining_profitability3roc                       |  0.00215473  |
|    7 | price3trx                                      |  0.00212785  |
|    8 | mining_profitability7rsi                       |  0.00211853  |
|    9 | confirmationtime3rsi

# --------------------------------------------------------------------------------------------------------
# 5. Analyse feature importance
# --------------------------------------------------------------------------------------------------------

The analysis below was performed manually, based on the results of the output table above. These results may not match the current table exactly due to minor discrepancies in the underlying data.
Nevertheless, the results below should still roughly align with those shown in the table above when the notebook is run.

### Pick out features I added

Each underlying feature is marked with a different symbol at the start of the line. This is because many of the most important features are the same underlying feature with a different technical indicator applied to it.

At the time this analysis was done, the table above was sorted in ascending order. It has since been changed to descending order, which reads better, so the rank numbers shown below run the opposite way around to the current table.
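
If a rank from the listing needs to be compared with the current table, a position can be flipped between the two orderings. The sketch below assumes 0-based positions (like the table's index) and a known total number of ranked features; `flip_rank` and the numbers used are hypothetical.

```python
def flip_rank(position, n_features):
    """Convert a 0-based position in one sort order to the opposite sort order."""
    if not 0 <= position < n_features:
        raise ValueError("position must lie in [0, n_features)")
    return n_features - 1 - position


# Hypothetical example with 10 ranked features
print(flip_rank(9, 10))  # 0: last in ascending order is first in descending order
print(flip_rank(0, 10))  # 9
```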

    !! 1449 - tweets90roc
    !! 1441 - tweets3rsi
    ^^ 1438 - marketcap7rsi
    !! 1433 - tweets7mom
    -- 1430 - google_trends3std
    -- 1421 - google_trends3trx
    ^^ 1418 - marketcap14rsi
    ^^ 1415 - marketcap3rsi
    !! 1410 - tweets3roc
    !! 1401 - tweets7rsi
    !! 1399 - tweets30roc
    !! 1396 - tweets14trx
    !! 1386 - tweets7roc
    !! 1382 - tweets14roc
    -- 1379 - google_trends3rsi
    ^^ 1368 - rank_coinbase 
    !! 1365 - tweets3trx
    ^^ 1364 - marketcap14mom
    ^^ 1363 - marketcap3mom
    ^^ 1344 - rank_bit_x
    ++ 1333 - coffee_futures_close
    -- 1329 - google_trends90roc
    ^^ 1326 - marketcap7roc
    !! 1324 - tweets14mom
    !! 1319 - tweets14rsi
    -- 1318 - google_trends14mom
    ** 1316 - french_cac_40_volume
    !! 1315 - tweets90mom
    ** 1309 - euro_stoxx_50_volume
    -- 1305 - google_trends7trx
    -- 1297 - google_trends14roc
    ++ 1292 - lean_hogs_futures_close
    -- 1285 - google_trends7mom
    -- 1284 - google_trends7rsi
    ++ 1273 - coffee_futures_low
    ++ 1270 - lumber_futures_volume
    !! 1269 - tweets3mom
    -- 1266 - google_trends7roc
    ++ 1254 - rough_rice_futures_volume
    ^^ 1252 - marketcap7trx
    !! 1251 - tweets30mom

### Summarise top 200 features

My features = 20.5 % of Top 200
 - 1%   = stock indices
 - 2.5% = commodity futures
 - 4.5% = other internal features I added
 - 5%   = features on google trend of "bitcoin"
 - 7.5% = features on tweets containing "bitcoin"
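
These percentages follow directly from counting the symbols in the listing above and dividing by 200. The sketch below reproduces the arithmetic. The counts were tallied from that listing, and the pairing of each symbol with a group is inferred from the feature names rather than stated explicitly.

```python
TOP_N = 200

# Number of listed features per group (symbol-to-group pairing inferred from the names)
counts = {
    "tweets containing 'bitcoin' (!!)": 15,
    "google trend of 'bitcoin' (--)": 10,
    "other internal features (^^)": 9,
    "commodity futures (++)": 5,
    "stock indices (**)": 2,
}

# Share of the top 200 taken up by each group, in percent
shares = {group: 100 * n / TOP_N for group, n in counts.items()}
for group, share in shares.items():
    print(f"{share:>4}% = {group}")

total_count = sum(counts.values())
total_share = sum(shares.values())
print(f"{total_count} features = {total_share}% of the top {TOP_N}")
```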