# Futures price prediction with classifiers 

This analysis checks whether classification algorithms can be useful for predicting the daily price direction of FW20 futures.

Algorithms used in the analysis:
- Logistic Regression
- Support Vector Classifier
- K-Nearest Neighbors
- Random Forest Classifier
- AdaBoost Classifier (with Decision Trees)
- Voting Classifier

Dependent variable:
- Price change in the current day (D) - from open to close (0 for downtrend, 1 for uptrend)

Predictors:
- Rate of return (%) in the previous day (D-1),
- Change in volume (%) in the previous day (D-1),
- Change in open interest (%) in the previous day (D-1).

The validation dataset consists of the most recent 2000 sessions. For each algorithm, 2000 models are trained, each on the 1000 sessions immediately preceding its test session, so every model has exactly one session in its test set. Algorithms are evaluated by accuracy: (true positives + true negatives) / number of predictions. The profit/loss (in points) is also printed for each algorithm. In this approach, each algorithm can be treated as a simple trading system in which buy/short-sell signals are generated at the opening of each session.
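
The two measures described above can be illustrated on a handful of made-up sessions. This is a minimal sketch rather than the notebook's own code: the function name `score_signals` and the sample prices are assumptions chosen for illustration.

```python
import numpy as np


def score_signals(open_px, close_px, predictions):
    """Return (accuracy, total points) for 0/1 direction predictions.

    A session counts as an uptrend (1) only when close > open. A correct
    call earns |close - open| points, a wrong call loses the same amount.
    """
    open_px = np.asarray(open_px, dtype=float)
    close_px = np.asarray(close_px, dtype=float)
    predictions = np.asarray(predictions)

    actual = (close_px > open_px).astype(int)
    correct = actual == predictions
    move = np.abs(close_px - open_px)
    points = np.where(correct, move, -move)
    return correct.mean(), points.sum()


# Made-up sessions: up 10, down 20, up 5, down 15.
accuracy, total_points = score_signals(
    open_px=[2300, 2310, 2290, 2295],
    close_px=[2310, 2290, 2295, 2280],
    predictions=[1, 1, 1, 0],
)
print(accuracy, total_points)  # 0.75 10.0
```

Three of the four calls are correct, and the single wrong call (the 20-point down session) removes most of the points gained, which is why accuracy and profit/loss are reported side by side.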

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
data_fw20 = pd.read_excel('stocks_data/FW20.xlsx', index_col='Data')

In [4]:
data_fw20.rename(columns={"FW20_Otwarcie":"Open", "FW20_Zamkniecie":"Close", 
                          "FW20_Wolumen":"Volume", "FW20_LOP":"OpenInt"}, inplace=True)

In [5]:
data_fw20.head()

Data,Open,Close,Volume,OpenInt
2000-01-03,1937,1920,4070,4955.0
2000-01-04,1865,1847,4255,4936.0
2000-01-05,1827,1811,5172,4694.0
2000-01-06,1802,1848,5220,4801.0
2000-01-07,1936,1982,5671,4918.0


## Data preprocessing

Create a function that computes the percentage change of the specified columns over the previous day (change from D-2 to D-1).

In [6]:
def change_previous_day(*args):
    for arg in args:
        data_fw20[arg +'_ret_-1'] = (data_fw20[arg].shift(periods=1) / 
                                      data_fw20[arg].shift(periods=2)-1)*100

Apply the change_previous_day() function to the Close, Volume and OpenInt columns.

In [7]:
change_previous_day('Close', 'Volume', 'OpenInt')

In [8]:
data_fw20.tail()

Data,Open,Close,Volume,OpenInt,Close_ret_-1,Volume_ret_-1,OpenInt_ret_-1
2018-12-19,2317,2350,39377,60337.0,1.577564,57.106927,3.689072
2018-12-20,2330,2320,41267,54762.0,1.3805,-9.946028,6.429479
2018-12-21,2313,2275,32056,56332.0,-1.276596,4.799756,-9.23977
2018-12-27,2295,2255,16556,48047.0,-1.939655,-22.320498,2.866952
2018-12-28,2262,2278,11235,47229.0,-0.879121,-48.352882,-14.707449
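
To see what the shifted columns contain, the same arithmetic can be run on a tiny made-up series. The sketch below uses toy values and a hypothetical column name `x`; it checks that `shift(1) / shift(2) - 1` equals the ordinary day-over-day percentage change lagged by one row, so the value stored for day D uses only data up to D-1.

```python
import pandas as pd

toy = pd.DataFrame({"x": [100.0, 110.0, 99.0, 108.9]})

# Same arithmetic as change_previous_day(): change from D-2 to D-1, in percent.
toy["x_ret_-1"] = (toy["x"].shift(1) / toy["x"].shift(2) - 1) * 100

# Equivalent formulation: ordinary percentage change, lagged by one row.
lagged = toy["x"].pct_change().shift(1) * 100

pd.testing.assert_series_equal(toy["x_ret_-1"], lagged, check_names=False)
print(toy)
```

The first two rows of the new column are NaN because no D-2 value exists for them, which is why rows with missing data have to be dropped later.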


Create a column with the direction of the current session (change from Open to Close): 1 for an uptrend, 0 otherwise (a session that closes unchanged is also labeled 0).

In [9]:
data_fw20['FW20_dir_o_c'] = np.where(data_fw20['Close'] > data_fw20['Open'], 1, 0)

Create final set of data for analysis

In [10]:
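# .copy() makes an independent frame, so the in-place dropna() below does not act on a slice of data_fw20
data_fw20_all = data_fw20[['Open','Close','Close_ret_-1','Volume_ret_-1','OpenInt_ret_-1','FW20_dir_o_c']].copy()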

Drop the few rows with missing data

In [11]:
data_fw20_all.dropna(inplace=True)

Shape of dataframe and names of columns

In [12]:
data_fw20_all.shape

(4751, 6)

In [13]:
data_fw20_all.columns

Index(['Open', 'Close', 'Close_ret_-1', 'Volume_ret_-1', 'OpenInt_ret_-1',
       'FW20_dir_o_c'],
      dtype='object')

In [14]:
data_fw20_all.tail()

Data,Open,Close,Close_ret_-1,Volume_ret_-1,OpenInt_ret_-1,FW20_dir_o_c
2018-12-19,2317,2350,1.577564,57.106927,3.689072,1
2018-12-20,2330,2320,1.3805,-9.946028,6.429479,0
2018-12-21,2313,2275,-1.276596,4.799756,-9.23977,0
2018-12-27,2295,2255,-1.939655,-22.320498,2.866952,0
2018-12-28,2262,2278,-0.879121,-48.352882,-14.707449,1


## Analysis

Each algorithm is evaluated in a loop over the 2000 test sessions. The results are collected in lists, and a summary with accuracy and profit/loss (in pts.) is printed at the end.
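
Because the same loop is repeated for every algorithm, it could be factored into one helper. The following is a sketch under stated assumptions, not the notebook's actual code: `walk_forward` and `MajorityClass` are hypothetical names, the data is random, and the stand-in model only mimics the `fit`/`predict` interface that scikit-learn estimators expose.

```python
import numpy as np


class MajorityClass:
    """Stand-in model with a scikit-learn-like fit/predict interface."""

    def fit(self, X, y):
        # Predict whichever label is more frequent in the training window
        # (ties go to 1).
        self.label_ = int(np.mean(y) >= 0.5)
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)


def walk_forward(model, X, y, train_period, test_period):
    """Refit on the train_period rows before each test row; test one row at a time."""
    n = len(X)
    if train_period + test_period > n:
        raise ValueError("not enough rows for the requested windows")
    predictions = []
    for i in range(n - test_period, n):
        model.fit(X[i - train_period:i], y[i - train_period:i])
        predictions.append(model.predict(X[i:i + 1])[0])
    return np.array(predictions)


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))       # three made-up predictors
y = rng.integers(0, 2, size=60)    # made-up 0/1 directions

preds = walk_forward(MajorityClass(), X, y, train_period=20, test_period=30)
accuracy = np.mean(preds == y[-30:])
print(len(preds), round(float(accuracy), 4))
```

The helper walks through the test sessions in chronological order, while the cells below count backwards from the last row with negative indices; both select the same training window for a given test session. A scikit-learn estimator could be passed in place of `MajorityClass()`.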

### 1. Logistic regression

Stronger regularization (reducing the 'C' parameter to 0.001) gave better accuracy than the default parameters.

In [15]:
train_period = 1000
test_period = 2000
date, predictions, y_actual, pred_proba, results = [], [], [], [], []
model = LogisticRegression(C=0.001)

for t in range(test_period):
    X_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['Close_ret_-1', 'Volume_ret_-1', 
                                                                      'OpenInt_ret_-1']]
    y_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['FW20_dir_o_c']]
    X_test = data_fw20_all.iloc[-t - 1][['Close_ret_-1', 'Volume_ret_-1', 'OpenInt_ret_-1']]
    y_test = data_fw20_all.iloc[-t - 1]['FW20_dir_o_c']
    
    model.fit(X_train, y_train)
    prediction = model.predict(np.array(X_test).reshape(1,-1))
    prediction_proba = model.predict_proba(np.array(X_test).reshape(1,-1))
    date.append(np.datetime64(data_fw20_all.index[-t -1], 'D'))
    y_actual.append(int(y_test))
    predictions.append(prediction[0])
    pred_proba.append(prediction_proba)
    
    result = abs(data_fw20_all.iloc[-t - 1]['Close'] - data_fw20_all.iloc[-t - 1]['Open'])
    results.append(result if y_test == prediction else -result)

accuracy = 0
for i in range(len(predictions)):
    if y_actual[i] == predictions[i]:
        accuracy+=1
print('Model summary')
print('Accuracy: {}'.format(round(accuracy / len(predictions),4)))
print('Result (pts.): {}'.format(sum(results)))
print('Average result per 1 session: {}'.format(sum(results) / test_period))
print('*'*100)

# for i in range(len(predictions)):
#     print(date[i], '\t', 'actual:', y_actual[i], predictions[i], ':prediction', 'preds_proba:', pred_proba[i],
#          '\t', 'result:', results[i])

Model summary
Accuracy: 0.534
Result (pts.): 2366.0
Average result per 1 session: 1.183
****************************************************************************************************


### 2. Support Vector Classification

Stronger regularization gave slightly better accuracy than the default parameters, but all predictions were then one-sided, indicating a downtrend (0). Hence no additional regularization was applied and the default `C=1.0` was kept.

The SVM classifier requires rescaling of the data (with StandardScaler, fitted separately on each training dataset).

In [16]:
train_period = 1000
test_period = 2000
date, predictions, y_actual, results = [], [], [], []
model = SVC()
scaler = StandardScaler()

for t in range(test_period):
    X_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['Close_ret_-1', 'Volume_ret_-1', 
                                                                      'OpenInt_ret_-1']]
    y_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['FW20_dir_o_c']]
    X_test = data_fw20_all.iloc[-t - 1][['Close_ret_-1', 'Volume_ret_-1', 'OpenInt_ret_-1']]
    y_test = data_fw20_all.iloc[-t - 1]['FW20_dir_o_c']
    
    X_train_sc = scaler.fit_transform(X_train)
    X_test_sc = scaler.transform(np.array(X_test).reshape(1,-1))
    
    model.fit(X_train_sc, y_train)
    prediction = model.predict(np.array(X_test_sc).reshape(1,-1))
    date.append(np.datetime64(data_fw20_all.index[-t -1], 'D'))
    y_actual.append(int(y_test))
    predictions.append(prediction[0])
    
    result = abs(data_fw20_all.iloc[-t - 1]['Close'] - data_fw20_all.iloc[-t - 1]['Open'])
    results.append(result if y_test == prediction else -result)

accuracy = 0
for i in range(len(predictions)):
    if y_actual[i] == predictions[i]:
        accuracy+=1
print('Model summary')
print('Accuracy: {}'.format(round(accuracy / len(predictions),4)))
print('Result (pts.): {}'.format(sum(results)))
print('Average result per 1 session: {}'.format(sum(results) / test_period))
print('*'*100)

# for i in range(len(predictions)):
#     print(date[i], '\t', 'actual:', y_actual[i], predictions[i], ':prediction', '\t', 'result:', results[i])

Model summary
Accuracy: 0.5315
Result (pts.): 1758.0
Average result per 1 session: 0.879
****************************************************************************************************
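
One way to keep the scaling step tied to each training window is to wrap the scaler and the classifier in a scikit-learn pipeline, so that `fit` estimates the scaling parameters from the training rows only. This is a sketch on random data, not the notebook's code; the variable names and array shapes are assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=20.0, size=(100, 3))  # made-up predictors
y_train = np.tile([0, 1], 50)                             # made-up labels, both classes present
X_test = rng.normal(loc=5.0, scale=20.0, size=(1, 3))     # the single test session

# fit() standardizes with the training mean/std, then trains the SVC;
# predict() reuses those training statistics on the test row.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
prediction = model.predict(X_test)

# make_pipeline names each step after its lowercased class name.
scaler = model.named_steps["standardscaler"]
print(prediction.shape, np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

Refitting the whole pipeline inside the loop would reproduce the per-window scaling done above with separate `fit_transform` and `transform` calls.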


### 3. K-Nearest Neighbors

After some optimization, the "number of neighbors" parameter was set to 200.

In [19]:
train_period = 1000
test_period = 2000
date, predictions, y_actual, pred_proba, results = [], [], [], [], []
model = KNeighborsClassifier(n_neighbors = 200)

for t in range(test_period):
    X_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['Close_ret_-1', 'Volume_ret_-1', 
                                                                      'OpenInt_ret_-1']]
    y_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['FW20_dir_o_c']]
    X_test = data_fw20_all.iloc[-t - 1][['Close_ret_-1', 'Volume_ret_-1', 'OpenInt_ret_-1']]
    y_test = data_fw20_all.iloc[-t - 1]['FW20_dir_o_c']

    model.fit(X_train, y_train)
    prediction = model.predict(np.array(X_test).reshape(1,-1))
    prediction_proba = model.predict_proba(np.array(X_test).reshape(1,-1))
    date.append(np.datetime64(data_fw20_all.index[-t -1], 'D'))
    y_actual.append(int(y_test))
    predictions.append(prediction[0])
    pred_proba.append(prediction_proba)
    
    result = abs(data_fw20_all.iloc[-t - 1]['Close'] - data_fw20_all.iloc[-t - 1]['Open'])
    results.append(result if y_test == prediction else -result)

accuracy = 0
for i in range(len(predictions)):
    if y_actual[i] == predictions[i]:
        accuracy+=1
print('Model summary')
print('Accuracy: {}'.format(round(accuracy / len(predictions),4)))
print('Result (pts.): {}'.format(sum(results)))
print('Average result per 1 session: {}'.format(sum(results) / test_period))
print('*'*100)

# for i in range(len(predictions)):
#     print(date[i], '\t', 'actual:', y_actual[i], predictions[i], ':prediction', 'preds_proba:', pred_proba[i],
#          '\t', 'result:', results[i])

Model summary
Accuracy: 0.527
Result (pts.): 2488.0
Average result per 1 session: 1.244
****************************************************************************************************


### 4. Random Forest Classifier

The number of trees in each forest is set to 100 and the maximum depth of each tree to 3.

In [20]:
train_period = 1000
test_period = 2000
date, predictions, y_actual, pred_proba, results = [], [], [], [], []
model = RandomForestClassifier(n_estimators=100, max_depth=3)

for t in range(test_period):
    X_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['Close_ret_-1', 'Volume_ret_-1', 
                                                                      'OpenInt_ret_-1']]
    y_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['FW20_dir_o_c']]
    X_test = data_fw20_all.iloc[-t - 1][['Close_ret_-1', 'Volume_ret_-1', 'OpenInt_ret_-1']]
    y_test = data_fw20_all.iloc[-t - 1]['FW20_dir_o_c']

    model.fit(X_train, y_train)
    prediction = model.predict(np.array(X_test).reshape(1,-1))
    prediction_proba = model.predict_proba(np.array(X_test).reshape(1,-1))
    date.append(np.datetime64(data_fw20_all.index[-t -1], 'D'))
    y_actual.append(int(y_test))
    predictions.append(prediction[0])
    pred_proba.append(prediction_proba)
    
    result = abs(data_fw20_all.iloc[-t - 1]['Close'] - data_fw20_all.iloc[-t - 1]['Open'])
    results.append(result if y_test == prediction else -result)

accuracy = 0
for i in range(len(predictions)):
    if y_actual[i] == predictions[i]:
        accuracy+=1
print('Model summary')
print('Accuracy: {}'.format(round(accuracy / len(predictions),4)))
print('Result (pts.): {}'.format(sum(results)))
print('Average result per 1 session: {}'.format(sum(results) / test_period))
print('*'*100)

# for i in range(len(predictions)):
#     print(date[i], '\t', 'actual:', y_actual[i], predictions[i], ':prediction', 'preds_proba:', pred_proba[i],
#          '\t', 'result:', results[i])

Model summary
Accuracy: 0.538
Result (pts.): 2878.0
Average result per 1 session: 1.439
****************************************************************************************************


### 5. AdaBoost Classifier (with Decision Tree)

The number of estimators is set to 50 and the maximum depth of each tree to 1 (a "decision stump").

In [21]:
train_period = 1000
test_period = 2000
date, predictions, y_actual, pred_proba, results = [], [], [], [], []
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)

for t in range(test_period):
    X_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['Close_ret_-1', 'Volume_ret_-1', 
                                                                      'OpenInt_ret_-1']]
    y_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['FW20_dir_o_c']]
    X_test = data_fw20_all.iloc[-t - 1][['Close_ret_-1', 'Volume_ret_-1', 'OpenInt_ret_-1']]
    y_test = data_fw20_all.iloc[-t - 1]['FW20_dir_o_c']

    model.fit(X_train, y_train)
    prediction = model.predict(np.array(X_test).reshape(1,-1))
    prediction_proba = model.predict_proba(np.array(X_test).reshape(1,-1))
    date.append(np.datetime64(data_fw20_all.index[-t -1], 'D'))
    y_actual.append(int(y_test))
    predictions.append(prediction[0])
    pred_proba.append(prediction_proba)

    result = abs(data_fw20_all.iloc[-t - 1]['Close'] - data_fw20_all.iloc[-t - 1]['Open'])
    results.append(result if y_test == prediction else -result)

accuracy = 0
for i in range(len(predictions)):
    if y_actual[i] == predictions[i]:
        accuracy+=1
print('Model summary')
print('Accuracy: {}'.format(round(accuracy / len(predictions),4)))
print('Result (pts.): {}'.format(sum(results)))
print('Average result per 1 session: {}'.format(sum(results) / test_period))
print('*'*100)

# for i in range(len(predictions)):
#     print(date[i], '\t', 'actual:', y_actual[i], predictions[i], ':prediction', 'preds_proba:', pred_proba[i],
#          '\t', 'result:', results[i])

Model summary
Accuracy: 0.514
Result (pts.): 332.0
Average result per 1 session: 0.166
****************************************************************************************************


### 6. Voting Classifier

Finally, a Voting Classifier was built from the three estimators with the best results in points: Logistic Regression, K-Nearest Neighbors and Random Forest. Majority (hard) voting was applied.
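
With `voting='hard'` and three binary classifiers, the ensemble outputs the label chosen by at least two of them. The toy arrays below are made up to show that rule; they are not model outputs from this notebook.

```python
import numpy as np

# Rows: hypothetical 0/1 predictions of three classifiers for five sessions.
votes = np.array([
    [1, 0, 1, 0, 1],   # e.g. logistic regression
    [1, 1, 0, 0, 1],   # e.g. k-nearest neighbors
    [0, 1, 1, 0, 1],   # e.g. random forest
])

# Majority rule: predict 1 when at least two of the three vote 1.
majority = (votes.sum(axis=0) >= 2).astype(int)
print(majority)  # [1 1 1 0 1]
```

With an odd number of voters and two classes, a tie cannot occur.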

In [22]:
train_period = 1000
test_period = 2000
date, predictions, y_actual, results = [], [], [], []

lr_clf = LogisticRegression(C=0.001)
knn_clf = KNeighborsClassifier(n_neighbors=200)
rfc_clf = RandomForestClassifier(n_estimators=100, max_depth=3)
voting_clf = VotingClassifier(
    estimators = [('lr', lr_clf),('knn_clf', knn_clf),('rfc_clf', rfc_clf)],
    voting = 'hard')

for t in range(test_period):
    X_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['Close_ret_-1', 'Volume_ret_-1', 
                                                                      'OpenInt_ret_-1']]
    y_train = data_fw20_all.iloc[(-train_period - t - 1) : (-t - 1)][['FW20_dir_o_c']]
    X_test = data_fw20_all.iloc[-t - 1][['Close_ret_-1', 'Volume_ret_-1', 'OpenInt_ret_-1']]
    y_test = data_fw20_all.iloc[-t - 1]['FW20_dir_o_c']

    voting_clf.fit(X_train, y_train)
    prediction = voting_clf.predict(np.array(X_test).reshape(1,-1))
    date.append(np.datetime64(data_fw20_all.index[-t -1], 'D'))
    y_actual.append(int(y_test))
    predictions.append(prediction[0])
    
    result = abs(data_fw20_all.iloc[-t - 1]['Close'] - data_fw20_all.iloc[-t - 1]['Open'])
    results.append(result if y_test == prediction else -result)

accuracy = 0
for i in range(len(predictions)):
    if y_actual[i] == predictions[i]:
        accuracy+=1
print('Model summary')
print('Accuracy: {}'.format(round(accuracy / len(predictions),4)))
print('Result (pts.): {}'.format(sum(results)))
print('Average result per 1 session: {}'.format(sum(results) / test_period))
print('*'*100)

# for i in range(len(predictions)):
#     print(date[i], '\t', 'actual:', y_actual[i], predictions[i], ':prediction', '\t', 'result:', results[i])

Model summary
Accuracy: 0.5335
Result (pts.): 2620.0
Average result per 1 session: 1.31
****************************************************************************************************


## Conclusions

Simple classifiers may give a slight edge over the market. All of the tested algorithms achieved accuracy above 50%, with Random Forest the highest (about 54%). The margins are small, however, and the results in points do not include transaction costs. The Voting Classifier did not improve on the best single estimator (Random Forest).

Suggestions for further analysis:
- wider interval (week or even month) - daytrading implies high transaction costs,
- application of other predictors - for example rates of return from other correlated markets (SPX, NDX, DAX, NKX),
- reducing the number of signals by selecting predictions with highest probability.
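
The last suggestion can be sketched as a filter on predicted probabilities: trade only when the model's probability for its chosen class reaches a threshold. The probabilities, point moves and the 0.55 threshold below are made-up values for illustration; a real threshold would have to be chosen on data not used for evaluation.

```python
import numpy as np

# Hypothetical P(uptrend) for six sessions, e.g. predict_proba(...)[:, 1].
proba_up = np.array([0.52, 0.61, 0.48, 0.35, 0.57, 0.50])
actual = np.array([1, 1, 1, 0, 0, 0])                 # made-up true directions
move = np.array([10.0, 20.0, 5.0, 15.0, 8.0, 3.0])    # made-up |close - open|

threshold = 0.55  # assumed confidence cut-off
prediction = (proba_up > 0.5).astype(int)
confidence = np.maximum(proba_up, 1 - proba_up)
take_trade = confidence >= threshold

# Same scoring rule as in the analysis: +move if correct, -move if wrong.
points = np.where(prediction == actual, move, -move)
print(take_trade.sum(), points[take_trade].sum(), points.sum())  # 3 27.0 35.0
```

In this toy case the filter halves the number of trades, which would reduce transaction costs, but it also skips some profitable sessions, so the trade-off has to be measured rather than assumed.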