Because this is a model I am actually using to try to make accurate bets in the futures market, I have masked some of my data from public view. However, I would like to share my thought process, and explain what I did to get my results.

For context, I was examining data from the CFTC's Commitments of Traders Report to find out for myself whether the legends of the report's predictive accuracy were true. Could mining this report for patterns and correlations allow a financial data analyst to make predictions with any kind of accuracy? I spent some time going through the report and testing the covariance between certain variables and the next week's change in price. I found that there does seem to be a connection between the two: in blind testing on data my model had not seen, it predicted the direction of the next week's price change (up or down) with 64% accuracy.

After achieving this result, I used Principal Component Analysis to decompose the data into the six components you see below (the columns numbered 0 through 5). There's not a lot of interpretability there, so if you would like to predict price direction with the report yourself, the data is out there. I got my price data using the Chicago Mercantile Exchange API at quandl.com (a terrific website for financial data streams), and you can find the Commitments of Traders Report historical data at: http://www.cftc.gov/MarketReports/CommitmentsofTraders. They have all the data preserved in CSV.
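For readers who want to try the decomposition step on their own data, here is a minimal sketch of how it could look with scikit-learn. The input array is random stand-in data (my real features are masked), and the choice to standardize first is an assumption, not necessarily what I did.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the masked report features: 200 weeks x 12 columns.
rng = np.random.default_rng(0)
raw_features = rng.normal(size=(200, 12))

# Standardize first so that no single column dominates the components
# (an assumption made for this sketch).
scaled = StandardScaler().fit_transform(raw_features)

# Keep six components, matching the six numbered columns in model_df.
pca = PCA(n_components=6)
components = pca.fit_transform(scaled)

print(components.shape)
# Share of the total variance captured by each component, largest first.
print(pca.explained_variance_ratio_.round(3))
```

The `explained_variance_ratio_` attribute is a quick way to judge how much information the retained components keep.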

In [6]:
import pandas as pd
model_df = pd.read_csv('pca_transformed_xvars.csv')
y_vars = pd.read_csv('price_changes.csv')

In [19]:
model_df.head() #here are my decomposed X variables

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5
0,0,2.163114,1.079553,-0.126914,0.270626,0.098197,-0.328149
1,1,3.667378,0.056017,0.050201,-1.076702,-0.176977,-0.19707
2,2,2.96346,0.805376,-0.768775,-0.293738,0.057174,0.110143
3,3,1.464611,0.59083,-0.156438,0.795129,-0.548107,-0.064407
4,4,1.067147,0.149905,0.102411,-0.476938,0.632868,0.788675


In [20]:
y_vars.head() #this is my price change data. The two categories are 'Down' and 'Up' as you will see in my models below.

Unnamed: 0.1,Unnamed: 0,0
0,0,1
1,1,1
2,2,0
3,3,0
4,4,0


First I used a random forest model, which is essentially a bootstrap-aggregated decision tree model. Each tree carves the data up into categories, and then subcategories, and then sub-subcategories, and uses these groupings to predict on unseen data. The forest repeats this on many bootstrap resamples of the training data and averages the trees' votes, so that the mistakes of any single tree tend to cancel out.
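To make the bootstrap-aggregation idea concrete, here is a minimal hand-rolled sketch on synthetic data. It shows only the resample-and-average part. A real random forest also randomizes the features considered at each split, which this sketch leaves out. The data and all variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic two-class data, purely illustrative.
X_demo = rng.normal(size=(300, 6))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

n_trees = 25
trees = []
for i in range(n_trees):
    # Bootstrap: sample rows with replacement, same size as the original set.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeClassifier(max_depth=3, random_state=i)
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Aggregate: average the per-tree class probabilities, then pick the larger one.
avg_proba = np.mean([t.predict_proba(X_demo) for t in trees], axis=0)
bagged_pred = avg_proba.argmax(axis=1)
print("training accuracy of the bagged trees:", (bagged_pred == y_demo).mean())
```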

In [35]:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split
from sklearn.ensemble import RandomForestClassifier

y = y_vars['0']  # 0 = price down, 1 = price up

X = model_df[['0', '1', '2', '3', '4', '5']]  # the six PCA components

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.33)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)

rt = RandomForestClassifier(n_jobs=-1, n_estimators=100)
param_grid = {
    'max_depth': [None, 1, 3, 5, 7, 10]
}
CV_rt = GridSearchCV(estimator=rt, param_grid=param_grid, cv=5)
CV_rt.fit(X_train, Y_train)
print(CV_rt.best_params_)
# Note: rt is fit here with its default max_depth, not the best value found above
rt.fit(X_train, Y_train)

Y_pred = rt.predict_proba(X_test)  # class probabilities, columns ordered [down, up]
Y_plot = rt.predict(X_test)        # hard 0/1 predictions

print(rt.score(X_test, Y_test))
s = cross_val_score(rt, X_train, Y_train, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

print(Y_pred)


{'max_depth': 1}
0.574850299401
Random Forest Score:	0.513 ± 0.029
[[ 0.95        0.05      ]
 [ 0.905       0.095     ]
 [ 0.86        0.14      ]
 [ 1.          0.        ]
 [ 0.87        0.13      ]
 [ 0.75        0.25      ]
 [ 0.99        0.01      ]
 [ 0.89666667  0.10333333]
 [ 0.94        0.06      ]
 [ 0.3         0.7       ]
 [ 0.87        0.13      ]
 [ 0.93        0.07      ]
 [ 0.12        0.88      ]
 [ 0.99        0.01      ]
 [ 0.87        0.13      ]
 [ 0.49452381  0.50547619]
 [ 0.64        0.36      ]
 [ 0.48333333  0.51666667]
 [ 0.76        0.24      ]
 [ 0.07488095  0.92511905]
 [ 1.          0.        ]
 [ 0.87        0.13      ]
 [ 0.59        0.41      ]
 [ 0.13654762  0.86345238]
 [ 0.43        0.57      ]
 [ 0.75333333  0.24666667]
 [ 0.87        0.13      ]
 [ 0.21        0.79      ]
 [ 0.9         0.1       ]
 [ 0.58        0.42      ]
 [ 0.98        0.02      ]
 [ 0.77        0.23      ]
 [ 0.97        0.03      ]
 [ 0.585       0.415     ]
 [ 0.11        

The random forest has a cross-validated accuracy of 0.51 (0.57 on the held-out test set). This is not much better than random guessing.
However, it becomes an important part of the ensemble model later on.
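One detail worth noting about the cell above: the grid search reports a best `max_depth`, but the model that gets fit and scored afterwards is the untuned `rt`. A minimal sketch of scoring the tuned model instead, on synthetic data with hypothetical variable names, might look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in data, purely illustrative.
X_demo = rng.normal(size=(300, 6))
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [None, 1, 3, 5]},
    cv=5,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ has already been refit
# on all of X_tr using the best parameters.
tuned_rf = search.best_estimator_
print(search.best_params_)
print(tuned_rf.score(X_te, y_te))
```

Whether the tuned model would have scored better on my data is not something this sketch can say. It only shows where the tuned estimator lives.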

For each week, the random forest model predicts whether the price will go down or up. Each of the following models does the same.

In [36]:
RT_pp = pd.DataFrame(rt.predict_proba(X_test), columns=['RF Price_Down','RF Price_Up'])
print(RT_pp.iloc[0:10])

   RF Price_Down  RF Price_Up
0       0.950000     0.050000
1       0.905000     0.095000
2       0.860000     0.140000
3       1.000000     0.000000
4       0.870000     0.130000
5       0.750000     0.250000
6       0.990000     0.010000
7       0.896667     0.103333
8       0.940000     0.060000
9       0.300000     0.700000


In [37]:
from sklearn.naive_bayes import GaussianNB
GNBmodel = GaussianNB().fit(X_train, Y_train)
GNB_pp = pd.DataFrame(GNBmodel.predict_proba(X_test), columns=['NB Price_Down','NB Price_Up'])
GNBmodel.score(X_test, Y_test)

0.61676646706586824

Above I use a Naive Bayes model to make predictions using the same data. Naive Bayes assumes the variables are independent of one another given the class, and uses Bayes' theorem to combine each variable's contribution into the probability of price going up or down. As the probabilities printed below show, its predictions differ from the random forest's. This model is more accurate overall: 0.62 on the test set, against 0.57 for the random forest (and 0.51 for the forest's cross-validated score).
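To show what the independence assumption means in practice, here is a small sketch that rebuilds the Gaussian Naive Bayes posterior by hand and compares it with scikit-learn's. The data is synthetic and `manual_posterior` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Synthetic stand-in data, purely illustrative.
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=200) > 0).astype(int)

model = GaussianNB().fit(X_demo, y_demo)

def manual_posterior(x):
    """Posterior over classes (0, 1) for one row, computed from scratch."""
    log_post = []
    for c in (0, 1):
        Xc = X_demo[y_demo == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        # Independence assumption: the joint log-likelihood is just the
        # sum of one Gaussian log-density per feature.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        log_prior = np.log(len(Xc) / len(X_demo))
        log_post.append(log_prior + log_lik)
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    return post / post.sum()

print(manual_posterior(X_demo[0]))
print(model.predict_proba(X_demo[:1])[0])
```

The two printed rows should agree to several decimal places. scikit-learn adds a tiny smoothing term to the variances, so the match is close rather than exact.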

In [38]:
print(GNB_pp.iloc[0:10])

   NB Price_Down  NB Price_Up
0       0.562408     0.437592
1       0.578966     0.421034
2       0.484829     0.515171
3       0.641979     0.358021
4       0.456699     0.543301
5       0.632247     0.367753
6       0.632161     0.367839
7       0.333357     0.666643
8       0.704619     0.295381
9       0.741286     0.258714


In [39]:
from sklearn.neighbors import KNeighborsClassifier
knnmodel = KNeighborsClassifier(n_neighbors=7).fit(X_train, Y_train)
knn_pp = pd.DataFrame(knnmodel.predict_proba(X_test), columns=['KNN Price_Down','KNN Price_Up'])
knnmodel.score(X_test, Y_test)

0.60479041916167664

The K-Nearest Neighbors model above is comparable to Naive Bayes. This model keeps all the training data in memory, and then for any given point, it finds the K training points nearest to it (in this case 7 points) and uses the share of them in each class as the likelihood that the point is a "price up" vs. a "price down" point.
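The neighbor-counting step is simple enough to reproduce by hand. The sketch below does so on synthetic data (all names are illustrative) and checks it against scikit-learn, assuming the default settings of Euclidean distance and equal weight for each neighbor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data, purely illustrative.
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)
query = rng.normal(size=6)  # one new point to classify

k = 7
# Euclidean distance from the query to every training point.
distances = np.linalg.norm(X_demo - query, axis=1)
nearest = np.argsort(distances)[:k]
# Probability of "up" is the fraction of the k nearest labels equal to 1.
p_up = y_demo[nearest].mean()
print("manual P(up): ", p_up)

knn = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
sk_p_up = knn.predict_proba(query.reshape(1, -1))[0, 1]
print("sklearn P(up):", sk_p_up)
```

Because the probability is a count out of 7, it can only take the values 0/7, 1/7, and so on up to 7/7.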

In [40]:
from sklearn import linear_model
# L2-penalized logistic regression; every other argument is left at its default
lr = linear_model.LogisticRegression(C=1.0, solver='liblinear')
lr.fit(X_train, Y_train)
lr_pp = pd.DataFrame(lr.predict_proba(X_test), columns=['LR Price_Down','LR Price_Up'])
lr.score(X_test, Y_test)

0.59281437125748504

Simple logistic regression also performs in the same ballpark, at 0.59 on the test set (for non-stats people reading this, logistic regression is among the most commonly used models for predicting binary classes). All the models have similar accuracy overall, but when predicting any given week, there is a large spread in their probabilities. The models seldom agree with one another.

Below, I create an ensemble model that averages the four models' predicted probabilities, so that their differing errors partly cancel out and the collective score improves.

In [41]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

ensemble_df = pd.DataFrame()
ensemble_df['RF Price_Down'] = RT_pp['RF Price_Down']
ensemble_df['RF Price_Up'] = RT_pp['RF Price_Up']
ensemble_df['NB Price_Down'] = GNB_pp['NB Price_Down']
ensemble_df['NB Price_Up'] = GNB_pp['NB Price_Up']
ensemble_df['KNN Price_Down'] = knn_pp['KNN Price_Down']
ensemble_df['KNN Price_Up'] = knn_pp['KNN Price_Up']
ensemble_df['LR Price_Down'] = lr_pp['LR Price_Down']
ensemble_df['LR Price_Up'] = lr_pp['LR Price_Up']

ensemble_df['Ensemble Price_Down'] = (ensemble_df['RF Price_Down'] + ensemble_df['NB Price_Down'] +
                                      ensemble_df['KNN Price_Down'] + ensemble_df['LR Price_Down'])/4

ensemble_df['Ensemble Price_Up'] = (ensemble_df['RF Price_Up'] + ensemble_df['NB Price_Up'] +
                                      ensemble_df['KNN Price_Up'] + ensemble_df['LR Price_Up'])/4
ensemble_df['pred_class_thresh50'] = [1 if x >= 0.5 else 0 for x in ensemble_df['Ensemble Price_Up'].values]

confusion = np.array(confusion_matrix(Y_test, ensemble_df['pred_class_thresh50']))
print(confusion)
print(classification_report(Y_test, ensemble_df['pred_class_thresh50']))

[[66 25]
 [35 41]]
             precision    recall  f1-score   support

          0       0.65      0.73      0.69        91
          1       0.62      0.54      0.58        76

avg / total       0.64      0.64      0.64       167
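The headline numbers can be checked by hand from the confusion matrix above. The sketch below only redoes the arithmetic on the four printed counts. The majority-class baseline is an extra reference point computed from the same counts.

```python
import numpy as np

# Confusion matrix as printed above: rows are actual classes (0 = down, 1 = up),
# columns are predicted classes.
confusion = np.array([[66, 25],
                      [35, 41]])

accuracy = np.trace(confusion) / confusion.sum()         # (66 + 41) / 167
precision_up = confusion[1, 1] / confusion[:, 1].sum()   # 41 / (25 + 41)
recall_up = confusion[1, 1] / confusion[1, :].sum()      # 41 / (35 + 41)
# Accuracy of always predicting the more common class (91 "down" weeks of 167).
majority_baseline = confusion.sum(axis=1).max() / confusion.sum()

print("accuracy:          {:.3f}".format(accuracy))
print("precision (up):    {:.3f}".format(precision_up))
print("recall (up):       {:.3f}".format(recall_up))
print("majority baseline: {:.3f}".format(majority_baseline))
```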



64% using the ensemble method! Averaging the four models lifts accuracy by roughly 2 to 7 percentage points over each individual model's test score (0.57 to 0.62). This may not sound great, but in the futures market, any accuracy score above 50% means that you're winning more often than you're losing, which is the best we can ask for.
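For anyone who would rather not assemble the averaging by hand, scikit-learn's `VotingClassifier` with `voting="soft"` does the same kind of equal-weight averaging of predicted probabilities. The sketch below runs on synthetic data with illustrative settings. It is an alternative packaging of the idea, not the code that produced my results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data, purely illustrative.
X_demo = rng.normal(size=(400, 6))
y_demo = (X_demo[:, 0] + rng.normal(size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("lr", LogisticRegression(solver="liblinear")),
    ],
    voting="soft",  # average predict_proba across the four models
)
voter.fit(X_tr, y_tr)

ensemble_proba = voter.predict_proba(X_te)
# The same average computed by hand from the fitted sub-models.
manual_proba = np.mean([m.predict_proba(X_te) for m in voter.estimators_], axis=0)
print(np.allclose(ensemble_proba, manual_proba))
print(voter.score(X_te, y_te))
```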