# Numerai Stock Market Prediction 

## About
Numerai is a hedge fund that invests its capital by hosting weekly competitions to predict market movements.  Competitors submit predictions, and Numerai ranks them by their accuracy on historical data and by their novelty compared to other predictions.  The competition test data also contains future events whose outcomes are not yet known.  Numerai ensembles the top competitor predictions, selected on the historical known data, to determine where to allocate capital for those future events.


## Problem Discussion

Numerai provides a weekly dataset of anonymized features used to predict a binary 'yes-no' target.  Numerai rewards predictions based both on the model's accuracy (logloss metric) and on the model's novelty (based on its correlation with other models' predictions).  The logic behind rewarding novel models is that uncorrelated models provide new information to Numerai's meta model.  When novel models are combined with other top-performing models, the end result should be a still more accurate model.
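To make the idea of correlation between submissions concrete, here is a minimal sketch that compares one prediction vector against two others using the Pearson correlation. Numerai's exact novelty formula is not given here, so the choice of Pearson correlation, the function name `pearson`, and all of the numbers are assumptions made for illustration only.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = float(len(a))
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up probability predictions for the same five rows.
mine = [0.52, 0.48, 0.55, 0.47, 0.51]
near_copy = [0.53, 0.47, 0.56, 0.46, 0.52]   # moves in step with `mine`
different = [0.49, 0.51, 0.50, 0.53, 0.47]   # moves independently of `mine`

corr_near = pearson(mine, near_copy)
corr_diff = pearson(mine, different)
```

Under this reading, a submission like `different` would count as more novel relative to `mine` than `near_copy` does, because its correlation is lower.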

## Goal

Numerai allocates its capital to models that place in the top 100 out of the typically more than 600 competitors.  The goal of this project is to produce a model that scores in the top 100, meaning that the model actually has real money deployed.  Over the past few weeks, placing in the top 100 on the leaderboard has required a logloss score of ~.689xx.  Scoring well on the novelty metric would require knowing the predictions of other submissions, and this data is not provided.

In [1]:
import numpy as np
import pandas as pd
from tsne import bh_sne
from sklearn import decomposition
from sklearn.preprocessing import PolynomialFeatures
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import log_loss

Import the raw data provided weekly by [Numerai](https://numer.ai/about)

In [2]:
train = pd.read_csv("../numerai_datasets/numerai_training_data.csv")
combined = pd.read_csv("../numerai_datasets/numerai_tournament_data.csv")

First, for reasons that the Numerai competitor community has not yet worked out, the testing data contains a copy of the training data.  In this step the duplicate training rows contained in the testing data will be removed.

Also, the prediction target variable will be separated from the predictive features.

In [3]:
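# Separate training data from test data.
# The left merge joins on the shared feature columns; rows flagged 'left_only'
# appear only in the tournament file.  The boolean mask lines up with `combined`
# as long as the merge does not duplicate any rows.
merged = combined.merge(train, how='left', indicator=True)
test = combined[merged['_merge'] == 'left_only'].copy()

# Separate the target data and test_id to create clean target and features
target = train['target']
t_id = test['t_id']
t_id.reset_index(drop=True, inplace=True)

del train['target']
del test['t_id']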

print target.shape
print train.shape

(96320,)
(96320, 21)
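The duplicate-removal step relies on the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column recording where each row was found. The toy example below sketches the same pattern; the frame names and values are invented for illustration and are not taken from the Numerai files.

```python
import pandas as pd

# Toy stand-ins for the tournament and training files (values are made up).
combined_toy = pd.DataFrame({
    "t_id": [1, 2, 3, 4],
    "feature1": [0.1, 0.2, 0.3, 0.4],
    "feature2": [0.9, 0.8, 0.7, 0.6],
})
train_toy = pd.DataFrame({
    "feature1": [0.1, 0.3],
    "feature2": [0.9, 0.7],
    "target": [1, 0],
})

# With no `on=` argument, merge joins on the columns both frames share
# (feature1 and feature2 here).  `_merge` is 'both' for rows that also
# appear in train_toy and 'left_only' for rows unique to combined_toy.
merged_toy = combined_toy.merge(train_toy, how="left", indicator=True)

# Keep only the rows that are not copies of training rows.
is_new = (merged_toy["_merge"] == "left_only").values
test_toy = combined_toy[is_new]
```

Here `test_toy` keeps the rows with `t_id` 2 and 4, the two rows that have no match in `train_toy`.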


## Initial Model
To establish a baseline, consider a naive model that treats both outcomes as equally likely.  Predicting a probability of 0.5 for every row produces a logloss of ln 2, or ~.69315.  The first real model that we will explore is a simple logistic regression on the provided data, without any alterations.
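The baseline figure can be checked by hand. The sketch below implements binary logloss directly (the helper name `binary_log_loss` and the toy labels are illustrative, not part of the notebook) and evaluates it for a constant 0.5 prediction.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Any set of labels gives the same score when every prediction is 0.5,
# because -log(0.5) is paid on every row regardless of the outcome.
labels = [0, 1, 1, 0, 1, 0]
naive_score = binary_log_loss(labels, [0.5] * len(labels))
```

The result equals `math.log(2)`, which is where the ~.69315 baseline comes from.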

## Create testing data
Competitors are provided training data where the correct outcome is known, so initially we will break this into two sets.  The first set will be used to train a model.  The second set will not be used in training at all and will only be used to test how accurately the model performs on unseen data.

In [4]:
from sklearn.model_selection import train_test_split

# Split the dataset into an 85% training and 15% testing set
X_train, X_test, y_train, y_test = train_test_split(train, target, stratify= target, test_size=0.15, random_state=23)
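The `stratify=target` argument asks for a split in which both parts keep the same class proportions as the full data. The following is a rough sketch of that idea written with NumPy only; it is not scikit-learn's implementation, and the function name `stratified_split` is made up for this example.

```python
import numpy as np

def stratified_split(y, test_size=0.15, seed=23):
    """Return (train_idx, test_idx) with the test share taken per class."""
    y = np.asarray(y)
    rng = np.random.RandomState(seed)
    test_idx = []
    for cls in np.unique(y):
        # Shuffle the rows of this class and take the same fraction of each.
        cls_idx = rng.permutation(np.where(y == cls)[0])
        n_test = int(round(test_size * len(cls_idx)))
        test_idx.extend(cls_idx[:n_test])
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Toy labels: 60 zeros and 40 ones.
y_toy = np.array([0] * 60 + [1] * 40)
train_idx, test_idx = stratified_split(y_toy)
```

With these toy labels the 15-row test part holds 9 zeros and 6 ones, matching the 60/40 balance of the full set.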


## Simple Logistic Regression Model
Run a logistic regression using sklearn and calculate the model's score on the testing data.

In [5]:
from sklearn import linear_model
reg = linear_model.LogisticRegressionCV(random_state=23)

model = reg.fit(X_train,y_train)
Y_pred = model.predict_proba(X_test)

from sklearn.metrics import log_loss

score_log_reg = log_loss(y_test, Y_pred)
print "Logloss for Logistic Regression:", score_log_reg

Logloss for Logistic Regression: 0.691193416914


## Logistic Regression Results
An extremely simple linear model without feature engineering produces a logloss of 0.69119 or a 0.3% improvement over the naive model.  Not a great result, but at least it is better than guessing.  The above reported logloss will be slightly different if the notebook is run on a different weekly dataset.

## Introducing Non-Linearity and Interaction Between Features
Logistic regression is an extremely simple (though powerful) linear model that does not take into account interactions between terms.  Next we will engineer features to include interactions between all the terms and add non-linearity by squaring the existing features.

In [6]:
poly = PolynomialFeatures(2)
polyfeatures = pd.DataFrame(poly.fit_transform(train))
# drop first column that only contains 1's
polyfeatures = polyfeatures.drop(polyfeatures.columns[0], axis=1)
print polyfeatures.shape

(96320, 252)
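The column count can be verified without fitting anything: a degree-2 expansion of n features contains the n original columns plus every product of two columns, squares included. The helper below (`n_poly_terms` is an illustrative name) counts those terms, leaving out the constant column that the cell above drops.

```python
from itertools import combinations_with_replacement

def n_poly_terms(n_features, degree=2):
    """Count the non-constant monomials of total degree <= `degree`."""
    return sum(
        1
        for d in range(1, degree + 1)
        for _ in combinations_with_replacement(range(n_features), d)
    )

terms_21 = n_poly_terms(21)   # 21 linear terms + 231 squares and pairwise products
terms_25 = n_poly_terms(25)   # the same count for 25 input columns
```

For 21 inputs this gives 252, which matches the printed shape.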


We have now increased the number of features from 21 to 252.  We will rerun the Logistic Regression model on the new set of features.

In [7]:
# Split the dataset into an 85% training and 15% testing set
X_trainp, X_testp, y_trainp, y_testp = train_test_split(polyfeatures, target, stratify= target, test_size=0.15, random_state=20)


from sklearn import linear_model
reg = linear_model.LogisticRegressionCV(random_state=1, n_jobs=-1)

model = reg.fit(X_trainp,y_trainp)
Y_predp = model.predict_proba(X_testp)

from sklearn.metrics import log_loss

score_log_reg = log_loss(y_testp, Y_predp)
print "Logloss for Logistic Regression:", score_log_reg


Logloss for Logistic Regression: 0.691143753005


## Non-Linearity and Interaction Between Features Results
Adding the squared and interaction features produces a logloss of 0.69114.  No real improvement.

## PCA and t-SNE
Since the non-linearity and interaction features did not improve the score, we will move on to two other feature-engineering techniques.  PCA and t-SNE are methods that mathematically produce simpler, lower-dimensional representations of the feature space.  These techniques can reduce the amount of noise in the data and provide a cleaner signal for a learning algorithm to identify.  Hopefully the clearer signal produces better predictions.

In [8]:
# calculate TSNE features
def features_tsne(X):
    features_tsne = pd.DataFrame(bh_sne(X))
    features_tsne.reset_index(drop=True, inplace=True)
    return features_tsne

# calculate PCA features
def features_pca(X):
    pca = decomposition.PCA(n_components=2, random_state=100)
    features_pca = pd.DataFrame(pca.fit_transform(X))
    features_pca.reset_index(drop=True, inplace=True)
    return features_pca

# create polynomial and interaction features
def features_poly(X):
    poly = PolynomialFeatures(2)
    polyfeatures = pd.DataFrame(poly.fit_transform(X))
    # drop first column that only contains 1's
    polyfeatures = pd.DataFrame(polyfeatures.drop(polyfeatures.columns[0], axis=1))
    print polyfeatures.shape
    return polyfeatures

# create scale features
def features_scale(X):
    min_max_scaler = preprocessing.MinMaxScaler()
    X_scaled = pd.DataFrame(min_max_scaler.fit_transform(X))
    return X_scaled
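For readers unfamiliar with what the two PCA columns contain, here is a minimal NumPy sketch of the underlying computation: center the data, take a singular value decomposition, and project onto the leading directions. It is not scikit-learn's `PCA` class (component signs may differ from it), and `pca_project` and the random data are assumptions for illustration.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal directions."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered.dot(Vt[:n_components].T)

rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)        # 200 rows, 5 made-up features
Z_toy = pca_project(X_toy)      # 200 rows, 2 PCA features
```

The first output column carries at least as much variance as the second, which is why a small number of components can summarize much of the original spread.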

## Combine the Test and Train Features to Create PCA and t-SNE Features

In [9]:
all_features = pd.concat([train, test], axis=0)
all_features.reset_index(drop=True, inplace=True)

print all_features.shape

(135259, 21)


In [10]:
print "calculating tsne features"
print ()
f_tsne = features_tsne(all_features)

print "calculating pca features"
print ()
f_pca = features_pca(all_features)

# merge new tsne and pca features with the original features
features_tsne_pca_all = pd.concat([all_features, f_tsne, f_pca], axis=1) 

# create squared and interaction features
f_poly = features_poly(features_tsne_pca_all)

# scale features 
f_scaled = features_scale(f_poly)

# separate the testing and training features
train_big_features = f_scaled.ix[0:(len(train)-1),]
test_big_features = f_scaled.ix[(len(train)):(len(f_scaled)-1),]

# split the training dataset into a 85% training and 15% testing set
X_train, X_test, y_train, y_test = train_test_split(train_big_features, target, stratify= target, test_size=0.15, random_state=500)

# set algorithm to logistic regression
reg = linear_model.LogisticRegressionCV(random_state=200, n_jobs=-1)

print "Training model . . ."
print ()
model = reg.fit(X_train,y_train)
print "Creating training predictions . . ."
print ()
Y_predp = model.predict_proba(X_test)
score_lr = log_loss(y_test,Y_predp)
print "Logloss for LogisticRegressionCV: ", score_lr    

calculating tsne features
()
calculating pca features
()
(135259, 350)
Training model . . .
()
Creating training predictions . . .
()
Logloss for LogisticRegressionCV:  0.689401056926


## Results with PCA and t-SNE features
Creating the t-SNE features takes about 25 minutes.  The new model produces a logloss of 0.6894, just shy of the project goal.

## Last Step - Multiple Model Runs in a Loop
The final step will be to run the same model three times over different subsets of the training data and with three different runs of t-SNE.  The separate t-SNE runs give a different representation of the original features each time, so each run adds slightly different information for the algorithm to pick up on.  Averaging these three runs should increase the predictive power of our model and provide more robust predictions.

In [11]:
# loop through model
predictions = pd.DataFrame()
for n in [1,2,3]:

    print "calculating tsne features"
    f_tsne = features_tsne(all_features)

    print "calculating pca features"
    f_pca = features_pca(all_features)

    # merge new tsne and pca features with the original features
    features_tsne_pca_all = pd.concat([all_features, f_tsne, f_pca], axis=1) 

    # create squared and interaction features
    f_poly = features_poly(features_tsne_pca_all)

    # scale features 
    f_scaled = features_scale(f_poly)

    # separate the testing and training features
    train_big_features = f_scaled.ix[0:(len(train)-1),]
    test_big_features = f_scaled.ix[(len(train)):(len(f_scaled)-1),]

    # split the training dataset into a 85% training and 15% testing set
    X_train, X_test, y_train, y_test = train_test_split(train_big_features, target, stratify= target, test_size=0.15, random_state=500+n)

    # set algorithm to logistic regression
    reg = linear_model.LogisticRegressionCV(random_state=200+n, n_jobs=-1)

    print "Training model . . ."
    model = reg.fit(X_train,y_train)
    print "Creating training predictions . . ."
    Y_predp = model.predict_proba(X_test)
    score_lr = log_loss(y_test,Y_predp)
    print "Logloss for LogisticRegressionCV: ", score_lr

    print "Creating testing predictions . . ."
    print ()    
    predictions['pred'+str(n)] = model.predict_proba(test_big_features)[:,1]    

calculating tsne features
calculating pca features
(135259, 350)
Training model . . .
Creating training predictions . . .
Logloss for LogisticRegressionCV:  0.689600470573
Creating testing predictions . . .
()
calculating tsne features
calculating pca features
(135259, 350)
Training model . . .
Creating training predictions . . .
Logloss for LogisticRegressionCV:  0.688055670367
Creating testing predictions . . .
()
calculating tsne features
calculating pca features
(135259, 350)
Training model . . .
Creating training predictions . . .
Logloss for LogisticRegressionCV:  0.68716640806
Creating testing predictions . . .
()


## Loop Results
The above three runs produce logloss scores of 0.68960, 0.68806, and 0.68717 and take approximately an hour to run.
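One property of logloss supports averaging several runs: because the negative logarithm is convex, the logloss of the averaged probabilities is never worse than the average of the individual logloss scores. The sketch below checks this on invented numbers; the labels, the three prediction lists, and the helper `binary_log_loss` are illustrative and are not outputs of the runs above.

```python
import math

def binary_log_loss(y_true, p_pred):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        for y, p in zip(y_true, p_pred)
    ]
    return sum(losses) / len(losses)

# Made-up labels and three made-up prediction runs for the same five rows.
y_toy = [1, 0, 1, 1, 0]
runs = [
    [0.60, 0.45, 0.55, 0.52, 0.48],
    [0.52, 0.40, 0.65, 0.49, 0.51],
    [0.58, 0.50, 0.50, 0.56, 0.44],
]

# Row-wise arithmetic mean of the three runs.
averaged = [sum(col) / len(col) for col in zip(*runs)]

individual_scores = [binary_log_loss(y_toy, r) for r in runs]
mean_individual = sum(individual_scores) / len(individual_scores)
averaged_score = binary_log_loss(y_toy, averaged)
```

`averaged_score` is at most `mean_individual` for any labels and predictions. It is not guaranteed to beat the best single run, so the average still needs to be scored on held-out data.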

## Creating a CSV file for submission

In [12]:
# take the arithmetic mean of three predictions
predictions['probability'] = predictions.mean(axis=1)

# combine the row ids with the predictions
submit = pd.concat([t_id,predictions['probability']], axis=1)

# add the training rows back into the test set, as Numerai only accepts complete testing data with the included training data
merged1 = merged.merge(submit, how = 'outer', on='t_id')
merged1['probability'].fillna(merged1['target'], inplace=True)

# take only the columns of interest from the merged data
finalsubmit = merged1[['t_id', 'probability']]

# create CSV file
finalsubmit.to_csv('/Users/jeremycastle/Desktop/numerai/week50/submission/logregcv_loopn1.csv', index=False)

## Submission Result
The final submission scored 0.68965 on Numerai’s public leaderboard and has floated between 38th and 120th place.  It typically ranks in the 50th-70th place range out of the current 400 competitor submissions.  The goal of producing a model that is in the top 100 and commands capital has been achieved, at least on the public leaderboard.  I will report back next Wednesday (December 7th) with the final private leaderboard results.

## Future Steps
1. The majority of the code runtime comes from calculating the t-SNE features.  There exists a parallel t-SNE module that I have attempted to install but have not been able to run.  Hopefully this parallel module can reduce the runtime considerably.
2. Experiment with different settings in t-SNE, specifically the perplexity setting.
3. There are a number of top 10 competitors producing predictions 20-30 minutes after new data sets are released.  If top 10 predictions are attainable in 30 minutes, this should be the time goal.
4. Automate the downloading of the data from Numerai and the uploading of predictions.
5. Explore stacking and ensembling different types of models.
6. Create a validation set to properly score the average of the three models without relying on the Numerai leaderboard to assess model performance.
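The last step above could be approached by fixing one holdout set up front, so that no run ever trains on the rows used for scoring, and then scoring the averaged predictions on it. The sketch below is one possible shape for that, not a tested part of this project: `score_averaged_runs`, the `fit_predict` callback signature, and the placeholder `prior_model` are all hypothetical names introduced here.

```python
import numpy as np

def score_averaged_runs(fit_predict, X, y, n_runs=3, holdout_frac=0.15, seed=0):
    """Logloss of the averaged predictions on a holdout shared by all runs."""
    X, y = np.asarray(X), np.asarray(y)
    order = np.random.RandomState(seed).permutation(len(y))
    n_hold = int(round(holdout_frac * len(y)))
    hold_idx, fit_idx = order[:n_hold], order[n_hold:]

    # Every run trains on the same fitting rows and predicts the same holdout.
    preds = [
        fit_predict(X[fit_idx], y[fit_idx], X[hold_idx], run)
        for run in range(n_runs)
    ]
    avg = np.clip(np.mean(preds, axis=0), 1e-15, 1 - 1e-15)
    y_hold = y[hold_idx]
    return float(-np.mean(y_hold * np.log(avg) + (1 - y_hold) * np.log(1 - avg)))

def prior_model(X_fit, y_fit, X_hold, run):
    """Placeholder model: predict the training base rate for every holdout row."""
    return np.full(len(X_hold), y_fit.mean())

rng = np.random.RandomState(1)
X_demo = rng.rand(200, 3)                  # made-up features
y_demo = rng.randint(0, 2, size=200)       # made-up 0/1 labels
demo_score = score_averaged_runs(prior_model, X_demo, y_demo)
```

In practice `prior_model` would be replaced by a function that builds the t-SNE, PCA, and polynomial features and fits the logistic regression for the given run.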
