# Introduction

Trading for profit has always been a difficult problem to solve, even more so in today’s fast-moving and complex financial markets. Electronic trading allows for thousands of transactions to occur within a fraction of a second, resulting in nearly unlimited opportunities to potentially find and take advantage of price differences in real time.

You will build your own quantitative trading model to maximize returns using market data from a major global stock exchange. 

The challenge will be to use the historical data, mathematical tools, and technological tools at your disposal to create a model that gets as close to certainty as possible. You will be presented with a number of potential trading opportunities, which your model must choose whether to accept or reject.

# Data Description

This dataset contains an anonymized set of features, **feature_{0...129}**, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an **action** value: 1 to make the trade and 0 to pass on it. Each trade has an associated **weight** and **resp**, which together represent a return on the trade. The **date** column is an integer which represents the day of the trade, while **ts_id** represents a time ordering. In addition to anonymized feature values, you are provided with metadata about the features in features.csv.

In the training set, train.csv, you are provided a **resp** value, as well as several other **resp_{1,2,3,4}** values that represent returns over different time horizons. These variables are not included in the test set. Trades with **weight = 0** were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation.
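
To make the relationship between **weight**, **resp**, and **action** concrete, here is a minimal sketch on made-up numbers. It assumes that the contribution of a trade is `weight * resp * action`; the arrays below are hypothetical and are not taken from train.csv.

```python
import numpy as np

# Hypothetical toy trades (not taken from train.csv)
weight = np.array([0.0, 1.5, 2.0, 0.5])
resp = np.array([0.02, -0.01, 0.03, 0.01])

# Take a trade only when its return is positive
action = (resp > 0).astype(int)

# Assumed contribution of each trade: weight * resp * action
contribution = weight * resp * action

print(action)        # which trades are taken
print(contribution)  # the weight = 0 trade contributes nothing
```

Under this assumption, a trade with weight 0 adds nothing to the total no matter which action is chosen, which is why such rows can be ignored when scoring.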

### Library

Libraries necessary for the execution of this notebook.

In [None]:
import pandas as pd
import numpy as np
import datatable as dt
from scipy import stats

# Plot
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns
import plotly.express as px

# Preparing features
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# PCA
from sklearn.decomposition import PCA

# Training and test data
from sklearn.model_selection import train_test_split

# Base estimators
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Ensemble methods
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Warnings
import warnings
warnings.filterwarnings("ignore")

# Loading Data

**features.csv** - metadata pertaining to the anonymized features

**train.csv** - the training set, contains historical data and returns

In [None]:
features = dt.fread('../input/jane-street-market-prediction/features.csv')
features = features.to_pandas()

In [None]:
%%time

train = dt.fread('../input/jane-street-market-prediction/train.csv')
train = train.to_pandas()

print("train size:", train.shape)

# Exploratory Data Analysis

Let's take a look at a summary table of the features metadata and the training data, showing the data type, number of missing values, number of unique values, and the first three values of each column. Since the training data has many variables, we will look only at the first 25.

In [None]:
def resumetable(df):
    """Summarize each column: dtype, missing count, unique count, first three values."""
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    # Positional indexing, so this also works when the index is not 0..n-1
    summary['First Value'] = df.iloc[0].values
    summary['Second Value'] = df.iloc[1].values
    summary['Third Value'] = df.iloc[2].values

    return summary

In [None]:
cm = sns.light_palette("blue", as_cmap=True)

resumetable(features).style.background_gradient(cmap=cm)

In [None]:
cm = sns.light_palette("blue", as_cmap=True)

resumetable(train)[:25].style.background_gradient(subset=['Missing', 'First Value', 
                                                          'Second Value', 'Third Value'], cmap=cm)

## Features overview

Let's take a look at the density curves for **resp** and **weight**.

In [None]:
resp_cols = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'weight']

fig, axes = plt.subplots(3, 2, figsize=(16, 18))

# One density plot per column; histplot replaces the deprecated distplot
for ax, col in zip(axes.ravel(), resp_cols):
    sns.histplot(train[col], bins=50, kde=True, stat="density", ax=ax)
    ax.set_title(col, fontsize=18)
    ax.set_xlabel("")
    ax.set_ylabel("Density", fontsize=15)

plt.tight_layout()
plt.show()

As we can see, the response values are highly concentrated around 0, so values far from 0 show up as outliers.

In [None]:
plt.figure(figsize=(13,8))

sns.boxplot(data=train.iloc[:,2:7])

plt.show()

Both the box plots and the quantiles printed below show that the resp values are concentrated around 0. Roughly 50% of the weight values, on the other hand, are below 1, which points to high variability in this variable.

In [None]:
print("Quantiles for resp and weight:")
print(train[['resp','resp_1','resp_2','resp_3','resp_4','weight']].quantile([.01, .025, .1, .25, .5, .75, .9, .975, .99]))

Now let's see what the density curves look like for features 1 through 6.

In [None]:
feature_cols = [f'feature_{i}' for i in range(1, 7)]

fig, axes = plt.subplots(3, 2, figsize=(16, 18))

# One density plot per feature; histplot replaces the deprecated distplot
for ax, col in zip(axes.ravel(), feature_cols):
    sns.histplot(train[col], bins=50, kde=True, stat="density", ax=ax)
    ax.set_title(col, fontsize=18)
    ax.set_xlabel("")
    ax.set_ylabel("Density", fontsize=15)

plt.tight_layout()
plt.show()

Let's look at the correlation between the response variables and weight. There is a high positive correlation between some of the resp variables, but no correlation between weight and any of the resp variables. For now we will not look at the correlations among the features, since I will use PCA to reduce the dimensionality of the data.

In [None]:
plt.subplots(figsize=(13, 10))

matrix = np.triu(train.iloc[:, 1:7].corr())

c = sns.heatmap(train.iloc[:, 1:7].corr(), annot = True, vmin = -1, vmax = 1, center = 0, cmap = 'coolwarm', mask = matrix)
c.set_title("Resp and Weight Correlation", fontsize=18)

plt.show()

# Preparing features

Before applying any model, I will do some light cleanup of the data.

Let's select all the features.

In [None]:
# Select all features
features = train.columns[7:137]
features

Since the weight variable contains a very large number of zeros, which could interfere with the model's predictions, I will remove the rows with weight = 0 from the analysis (I may later keep them and train a second model for comparison).

In [None]:
# Keep only trades with non-zero weight
train = train[train['weight'] != 0]

As the competition's sample submission shows, the name of the output (target) variable is **action**, and its value is 0 or 1. Here it is set to 1 when **resp** is positive and 0 otherwise.

In [None]:
train['action'] = (train['resp'].values > 0).astype(int)

Let's look at the distribution of the two classes of the action variable.

In [None]:
plt.figure(figsize=(12, 5))

freq = len(train)

g = sns.countplot(x = train['action'])
g.set_xlabel("Action", fontsize = 13)
g.set_ylabel("Count", fontsize = 13)

for p in g.patches:
    height = p.get_height()
    g.text(p.get_x() + p.get_width() / 2., height + 3,
          '{:1.2f}%'.format(height / freq * 100),
          ha = "center", fontsize = 15)

After filtering on the weight variable, let's check whether there are still missing values.
There are many variables with missing values. A common approach is to fill missing values with the mean of each variable, but since some variables may contain outliers, I will fill with the median instead (for comparison, I will later switch to the mean and compare the models).

In [None]:
[col for col in list(train.columns) if train[col].isnull().any()]


In [None]:
train_median = train.median()
train = train.fillna(train_median)

Now no variable has missing values.

In [None]:
[col for col in train.columns if train[col].isnull().any()]
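
The choice of the median over the mean can be illustrated on a tiny made-up column. This is only a sketch with hypothetical values, using NumPy directly rather than the `train.median()` and `fillna` calls above.

```python
import numpy as np

# Hypothetical feature column with one extreme outlier and two missing values
col = np.array([0.1, 0.2, np.nan, 0.3, 50.0, np.nan])

mean_fill = np.nanmean(col)      # pulled toward the outlier
median_fill = np.nanmedian(col)  # stays near the bulk of the data

# Fill the gaps with the median, leaving observed values untouched
filled = np.where(np.isnan(col), median_fill, col)

print(f"mean: {mean_fill:.3f}, median: {median_fill:.3f}")
print(filled)
```

With a single large outlier, the mean lands far from every typical value, while the median stays among them, so median-filled rows look like ordinary observations.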

## Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a method for extracting the important information (in the form of components) from a large set of variables in a data set. It lets you represent a high-dimensional dataset with a small number of components. With fewer variables, visualization also becomes much more meaningful.

Before applying PCA, it is important to put the input variables on the same scale. PCA is driven by variance, so without scaling, the variables with the largest ranges would dominate the components.

In [None]:
scaler = StandardScaler().fit(train.loc[:, features].values)
rescaledX = scaler.transform(train.loc[:, features].values)
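
The effect of scaling on PCA can be seen on synthetic data. The sketch below is an illustration under assumptions: the two features are made up, and the explained-variance ratios are computed with a plain NumPy SVD (the helper `explained_variance_ratio` is hypothetical) so the arithmetic stays visible, instead of going through `StandardScaler` and `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical, independent features on very different scales
X = np.column_stack([
    rng.normal(0, 1, 1000),      # small scale
    rng.normal(0, 1000, 1000),   # large scale
])

def explained_variance_ratio(data):
    """Share of total variance captured by each principal component."""
    centered = data - data.mean(axis=0)
    # Squared singular values are proportional to the component variances
    s = np.linalg.svd(centered, compute_uv=False)
    return s**2 / np.sum(s**2)

raw_ratio = explained_variance_ratio(X)

# Standardize: zero mean and unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_ratio = explained_variance_ratio(X_std)

print("raw:   ", raw_ratio.round(4))
print("scaled:", std_ratio.round(4))
```

On the raw data the first component is almost entirely the large-scale feature; after standardization the two features share the variance roughly equally.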

In [None]:
pca_mod = PCA()
comp = pca_mod.fit_transform(rescaledX)

exp_var_cumul = np.cumsum(pca_mod.explained_variance_ratio_)

The plot of cumulative explained variance below shows that much of the variance in the data set (about 80%) is contained in the first fifteen principal components. So, instead of using the 130 original features, we could use only 15 principal components, which explain about 80% of the variability of the original data.

In [None]:
px.area(x = range(1, exp_var_cumul.shape[0] + 1),
    y = exp_var_cumul,
    labels = {"x": "Principal Component", "y": "Explained Variance"}
)
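
Instead of reading the number of components off the chart, it can be computed from the cumulative curve. This is a minimal sketch with hypothetical explained-variance ratios; in the notebook, the same lookup could be applied to `exp_var_cumul`.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order
ratios = np.array([0.40, 0.25, 0.18, 0.10, 0.05, 0.02])
cumulative = np.cumsum(ratios)

threshold = 0.80
# Index of the first cumulative value >= threshold, converted to a count
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(n_components)
```

`np.searchsorted` relies on the cumulative sum being non-decreasing, which always holds for explained-variance ratios.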

In [None]:
comp.shape

In [None]:
# Transform to a DataFrame with named columns, keeping train's index
# so that rows stay aligned with train['action']
feat_cols = ['PC'+str(i) for i in range(comp.shape[1])]
comp_feat = pd.DataFrame(comp, columns = feat_cols, index = train.index)

In [None]:
comp_feat

# Machine Learning

## Ensemble Methods

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
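
A small simulation gives some intuition for why combining models can help. It is a sketch under a strong assumption: five hypothetical models that are each correct 55% of the time and make their errors independently, which real models trained on the same data rarely do.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_models, p_correct = 100_000, 5, 0.55

# Hypothetical setup: each model is independently correct with probability 0.55
correct = rng.random((n_samples, n_models)) < p_correct

single_acc = correct[:, 0].mean()
# The majority vote is right when more than half of the models are right
vote_acc = (correct.sum(axis=1) > n_models / 2).mean()

print(f"single model: {single_acc:.3f}, majority vote: {vote_acc:.3f}")
```

The more correlated the models' errors are, the smaller this gain becomes.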

I'll use some ensemble methods and then check which one got the best performance.

But first, I will divide the data into training and testing.

In [None]:
# Division into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(comp_feat.iloc[:,0:50], 
                                                    train['action'], 
                                                    test_size = 0.30, 
                                                    random_state = 101)

## Random Forest

In [None]:
#%%time

# Create the classifier
#RF = RandomForestClassifier(n_estimators  = 10)

# Training the model
#RF.fit(X_train, Y_train)

#result = RF.score(X_test, Y_test)
#print('Accuracy in test data: %.3f%%' % (result * 100.0))

#-----------------------------------------------------------

#CPU times: user 3min 6s, sys: 125 ms, total: 3min 6s
#Wall time: 3min 6s
#RandomForestClassifier(n_estimators=10)

#Accuracy in test data: 51.502%

## Bagging

For the Bagging estimator, Logistic Regression is used as the base estimator. It is not the best model and was not my first choice, but it was the fastest to train. I also tried SVM and KNN as base estimators, but they took too long to complete. Compared to Random Forest there is a small loss in accuracy, but a large gain in processing time.

In [None]:
#%%time

# Base estimator
#base = LogisticRegression()

# Create the classifier
#BAG = BaggingClassifier(base_estimator = base, max_samples = 0.5, max_features = 0.5)

# Training the model
#BAG.fit(X_train, Y_train)

#result = BAG.score(X_test, Y_test)
#print('Accuracy in test data: %.3f%%' % (result * 100.0))

#-----------------------------------------------------------

#CPU times: user 36.2 s, sys: 1.24 s, total: 37.5 s
#Wall time: 18.9 s
#BaggingClassifier(base_estimator=LogisticRegression(), max_features=0.5,
#                  max_samples=0.5)

#Accuracy in test data: 51.408%

## Adaboost

The Decision Tree Classifier is used as the base estimator. This model gave slightly higher accuracy, but took much longer to train.

In [None]:
#%%time

# Base estimator
#base = DecisionTreeClassifier(max_depth = 1, min_samples_leaf = 1)

# Create the classifier
#ADA = AdaBoostClassifier(base_estimator = base,
#                         learning_rate = 0.1, 
#                         n_estimators = 100, 
#                         algorithm = "SAMME.R")

# Training the model
#ADA.fit(X_train, Y_train)

#result = ADA.score(X_test, Y_test)
#print('Accuracy in test data: %.3f%%' % (result * 100.0))

#-----------------------------------------------------------

#CPU times: user 9min 28s, sys: 7.37 s, total: 9min 36s
#Wall time: 9min 36s

#Accuracy in test data: 51.618%

## Gradient Boosting

In [None]:
#%%time

# Create the classifier
#GB = GradientBoostingClassifier()

# Training the model
#GB.fit(X_train, Y_train)

#result = GB.score(X_test, Y_test)
#print('Accuracy in test data: %.3f%%' % (result * 100.0))

#-----------------------------------------------------------

#CPU times: user 23min 59s, sys: 941 ms, total: 23min 59s
#Wall time: 24min 1s

#Accuracy in test data: 52.167%

## XGBoost

XGBoost was both faster and more accurate than the AdaBoost and Gradient Boosting models.

In [None]:
#%%time

# Create the classifier
#XGB = XGBClassifier()

# Training the model
#XGB.fit(X_train, Y_train)

#result = XGB.score(X_test, Y_test)
#print('Accuracy in test data: %.3f%%' % (result * 100.0))

#-----------------------------------------------------------

#CPU times: user 5min 25s, sys: 1.08 s, total: 5min 26s
#Wall time: 5min 27s

#Accuracy in test data: 52.203%

## LightGBM

Among all the ensemble methods presented here, LightGBM was the fastest and most accurate. It runs at about the same speed as Bagging, but fits the data better.

In [None]:
%%time

from lightgbm import LGBMClassifier

# Create the classifier
LGBM = LGBMClassifier()

# Training the model
LGBM.fit(X_train, Y_train)

In [None]:
result = LGBM.score(X_test, Y_test)
print('Accuracy in test data: %.3f%%' % (result * 100.0))

# Submission File

In [None]:
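# Fit the 50-component PCA on the scaled training features: the model was
# trained on components of scaled data, and test rows are scaled before projection
pca = PCA(n_components = 50, svd_solver = 'full').fit(rescaledX)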

In [None]:
def fillna_npwhere(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    wt = test_df.iloc[0].weight
    if wt == 0:
        # Zero-weight trades do not count towards the score
        sample_prediction_df.action = 0
    else:
        # Impute missing values, scale, project onto the PCA components, then predict
        X_row = fillna_npwhere(test_df[features].values, train_median[features].values)
        X_row = pca.transform(scaler.transform(X_row))
        sample_prediction_df.action = LGBM.predict(X_row)
    env.predict(sample_prediction_df)
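
One way to keep the preprocessing at prediction time identical to training is to bundle all steps into a single scikit-learn pipeline. The sketch below is not the notebook's method: it uses synthetic data, and `LogisticRegression` stands in for LightGBM so the example depends only on scikit-learn. The names `X_toy`, `y_toy`, and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the training features and action labels
X_toy = rng.normal(size=(500, 10))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan  # inject missing values

# Impute -> scale -> project -> classify, all fitted on the same data
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(),
)
pipe.fit(X_toy, y_toy)

# New rows with gaps go through the identical fitted transforms
X_new = rng.normal(size=(3, 10))
X_new[0, 2] = np.nan
pred = pipe.predict(X_new)
print(pred)
```

Because the imputer, scaler, and PCA are fitted once inside the pipeline, a single `pipe.predict` call replaces the chain of manual transforms and removes the risk of fitting one step on differently prepared data.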