# Kaggle Project: Identifying Profitable Trades using XGBoost

The efficient market hypothesis posits that markets cannot be consistently beaten because asset prices always reflect all available information. In a perfectly efficient market, every buyer and seller would have the information needed to make rational trading decisions.

In reality, financial markets are not perfectly efficient. The purpose of this trading model is to identify arbitrage opportunities to "buy low and sell high". In other words, we exploit market inefficiencies to decide whether a given trade is worth executing.

The dataset, provided by Jane Street, contains an anonymized set of 130 features (`feature_0` to `feature_129`) representing real stock market data. Each row in the dataset represents a trading opportunity, for which I predict an action value: 1 to make the trade and 0 to pass on it. Because the dataset is high-dimensional, I use Principal Components Analysis (PCA) to compress the features into a smaller set of components for supervised learning. The intuition is to keep most of the variation in the data while working with fewer inputs. I then use XGBoost (extreme gradient boosting), a popular ML library known for its execution speed and model performance, to predict profitable trades. I also use Optuna (an automatic hyperparameter optimization framework) to tune the hyperparameters of the classification model.

Please upvote if you find this notebook helpful! 😊 Thank you! I would also be very happy to receive feedback on my work.

# 1) Import important libraries and packages

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import xgboost as xgb
import optuna

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# 2) Load and clean dataset

In [None]:
# Import dataset as train
train = pd.read_csv('/kaggle/input/jane-street-market-prediction/train.csv', nrows=2200000)
train.info()

In [None]:
# Drop rows with weight == 0
# Trades with weight = 0 were intentionally included in the dataset for completeness,
# although such trades do not contribute to the scoring evaluation
train = train[train['weight'] != 0].copy()

# Create the 'action' column (dependent variable)
# The 'action' column is defined this way because of the evaluation metric used for this project.
# We want to maximise the utility function, and hence p_i = sum_j(weight_ij * resp_ij * action_ij),
# the weighted return earned on date i across its trades j.
# Trades with positive resp increase p_i, so we label them action = 1
train['action'] = (train['resp'] > 0).astype(int)
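
The comment above only gives the per-date term p_i. As a rough sketch of how the full score could be computed offline, the function below follows the formula as I understand it from the competition's evaluation page: t = (sum of p_i / sqrt(sum of p_i^2)) * sqrt(250 / number of dates), and u = min(max(t, 0), 6) * sum of p_i. The function name `utility_score` and the toy inputs are my own and are not part of the competition API.

```python
import numpy as np

def utility_score(date, weight, resp, action):
    """Sketch of the competition utility (formula assumed from the evaluation page)."""
    date = np.asarray(date)
    weight = np.asarray(weight, dtype=float)
    resp = np.asarray(resp, dtype=float)
    action = np.asarray(action, dtype=float)

    # p_i: weighted return summed over all trades j that fall on date i
    days, day_idx = np.unique(date, return_inverse=True)
    p = np.bincount(day_idx, weights=weight * resp * action, minlength=len(days))

    sum_p = p.sum()
    sum_p2 = np.sum(p ** 2)
    if sum_p2 == 0:
        return 0.0  # no trades taken, so nothing was earned or lost

    # t behaves like an annualised Sharpe ratio; it is clipped to [0, 6]
    t = sum_p / np.sqrt(sum_p2) * np.sqrt(250 / len(days))
    return float(min(max(t, 0.0), 6.0) * sum_p)

# Toy example: three trades spread over two dates (hypothetical numbers)
score = utility_score(date=[0, 0, 1],
                      weight=[1.0, 2.0, 1.0],
                      resp=[0.1, -0.2, 0.3],
                      action=[1, 0, 1])
print(score)
```

Scoring the hold-out predictions with a function like this, next to accuracy, gives a view that is closer to what the leaderboard rewards.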

In [None]:
features = [col for col in list(train.columns) if 'feature' in col]

In [None]:
X = train[features]
y = train['action']

# Next, we hold out part of the training data to form the hold-out validation set
train_x, valid_x, train_y, valid_y = train_test_split(X, y, test_size=0.2)

# 3) Exploratory data analysis

In [None]:
# First, we check whether the target class is balanced in the training data
sns.set_palette("colorblind")
class_share = train_y.value_counts(normalize=True)
# Recent seaborn versions require x and y to be passed as keyword arguments
ax = sns.barplot(x=class_share.index, y=class_share.values)
ax.set_title("Proportion of trades with action=0 and action=1")
ax.set_ylabel("Proportion")
ax.set_xlabel("Action")
sns.despine();
# The target class is fairly balanced, with almost 50% of trades corresponding to each action
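
The balance check can also be read off numerically. The sketch below uses a small made-up label series (`toy_y` stands in for `train_y`) and derives the majority-class baseline, which is the accuracy a model would get by always predicting the more common action.

```python
import pandas as pd

# Toy labels standing in for train_y (hypothetical values)
toy_y = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="action")

# normalize=True returns proportions that sum to 1 instead of raw counts
proportions = toy_y.value_counts(normalize=True).sort_index()
print(proportions)

# Accuracy of always predicting the majority class: a floor any model should beat
baseline_accuracy = proportions.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

With classes this close to 50/50, accuracy is a reasonable tuning metric, which is why it is used later in the Optuna objective.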

In [None]:
# Next, we plot a diagonal correlation heatmap to see if there are strong correlations between the features

# Compute the correlation matrix
#corr = train_x.corr()

# Generate a mask for the upper triangle
#mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
#f, ax = plt.subplots(figsize=(12, 10))

# Generate a custom diverging colormap
#cmap = sns.diverging_palette(20, 230, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
#sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,
#            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# There are strong correlations between several of the features
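
Since the heatmap code above is commented out, a cheaper way to back the claim about strong correlations is to list the most correlated pairs directly. This is only a sketch on synthetic data: the helper `top_correlated_pairs`, the 0.9 threshold and the toy column names are my own choices, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |corr|) for pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs().to_numpy()
    cols = list(df.columns)
    # k=1 keeps the strict upper triangle, so each pair appears once and the diagonal is skipped
    rows, columns = np.where(np.triu(corr, k=1) > threshold)
    pairs = [(cols[i], cols[j], float(corr[i, j])) for i, j in zip(rows, columns)]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

# Synthetic data: feature_b is a noisy copy of feature_a, feature_c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "feature_a": a,
    "feature_b": a + 0.01 * rng.normal(size=200),
    "feature_c": rng.normal(size=200),
})
pairs = top_correlated_pairs(toy, threshold=0.9)
print(pairs)
```

On the real data the same call would be `top_correlated_pairs(train_x)`, although computing the full correlation matrix on all rows can be slow.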

In [None]:
# Finally, we investigate if there are missing values and we impute them
missing_values = pd.DataFrame()
missing_values['feature'] = features
missing_values['num_missing'] = [train_x[i].isna().sum() for i in features]
missing_values.T
# There are quite a lot of missing values across the features

In [None]:
train_median = train_x.median()
# Impute medians in both training set and the hold-out validation set
train_x = train_x.fillna(train_median)
valid_x = valid_x.fillna(train_median)
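
The point of computing the medians on the training split only is that the validation rows never influence the values used to fill them. A minimal sketch with made-up numbers (the `toy_*` names are illustrative):

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"feature_0": [1.0, np.nan, 3.0, 5.0],
                          "feature_1": [10.0, 20.0, np.nan, 40.0]})
toy_valid = pd.DataFrame({"feature_0": [np.nan, 100.0],
                          "feature_1": [np.nan, np.nan]})

# median() skips NaN by default, so each column's median uses only observed values
toy_median = toy_train.median()

# fillna with a Series matches on column names and fills each column with its own median
toy_train_filled = toy_train.fillna(toy_median)
toy_valid_filled = toy_valid.fillna(toy_median)
print(toy_valid_filled)
```

The outlier of 100.0 in the validation frame does not shift the fill value for `feature_0`, which stays at the training median of 3.0.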

# 4) Principal Components Analysis

In [None]:
# Before we perform PCA, we need to normalise the features so that they have zero mean and unit variance
scaler = StandardScaler()
scaler.fit(train_x)
train_x_norm = scaler.transform(train_x)

pca = PCA()
comp = pca.fit(train_x_norm)

# We plot how the cumulative explained variance of the 130 features grows with the number of principal components
plt.plot(np.cumsum(comp.explained_variance_ratio_))
plt.grid()
plt.xlabel('Number of Principal Components')
plt.ylabel('Explained Variance')
sns.despine();

# The first 15 principal components explain about 80% of the variation
# The first 40 principal components explain about 95% of the variation
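
Rather than reading the thresholds off the plot, the number of components needed for a given share of variance can be computed. The sketch below uses NumPy's SVD on synthetic data instead of scikit-learn's `PCA`, so that the arithmetic is visible; the squared singular values of the standardised matrix, divided by their sum, should match `explained_variance_ratio_`. The helper name `n_components_for` is hypothetical.

```python
import numpy as np

def n_components_for(X, target=0.95):
    """Smallest number of components whose cumulative explained-variance ratio reaches target."""
    X = np.asarray(X, dtype=float)
    # Standardise to zero mean and unit variance, as StandardScaler does
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    singular_values = np.linalg.svd(X_std, compute_uv=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratio)
    # searchsorted returns the first index where cumulative >= target
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic data: 10 observed columns driven by 3 latent factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

k, cumulative = n_components_for(X, target=0.95)
print(k, np.round(cumulative[:4], 3))
```

With the fitted scikit-learn object from the cell above, the equivalent lookup would be `np.searchsorted(np.cumsum(comp.explained_variance_ratio_), 0.95) + 1`.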

In [None]:
# Using the first 70 principal components, we apply the PCA mapping
# From here on, we work with only 70 components instead of the full set of 130 features
pca = PCA(n_components=70).fit(train_x_norm)
train_x_transform = pca.transform(train_x_norm)

In [None]:
# Transform the validation set
valid_x_transform = pca.transform(scaler.transform(valid_x))
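
The scaler and the PCA are both fitted on the training split and then only applied to the validation split. One way to make that ordering harder to get wrong is to chain the two steps in a scikit-learn pipeline. This is a sketch on random toy data, not a change to the notebook's own variables; `reducer` and the choice of 4 components are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_train = rng.normal(size=(200, 10))
toy_valid = rng.normal(size=(50, 10))

# fit_transform learns the scaling and the projection from the training data only
reducer = make_pipeline(StandardScaler(), PCA(n_components=4))
toy_train_t = reducer.fit_transform(toy_train)

# transform reuses the fitted mean, variance and components without refitting
toy_valid_t = reducer.transform(toy_valid)
print(toy_train_t.shape, toy_valid_t.shape)
```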

# 5) Train XGBoost classifier + Tune hyperparameters using Optuna

In [None]:
# We create the XGBoost-specific DMatrix data format from the numpy array.
# This data structure is optimised for memory efficiency and training speed
dtrain = xgb.DMatrix(train_x_transform, label=train_y)
dvalid = xgb.DMatrix(valid_x_transform, label=valid_y)

In [None]:
from sklearn.metrics import accuracy_score

# The objective function is passed an Optuna-specific trial argument
def objective(trial):
    # params specifies the XGBoost hyperparameters to be tuned
    params = {
        'max_depth': trial.suggest_int('max_depth', 10, 25),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.05),
        'subsample': trial.suggest_float('subsample', 0.80, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.80, 1),
        'gamma': trial.suggest_int('gamma', 0, 15),
        'tree_method': 'gpu_hist',
        'objective': 'binary:logistic'
    }
    # xgb.train does not read 'n_estimators' from params; the number of trees
    # is set through num_boost_round, so we tune it separately and pass it there
    n_estimators = trial.suggest_int('n_estimators', 200, 600)

    bst = xgb.train(params, dtrain, num_boost_round=n_estimators)
    preds = bst.predict(dvalid)
    # Round the predicted probabilities to 0/1 labels
    pred_labels = np.rint(preds)
    # Trials are evaluated on their accuracy on the hold-out validation set
    accuracy = accuracy_score(valid_y, pred_labels)
    return accuracy
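
The objective turns predicted probabilities into labels with `np.rint`, which is effectively a 0.5 threshold. A small sketch with made-up probabilities shows what that does, including the edge case at exactly 0.5, where NumPy rounds half to even and therefore gives 0:

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
probs = np.array([0.10, 0.49, 0.50, 0.51, 0.90])
true_y = np.array([0, 0, 1, 1, 1])

labels_rint = np.rint(probs).astype(int)        # rounds to the nearest integer, 0.5 -> 0
labels_thresh = (probs > 0.5).astype(int)       # explicit threshold with the same result

# Accuracy is the share of rows where the predicted label matches the true label
accuracy = np.mean(labels_rint == true_y)
print(labels_rint, labels_thresh, accuracy)
```

Writing the threshold explicitly makes it easier to try values other than 0.5 later, for example to trade fewer but more confident opportunities.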

In [None]:
if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=35, timeout=600)

    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial

    print("  Value: {}".format(trial.value))
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

In [None]:
best_params = trial.params
best_params['tree_method'] = 'gpu_hist' 
best_params['objective'] = 'binary:logistic'

In [None]:
# Fit the XGBoost classifier with optimal hyperparameters
optimal_clf = xgb.XGBClassifier(**best_params)

In [None]:
optimal_clf.fit(train_x_transform, train_y)

In [None]:
# Plot how the best accuracy evolves with number of trials
fig = optuna.visualization.plot_optimization_history(study)
fig.show();

In [None]:
# We can also plot the relative importance of different hyperparameter settings
fig = optuna.visualization.plot_param_importances(study)
fig.show();

# 6) Predict on the unseen test set

In [None]:
# We impute the missing values with the medians
def fillna_npwhere(array, values):
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array
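
To see what `fillna_npwhere` does on a single test row, here is a toy call with made-up medians. The function is repeated so the snippet runs on its own. `np.where` broadcasts the 1-D `values` array across the columns of the 2-D row, so each NaN picks up the median of its own column.

```python
import numpy as np

def fillna_npwhere(array, values):
    # The sum is NaN if any element is NaN, which makes this a cheap check
    if np.isnan(array.sum()):
        array = np.where(np.isnan(array), values, array)
    return array

row = np.array([[1.0, np.nan, 3.0]])        # one test row with a missing second feature
medians = np.array([10.0, 20.0, 30.0])      # hypothetical per-feature training medians

filled = fillna_npwhere(row, medians)
print(filled)  # the NaN in column 1 is replaced by that column's median, 20.0

# Rows without missing values are returned untouched, with no copy made
clean = np.array([[1.0, 2.0, 3.0]])
same = fillna_npwhere(clean, medians)
```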

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:
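for (test_df, sample_prediction_df) in iter_test:
    wt = test_df.iloc[0].weight
    if wt == 0:
        # Zero-weight trades do not affect the score, so we skip the model call
        sample_prediction_df.action = 0
    else:
        # Apply the same steps as in training: impute, scale, project, predict
        test_x = fillna_npwhere(test_df[features].values, train_median[features].values)
        test_x = pca.transform(scaler.transform(test_x))
        sample_prediction_df.action = optimal_clf.predict(test_x)
    env.predict(sample_prediction_df)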

# Acknowledgements
https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

https://www.kaggle.com/saurabhshahane/voting-classifier-beginners

https://www.kaggle.com/harshitt21/jane-street-basic-eda-xgb-baseline

https://www.kaggle.com/eudmar/jane-street-eda-pca-ensemble-methods

https://www.kaggle.com/gogo827jz/optimise-speed-of-filling-nan-function?scriptVersionId=48926407

https://github.com/datacamp/Machine-Learning-With-XGboost-live-training/blob/master/notebooks/Machine-Learning-with-XGBoost-solution.ipynb

https://www.kaggle.com/marketneutral/purged-time-series-cv-xgboost-optuna

https://www.kaggle.com/miklgr500/optuna-xgbclassifier-parameters-optimize

https://github.com/optuna/optuna/blob/master/examples/xgboost_simple.py