# Market Prediction Using XGBoost 

**Market Basics:** Financial markets are a dynamic world in which investors, speculators, traders, and hedgers apply different strategies to read the market and profit from the opportunities they find. They may rely on fundamental analysis, technical analysis, sentiment analysis, etc. to place their bets. As the volume of data grows, many professionals use it to analyse past trends and predict future prices in order to book a profit.

**Competition Description:** The dataset provided contains a set of features, **feature_{0...129}**, representing real stock market data.
Each row in the dataset represents a trading opportunity, for which we will be predicting an action value: 1 to make the trade and 0 to pass on it.
Each trade has an associated weight and resp, which together represent a return on the trade.
In the training set, **train.csv**, you are provided a **resp** value, as well as several other **resp_{1,2,3,4}** values that represent returns over different time horizons.

In the **test set** we don't have the **resp** value or the other **resp_{1,2,3,4}** columns, so we have to use only **feature_{0...129}** to make predictions.

Trades with **weight = 0** were intentionally included in the dataset for completeness, although such trades **will not** contribute towards the scoring evaluation. So we will ignore them.
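
To make the role of `weight` and `resp` concrete, here is a minimal sketch on made-up numbers. It assumes that a trade contributes `weight * resp * action` to the total return, which is my reading of how these columns combine; the toy values are invented and do not come from train.csv.

```python
import pandas as pd

# Toy trades (made-up numbers, not taken from train.csv)
trades = pd.DataFrame({
    "weight": [0.0, 1.5, 2.0, 0.5],
    "resp":   [0.01, -0.02, 0.03, 0.004],
})

# Label used throughout this notebook: trade only when the return is positive
trades["action"] = (trades["resp"] > 0).astype(int)

# Assumed contribution of each trade to the total return
trades["contribution"] = trades["weight"] * trades["resp"] * trades["action"]

print(trades)
print("Total return:", trades["contribution"].sum())
```

The first row has a positive `resp` but a weight of 0, so it adds nothing to the total. That is why dropping such rows from training loses nothing for scoring purposes.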

**XGBoost classification** is used here with hyperparameter tuning. Please go through the notebook; I have tried to explain every step. If you find this notebook helpful, please **UPVOTE** it!😊

Comments, suggestions, and queries are appreciated. Happy Learning!🎯

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Import Libraries 📂

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import xgboost as xgb
import optuna   
import cudf
import warnings
warnings.filterwarnings("ignore")

# Importing Data 📚
cuDF reads CSV files on the GPU and is usually much faster than pandas for a file of this size, so we will use it and then convert the result to a pandas DataFrame.


In [None]:
# Using cudf 
train_cudf = cudf.read_csv('../input/jane-street-market-prediction/train.csv')
train = train_cudf.to_pandas()
del train_cudf
train = train.astype({c: np.float32 for c in train.select_dtypes(include='float64').columns}) #limit memory use
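
To see what the float32 cast buys, here is a small sketch on a toy frame. The column names and sizes are illustrative only; the real saving depends on how many float64 columns the data has.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (column names are illustrative)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 4)),
                   columns=["feature_0", "feature_1", "feature_2", "resp"])
toy["date"] = np.arange(1000)  # integer column, left untouched by the cast

before = toy.memory_usage(deep=True).sum()

# Same idea as the cell above: cast only the float64 columns to float32
float_cols = toy.select_dtypes(include="float64").columns
toy = toy.astype({c: np.float32 for c in float_cols})

after = toy.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Each float column shrinks to half its size, at the cost of roughly 7 significant decimal digits of precision instead of about 16.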

# Cleaning Data 🪓
`DataFrame.query` can be faster than boolean-mask slicing on large frames (pandas evaluates the expression with numexpr when it is installed), so we will use that.

In [None]:
# Rows with weight = 0 do not count towards the score, so we drop them
train = train.query('weight > 0').reset_index(drop=True)
train.shape
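
For reference, `query` and a boolean mask select exactly the same rows. The toy frame below is made up; any speed difference depends on the frame size and on whether numexpr is installed, so it is worth timing both on your own data.

```python
import pandas as pd

# Made-up rows: two of the four trades have weight 0
toy = pd.DataFrame({"weight": [0.0, 1.2, 0.0, 3.4],
                    "resp":   [0.01, -0.02, 0.03, 0.004]})

by_query = toy.query("weight > 0").reset_index(drop=True)
by_mask = toy[toy["weight"] > 0].reset_index(drop=True)

# Both approaches keep the same two rows
print(by_query.equals(by_mask))
```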

# Understanding Features 📊

In [None]:
TRADING_THRESHOLD = 0.500
train.describe()

In [None]:
#Checking Missing Values in the features
n_features = 45
nan_val = train.isna().sum()[train.isna().sum() > 0].sort_values(ascending=False)
print(nan_val)


fig, axs = plt.subplots(figsize=(10, 10))

sns.barplot(y = nan_val.index[0:n_features], 
            x = nan_val.values[0:n_features], 
            alpha = 0.8
           )

plt.title(f'NaN values of train dataset (Top {n_features})')
plt.xlabel('NaN values')
fig.savefig(f'nan_values_top_{n_features}_features.png')
plt.show()

In [None]:
# Fill the missing values with each column's median
f_median = train.median()
train = train.fillna(f_median)
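
A tiny made-up example shows what median filling does column by column. The median is less affected by outliers (such as the 100.0 below) than the mean would be.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap per column
toy = pd.DataFrame({"feature_0": [1.0, np.nan, 3.0, 100.0],
                    "feature_1": [np.nan, 2.0, 4.0, 6.0]})

medians = toy.median()          # NaN values are skipped by default
filled = toy.fillna(medians)    # each column is filled with its own median

print(filled)
```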

# Creating Train and Validation DataFrames

In [None]:
# Generate 0 or 1 values from the resp column and store them in the 'action' column
# This column serves as our target label
train['action'] = (train['resp'] > 0).astype('int')

In [None]:
X = train.loc[:, train.columns.str.contains('feature')]
y = train.loc[:, 'action']

# Splitting X,y into train and validation data 
x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state = 42)
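
A random split puts trades from the same day into both sets, which can make validation accuracy look optimistic on time-ordered data. As an alternative, here is a sketch of a split by date. It assumes the frame has an integer `date` column, as train.csv does; the helper name `split_by_date` and the 80/20 cutoff are my own choices for illustration, not something this notebook uses.

```python
import numpy as np
import pandas as pd

def split_by_date(df, date_col="date", valid_frac=0.2):
    """Hold out the most recent `valid_frac` of distinct dates for validation."""
    dates = np.sort(df[date_col].unique())
    n_valid = max(1, int(round(len(dates) * valid_frac)))
    cutoff = dates[-n_valid]                 # first date of the validation period
    train_part = df[df[date_col] < cutoff]
    valid_part = df[df[date_col] >= cutoff]
    return train_part, valid_part

# Toy data: 10 days with 3 trades each
toy = pd.DataFrame({"date": np.repeat(np.arange(10), 3),
                    "feature_0": np.arange(30, dtype=float)})
train_part, valid_part = split_by_date(toy)
print(train_part["date"].max(), valid_part["date"].min())
```

With this kind of split the model is always validated on days it has never seen, which is closer to how the test set is served.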

In [None]:
# Feature medians that could be used to fill NaN values in the unseen test data
# (not used in this notebook)
test_median = X.median()

# Exploratory Data Analysis 📈

In [None]:
# Check whether the target class is balanced in the training data
sns.set_palette("hls")
class_share = y_train.value_counts(normalize=True)
ax = sns.barplot(x=class_share.index, y=class_share.values)  # keyword args are required by recent seaborn
ax.set_title("Proportion of trades with action=0 and action=1")
ax.set_ylabel("Proportion")
ax.set_xlabel("Action")
sns.despine();

In [None]:
# Resp analysis
resp_df = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(20, 10))
for i, column in enumerate(resp_df):
    # histplot with kde=True replaces the deprecated sns.distplot
    sns.histplot(train[column], kde=True, ax=axes[i // 3, i % 3])
axes[1, 2].axis('off')  # the sixth subplot is unused, so hide it

In [None]:
# Cumulative return analysis
resp_cols = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']

fig, ax = plt.subplots(figsize=(16, 8))
train[resp_cols].cumsum().plot(ax=ax, linewidth=2)

ax.set_xlabel("Trade", fontsize=12)
ax.set_title("Cumulative Trade Returns", fontsize=18)
ax.legend(loc="upper left");

The cumulative curves of `resp` and `resp_4` follow each other closely, so `resp` alone is a reasonable basis for our 'action' variable.
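
The visual impression can be backed with a number by computing pairwise correlations. The sketch below uses synthetic stand-in data, where `resp_4` is deliberately built as a noisy copy of `resp`, so the values it prints say nothing about the real dataset. On the real data you would call `.corr()` on the five resp columns of `train` instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-ins: 'resp_4' is built as a noisy copy of 'resp',
# 'resp_1' is independent noise (made-up data, only to show the call)
resp = rng.normal(scale=0.02, size=5000)
toy = pd.DataFrame({
    "resp": resp,
    "resp_4": resp + rng.normal(scale=0.005, size=5000),
    "resp_1": rng.normal(scale=0.02, size=5000),
})

corr = toy.corr()  # Pearson correlation by default
print(corr.round(2))
```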


# Training XGBClassifier | Using Optuna for Hyperparameter Tuning

In [None]:
# Build XGBoost's DMatrix data format from the pandas DataFrames to optimise memory consumption
dtrain = xgb.DMatrix(x_train, label=y_train)
dvalid = xgb.DMatrix(x_valid, label=y_valid)

In [None]:
from sklearn.metrics import accuracy_score

def objective(trial):
    # Search space for the XGBoost hyperparameters to be tuned
    params = {
        'max_depth': trial.suggest_int('max_depth', 10, 20),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_int('gamma', 0, 10),
        'tree_method': 'gpu_hist',
        'objective': 'binary:logistic'
    }
    # xgb.train ignores 'n_estimators'; the number of trees is set via num_boost_round
    n_estimators = trial.suggest_int('n_estimators', 400, 600)

    bst = xgb.train(params, dtrain, num_boost_round=n_estimators)
    preds = bst.predict(dvalid)      # predicted probabilities of action = 1
    pred_labels = np.rint(preds)     # round to 0/1 labels
    # Each trial is scored by its accuracy on the validation set
    return accuracy_score(y_valid, pred_labels)
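
A short note on the rounding step: for probabilities between 0 and 1, `np.rint` gives the same labels as an explicit cut-off at 0.5, such as the `TRADING_THRESHOLD` constant defined earlier. The sketch below checks this on made-up probabilities; an explicit threshold is easier to change later if you want to trade more or less aggressively.

```python
import numpy as np

TRADING_THRESHOLD = 0.5

# Made-up predicted probabilities and true labels
preds = np.array([0.10, 0.49, 0.50, 0.51, 0.90])
y_true = np.array([0, 1, 0, 1, 1])

labels_rint = np.rint(preds).astype(int)                 # rounds 0.5 down to 0 (half to even)
labels_thresh = (preds > TRADING_THRESHOLD).astype(int)  # explicit cut-off

accuracy = (labels_thresh == y_true).mean()
print(labels_rint, labels_thresh, accuracy)
```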

In [None]:
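# The objective returns accuracy, so the study must maximize it (Optuna minimizes by default)
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=5)
# You can increase the n_trials parameter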

In [None]:
print('Best trial: score {}, params {}'.format(study.best_trial.value, study.best_trial.params))

Setting `tree_method` to `gpu_hist` makes XGBoost build its trees on the GPU, which speeds up training considerably!

In [None]:
best_params = study.best_trial.params
best_params['tree_method'] = 'gpu_hist'      #gpu_hist is really fast
best_params['objective'] = 'binary:logistic'
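
One caveat if you run this on a newer setup: XGBoost 2.0 and later select the GPU through the `device` parameter together with `tree_method='hist'`, and `gpu_hist` is deprecated there. The helper below is a small sketch of how to pick the right settings; the name `gpu_tree_params` is my own and not part of XGBoost.

```python
def gpu_tree_params(xgb_version):
    """Return GPU training parameters suited to the given XGBoost version string."""
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0+ selects the GPU through the 'device' parameter
        return {"tree_method": "hist", "device": "cuda"}
    # Older releases use the dedicated 'gpu_hist' tree method
    return {"tree_method": "gpu_hist"}

print(gpu_tree_params("1.7.6"))
print(gpu_tree_params("2.0.3"))
```

In the notebook this could be applied with `best_params.update(gpu_tree_params(xgb.__version__))`.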

In [None]:
del x_train, x_valid, y_train, y_valid, dtrain, dvalid  #free some space

In [None]:
# Create the XGBoost classifier with the best hyperparameters found
clf = xgb.XGBClassifier(**best_params)

In [None]:
%time clf.fit(X, y)  #Used the whole training data

# Predicting on test data

In [None]:
from tqdm import tqdm
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:
for (test_df, pred_df) in tqdm(iter_test):
    if test_df['weight'].item() > 0:
        X_test = test_df.loc[:, test_df.columns.str.contains('feature')]
        y_preds = clf.predict(X_test)
        pred_df.action = y_preds
    else:
        pred_df.action = 0
    env.predict(pred_df)
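
The decision made inside the loop can be factored into a small function, which makes it easy to test without the `janestreet` environment. This is only a sketch: `decide_action` and the `AlwaysTrade` stand-in model are hypothetical names of mine, and unlike the loop above it also fills missing features with medians (for example the `test_median` computed earlier).

```python
import numpy as np
import pandas as pd

class AlwaysTrade:
    """Hypothetical stand-in for the fitted classifier; predicts 1 for every row."""
    def predict(self, X):
        return np.ones(len(X), dtype=int)

def decide_action(test_df, model, feature_medians):
    """Return 0 for zero-weight rows, otherwise the model's prediction."""
    if test_df["weight"].item() <= 0:
        return 0
    features = test_df.loc[:, test_df.columns.str.contains("feature")]
    features = features.fillna(feature_medians)   # impute gaps with training medians
    return int(model.predict(features)[0])

# Made-up single-row frames, shaped like what the test iterator yields
medians = pd.Series({"feature_0": 0.0, "feature_1": 1.0})
row_skip = pd.DataFrame({"weight": [0.0], "feature_0": [0.3], "feature_1": [np.nan]})
row_trade = pd.DataFrame({"weight": [2.0], "feature_0": [0.3], "feature_1": [np.nan]})

print(decide_action(row_skip, AlwaysTrade(), medians),
      decide_action(row_trade, AlwaysTrade(), medians))
```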

# If you find this notebook helpful then let me know through comments. It motivates me to put out more such work!😊

# Queries/Suggestions are appreciated. Happy Learning!✌