## Data Description

This dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represent a return on the trade. The date column is an integer which represents the day of the trade, while ts_id represents a time ordering. In addition to anonymized feature values, you are provided with metadata about the features in features.csv.

In the training set, train.csv, you are provided a resp value, as well as several other resp_{1,2,3,4} values that represent returns over different time horizons. These variables are not included in the test set. Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation.
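
To make the column roles above concrete, here is a minimal sketch on a toy frame. The numbers are invented and do not come from train.csv. It assumes that a trade's contribution is the product of its weight, its resp, and the chosen action, which is one reading of the description above; the official scoring formula is not reproduced here.

```python
import pandas as pd

# Toy rows with made-up values; the real data comes from train.csv
toy = pd.DataFrame({
    "weight": [0.0, 1.5, 2.0, 0.5],
    "resp":   [0.02, -0.01, 0.03, 0.00],
})

# action = 1 when the return is positive, otherwise 0
toy["action"] = (toy["resp"] > 0).astype(int)

# Weighted return that is realised only when the trade is taken (assumption)
toy["weighted_return"] = toy["weight"] * toy["resp"] * toy["action"]

# Zero-weight rows contribute nothing, whatever their resp or action
scored = toy[toy["weight"] != 0]

print(toy)
print("rows with non-zero weight:", len(scored))
```

The first row has a positive resp but a weight of 0, so its weighted return is 0 even though its action is 1.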


For an extensive analysis of the Jane Street market dataset, see [EDA-Quantitative-Researcher-prespective](https://www.kaggle.com/huzzefakhan/eda-quantitative-researcher-prespective). That notebook goes through the train and features CSVs for an extensive exploratory data analysis, with some data cleaning and preprocessing along the way. In this notebook I use XGBoost to identify features to be used for supervised learning. This is the first version of the model; I am also trying to add some findings from my EDA to this modeling exercise.


## About me

I work as a Data Scientist at an IT firm in Pakistan. I was recently engaged with Radix Trading LLC, a firm that, like Jane Street, works in high-frequency algorithmic trading. There I worked as a Quantitative Researcher (Quant), capturing price movements in high-frequency algorithmic trading through alphas. I designed many successful alphas/strategies that are tradeable in the real stock market. For more details, kindly visit my LinkedIn profile.

Please upvote if you find this notebook helpful! 😊 Thank you!


In [None]:
import numpy as np  
import pandas as pd  
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics  # load the metrics submodule explicitly (used for accuracy_score below)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import datatable as dt
import xgboost as xgb
import optuna

In [None]:
# Import dataset as train
train = dt.fread('../input/jane-street-market-prediction/train.csv').to_pandas()
train.info()

In [None]:
train.head()

In [None]:
# Restrict the data to trades with date == 1; everything below is fit on this single day
train = train[train["date"] == 1]

In [None]:
# Drop rows with weight == 0;
# such trades do not contribute towards the scoring evaluation
train = train[train['weight'] != 0].copy()  # .copy() avoids SettingWithCopyWarning below

# Create the 'action' column (dependent variable): 1 if resp > 0, else 0
train['action'] = (train['resp'] > 0).astype(int)

In [None]:
features = [col for col in list(train.columns) if 'feature' in col]

In [None]:
X = train[features]
y = train['action']

# Next, we hold out part of the training data to form the hold-out validation set
x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state = 42)


In [None]:
# First, check whether the target class is balanced in the training data
sns.set_palette("colorblind")
class_share = y_train.value_counts(normalize=True)  # share of each action value
# Keyword arguments are required by recent seaborn versions
ax = sns.barplot(x=class_share.index, y=class_share.values)
ax.set_title("Proportion of trades with action=0 and action=1")
ax.set_ylabel("Proportion")
ax.set_xlabel("Action")
sns.despine();
# Target class is fairly balanced, with almost 50% of trades corresponding to each action

In [None]:
train_median = x_train.median()
# Impute medians in both training set and the hold-out validation set
x_train = x_train.fillna(train_median)
x_valid = x_valid.fillna(train_median)
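
The imputation step above computes medians on the training split only and reuses them for the validation split, so no statistic from the held-out rows leaks into preprocessing. A minimal sketch of that pattern on toy frames follows; the column values and the names `train_part` and `valid_part` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy splits with missing values (made-up numbers)
train_part = pd.DataFrame({"feature_0": [1.0, np.nan, 3.0, 100.0],
                           "feature_1": [0.5, 0.7, np.nan, 0.9]})
valid_part = pd.DataFrame({"feature_0": [np.nan, 2.0],
                           "feature_1": [np.nan, np.nan]})

# Medians come from the training split only; NaNs are skipped by default
medians = train_part.median()

# fillna with a Series fills each column with the value under its own label
train_filled = train_part.fillna(medians)
valid_filled = valid_part.fillna(medians)

print(medians)
print(valid_filled)
```

Note that `feature_1` is entirely missing in the toy validation split, yet it can still be filled because the medians do not depend on that split.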

In [None]:
dtrain = xgb.DMatrix(x_train, label=y_train)
dvalid = xgb.DMatrix(x_valid, label=y_valid)

In [None]:
def objective(trial):
    # Number of boosting rounds. xgb.train takes this as num_boost_round;
    # an 'n_estimators' entry in params is not used by the native API.
    n_estimators = trial.suggest_int('n_estimators', 200, 600)

    # params specifies the XGBoost hyperparameters to be tuned
    params = {
        'max_depth': trial.suggest_int('max_depth', 10, 25),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1),
        'subsample': trial.suggest_float('subsample', 0.50, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.50, 1),
        'gamma': trial.suggest_int('gamma', 0, 10),
        'tree_method': 'gpu_hist',  # requires a GPU runtime
        'objective': 'binary:logistic'
    }

    bst = xgb.train(params, dtrain, num_boost_round=n_estimators)
    preds = bst.predict(dvalid)    # predicted probabilities of action=1
    pred_labels = np.rint(preds)   # round to the nearest class label
    accuracy = sklearn.metrics.accuracy_score(y_valid, pred_labels)
    return accuracy
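
The objective above turns predicted probabilities into labels with `np.rint`, while the submission loop later compares against an explicit threshold `th`. The two are not quite identical: `np.rint` rounds halves to the nearest even integer, so a probability of exactly 0.5 becomes 0, whereas `>= 0.5` maps it to 1. A small sketch with invented probabilities and labels shows the difference.

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.5000001, 0.9])  # made-up predicted probabilities
y_true = np.array([0, 1, 1, 1])               # made-up true labels

labels_rint = np.rint(probs).astype(int)      # 0.5 rounds to 0 (half to even)
labels_thr = (probs >= 0.5).astype(int)       # 0.5 maps to 1

# Accuracy is the share of labels that match
acc_rint = (labels_rint == y_true).mean()
acc_thr = (labels_thr == y_true).mean()

print(labels_rint, acc_rint)
print(labels_thr, acc_thr)
```

In practice exact ties are rare for floating-point probabilities, so this mostly matters when you want the tuning metric and the submission rule to be defined the same way.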

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25, timeout=600)

print("Number of finished trials: ", len(study.trials))
print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

In [None]:
best_params = trial.params
best_params['tree_method'] = 'gpu_hist' 
best_params['objective'] = 'binary:logistic'

In [None]:
# Fit the XGBoost classifier with optimal hyperparameters
clf = xgb.XGBClassifier(**best_params)

In [None]:
clf.fit(x_train, y_train)

In [None]:
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:
sample_prediction_df = pd.read_csv('../input/jane-street-market-prediction/example_sample_submission.csv')

In [None]:
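th = 0.5
for (test_df, pred_df) in tqdm(iter_test):
    if test_df['weight'].item() > 0:
        X_test = test_df.loc[:, test_df.columns.str.contains('feature')]
        X_test = X_test.fillna(train_median)  # same imputation as during training
        # Probability of action=1; clf.predict would already return hard 0/1 labels
        y_preds = clf.predict_proba(X_test)[:, 1]
        pred_df.action = np.where(y_preds >= th, 1, 0).astype(int)
    else:
        pred_df.action = 0
    env.predict(pred_df)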
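
The per-row decision in the loop above can be tried offline without the `janestreet` environment. The sketch below is only an illustration: `decide_action` and `fake_proba` are hypothetical names, the scorer is a stand-in rather than a fitted classifier, and filling missing values with training medians is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def decide_action(test_df, medians, predict_proba, th=0.5):
    """Return 0 or 1 for a single-row frame.

    `predict_proba` is any callable mapping a feature frame to
    P(action=1) per row (a hypothetical stand-in for a model).
    """
    if test_df["weight"].item() <= 0:
        return 0  # unscored trade: skip the model call
    X = test_df.loc[:, test_df.columns.str.contains("feature")]
    X = X.fillna(medians)  # reuse medians learned on the training split
    return int(predict_proba(X)[0] >= th)

# Made-up medians and a made-up scorer (logistic of the feature sum)
medians = pd.Series({"feature_0": 0.1, "feature_1": -0.3})

def fake_proba(X):
    return 1.0 / (1.0 + np.exp(-X.sum(axis=1).to_numpy()))

# One-row frame shaped like what the test iterator yields (values invented)
row = pd.DataFrame({"weight": [2.0], "feature_0": [1.2], "feature_1": [np.nan]})

print(decide_action(row, medians, fake_proba))
print(decide_action(row.assign(weight=0.0), medians, fake_proba))
```

Keeping the decision in one small function makes it easy to check the weight gate and the threshold on hand-built rows before running the full submission loop.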
