## Assumptions
- This dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represents a return on the trade. The date column is an integer which represents the day of the trade, while ts_id represents a time ordering. In addition to anonymized feature values, you are provided with metadata about the features in features.csv.

- In the training set, train.csv, you are provided a resp value, as well as several other resp_{1,2,3,4} values that represent returns over different time horizons. These variables are not included in the test set. Trades with weight = 0 were intentionally included in the dataset for completeness, although such trades will not contribute towards the scoring evaluation.

### Caveats

- "Each trade has an associated weight and resp, which together represents a return on the trade". What exactly is the "nature" of this association? Is it some kind of partnership between the two values, or something else?
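
A partial answer comes from the competition's evaluation page. As far as I understand it, weight and resp are multiplied together with the chosen action, summed per date, and turned into a utility score. The sketch below is my reading of that formula, not official code; the function name `utility_score` and the toy data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def utility_score(date, weight, resp, action):
    """Sketch of the competition utility, as I read the evaluation page.

    p_i = sum_j(weight_ij * resp_ij * action_ij) for each date i
    t   = sum(p_i) / sqrt(sum(p_i^2)) * sqrt(250 / number_of_dates)
    u   = min(max(t, 0), 6) * sum(p_i)
    """
    per_trade = np.asarray(weight) * np.asarray(resp) * np.asarray(action)
    p_i = pd.DataFrame({"date": date, "p": per_trade}).groupby("date")["p"].sum()
    denom = np.sqrt((p_i ** 2).sum())
    if denom == 0:
        # No trade taken (or all returns are zero): nothing gained or lost
        return 0.0
    t = p_i.sum() / denom * np.sqrt(250 / len(p_i))
    return float(min(max(t, 0), 6) * p_i.sum())

# Toy example: three trades over two days, the second trade is passed on
toy_score = utility_score(date=[0, 0, 1],
                          weight=[1.0, 2.0, 1.0],
                          resp=[0.1, -0.1, 0.2],
                          action=[1, 0, 1])
print(toy_score)
```

Under this reading, a trade only matters through the product `weight * resp`, which is also why trades with weight = 0 cannot affect the score.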

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install --upgrade pip

In [None]:
!pip install seaborn==0.11 > /dev/null

In [None]:
import seaborn as sns
sns.set_style('whitegrid')
sns.__version__

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
import plotly as py
import plotly.express as px
import seaborn as sns
!pip install datatable > /dev/null
import datatable as dt
import gc

pd.options.display.max_columns = 999



In [None]:
%%time
train_data = dt.fread('../input/jane-street-market-prediction/train.csv').to_pandas()
features = pd.read_csv('../input/jane-street-market-prediction/features.csv', index_col=0)


In [None]:
#del train_data_datatable
#gc.collect()

In [None]:
## Let's get some info from train_data
train_data.head()

In [None]:
## And what about our "example_test"
example_test = pd.read_csv('../input/jane-street-market-prediction/example_test.csv')
example_test.head()

In [None]:
#____________________________________
# What about features
features.head()

In [None]:
(features*1).T.head()

- I invite you to take a look at Carl McBride's notebook to get more info on this dataset; he did a great job:
https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance/notebook
- My conclusion: this dataset might "hide" some precious information, but so far I have no clue how to exploit it.
- I also invite you to look at Mathias's notebook to get a better understanding of "feature_0"; he did a great job too:
https://www.kaggle.com/nanomathias/feature-0-beyond-feature-0/notebook

- First observation when comparing the "train" and "example" datasets: the variables resp_1, resp_2, resp_3, resp_4, resp and date are not available in example_test, which excludes these variables from our model.
- Nevertheless, we'll keep "date" in train_data for the EDA.
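
This kind of comparison can be checked in code instead of by eye. Below is a minimal sketch; the helper name `columns_only_in_train` and the two toy frames are assumptions made for illustration. In the notebook it would be called on `train_data` and `example_test`.

```python
import pandas as pd

def columns_only_in_train(train_df, test_df):
    """Return the columns present in train_df but absent from test_df, in train order."""
    test_cols = set(test_df.columns)
    return [col for col in train_df.columns if col not in test_cols]

# Tiny hypothetical frames standing in for train_data and example_test
toy_train = pd.DataFrame(columns=["weight", "resp_1", "resp", "feature_0", "ts_id"])
toy_test = pd.DataFrame(columns=["weight", "feature_0", "ts_id"])
print(columns_only_in_train(toy_train, toy_test))
```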

## EDA

In [None]:
# Let me create a sample from train for EDA
train_sample = train_data.sample(frac=.05, random_state=10).copy()

In [None]:
# in addition to :
# "Trades with weight = 0 were intentionally included in the dataset for completeness, 
# although such trades will not contribute towards the scoring evaluation."
train_sample = train_sample[train_sample['weight']!=0]

In [None]:
train_sample.describe()

In [None]:
train_sample.head()

In [None]:
#______________________
# Save sample dataset
train_sample.to_csv('/kaggle/working/my_train_sample.csv', index=False)

## Crosscheck both datasets

In [None]:
## Check dtypes 
print('Train sample dtypes: \n{}'.format(train_sample.dtypes.value_counts()))
print('Train data dtypes: \n{}'.format(train_data.dtypes.value_counts()))
print('-'*20)

In [None]:
# Check missing values
print('Columns with NaN (Train): %d' %train_data.isna().any().sum())
print('Columns with NaN (train_sample): %d' %train_sample.isna().any().sum())

- Comparing missing values between the source data and our sample, the sample has one fewer column with NaN: one variable that has missing values in the source dataset has none in my sample.
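
To find out which variable this is, the two sets of NaN columns can be compared. A minimal sketch follows; the helper name `nan_columns_lost_in_sample` and the toy data are illustrative assumptions. In the notebook it would be called as `nan_columns_lost_in_sample(train_data, train_sample)`.

```python
import numpy as np
import pandas as pd

def nan_columns_lost_in_sample(full_df, sample_df):
    """Columns that contain NaN in full_df but have no NaN left in sample_df."""
    full_nan = set(full_df.columns[full_df.isna().any()])
    sample_nan = set(sample_df.columns[sample_df.isna().any()])
    return sorted(full_nan - sample_nan)

# Toy data: column "a" loses its only NaN once rows 0 and 2 are sampled
toy_full = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                         "b": [np.nan, 2.0, 3.0, np.nan],
                         "c": [1.0, 2.0, 3.0, 4.0]})
toy_sample = toy_full.iloc[[0, 2]]
print(nan_columns_lost_in_sample(toy_full, toy_sample))
```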

In [None]:
#_____________________
## Plot Missing Values 
def find_missing(data):
    # number of missing values
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # percentage of missing
    ratio_missing = count_missing/total
    # return a dataframe to show: feature name, # of missing and % of missing
    return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing}, index=data.columns.values)


In [None]:
#___________________________
# Plot top 20 missing ratio
find_missing(train_sample).sort_values(by='missing_ratio', ascending=False).head(20)['missing_ratio'].plot.barh(figsize=(12,8))
plt.xlabel("Missing Values (ratio)")
plt.title("Missing Ratio Train Sample");

- 6 features show a missing ratio over 16%; for features 84 to 96 the ratio is between 14% and 16%.
- For the remaining features, the missing ratio drops below 0.04%.
- We'll use a missing-value imputation strategy to fill the gaps.
- Otherwise, missing values are relatively balanced between each "group".
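
One possible imputation strategy is to fill each column with its median, computed on the training data only. This is a sketch under that assumption, not the approach used later in this notebook (the baseline model further down simply fills with -999). The helper name and toy frames are illustrative.

```python
import numpy as np
import pandas as pd

def impute_with_train_median(train_df, other_df):
    """Fill NaN in both frames with per-column medians computed on train_df only."""
    medians = train_df.median(numeric_only=True)
    return train_df.fillna(medians), other_df.fillna(medians)

toy_train = pd.DataFrame({"feature_x": [1.0, np.nan, 3.0]})
toy_other = pd.DataFrame({"feature_x": [np.nan, 10.0]})
train_filled, other_filled = impute_with_train_median(toy_train, toy_other)
print(train_filled["feature_x"].tolist(), other_filled["feature_x"].tolist())
```

Computing the medians on the training frame alone keeps information from the held-out rows out of the fill values.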
***
# Univariate analysis
### Data Frequency Distribution 

In [None]:
#______________________________
# Plot Numerical
def plot_numerical(data, col, size=(10, 6), bins='auto'):
    """Plot the distribution of a numerical feature as a histogram with a KDE overlay."""
    sns.set_style('whitegrid')
    plt.figure(figsize=size)
    plt.title("Distribution of %s" % col, fontsize=22, fontweight="bold")
    # Drop NaN values before plotting so they do not affect the binning
    sns.histplot(data[col].dropna(), kde=True, bins=bins)
    #plt.savefig("Distribution %s" % col + ".png", bbox_inches='tight')
    plt.show()

In [None]:
plot_numerical(train_sample, 'resp')

- Values are mostly distributed between -0.1 and 0.1, with extreme values around +/- 0.4.

In [None]:
plot_numerical(train_sample, 'date')

- The timeline runs from day 0 to day 500, with the lowest frequency between day 100 and day 150.
- This also means that some days have more trades than others.
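
The per-day trade counts behind this observation can also be read directly instead of from the histogram. A minimal sketch, with an assumed helper name `trades_per_day` and toy data; in the notebook it would be called on `train_sample`.

```python
import pandas as pd

def trades_per_day(df, date_col="date"):
    """Number of rows (trading opportunities) for each value of date_col."""
    return df.groupby(date_col).size().sort_index()

toy = pd.DataFrame({"date": [0, 0, 0, 1, 2, 2]})
counts = trades_per_day(toy)
# Day with the fewest trades and its count
print(counts.idxmin(), counts.min())
```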

In [None]:
plot_numerical(train_sample, 'weight')

In [None]:
plot_numerical(train_sample, 'feature_1')

## Bivariate Analysis
> Considering :
- "Each row in the dataset represents a trading opportunity, for which you will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represents a return on the trade"

- Our target, meaning what we intend to predict, is the action value, which depends on the association of weight and resp representing the return on the trade.
- I have some doubts here, as I saw some notebooks using only "resp" as the target.
- On the other hand, the so-called association between resp and weight is not clearly explained, so if Jane Street could provide some more details on it, that would be great.
- In the meantime, I'll use "resp" as the target.

In [None]:
#________________________________
# Create target and/or "action" 
train_sample['action'] = np.where(train_sample['resp']>0,1,0)

In [None]:
#train_sample.drop(columns=('target'), inplace=True)

## Correlation distribution between Target and:
- Weight
- Resp
- Date 
- Feature_0

We'll come back to this subject after analyzing the correlations between features, which will provide more accurate information on which features we should focus on.

In [None]:
def plot_numerical_bylabel(data, col, target, size=(12, 6)):
    """Plot the density of `col` for target <= 0 and target > 0, and print their correlation."""
    # Calculate the correlation coefficient between the variable and the target
    corr = data[target].corr(data[col])

    # Create the figure once, using the requested size
    plt.figure(figsize=size)

    # Plot the distribution for target <= 0 and target > 0
    sns.kdeplot(data.loc[data[target] <= 0, col], label=str(target) + ' <= 0')
    sns.kdeplot(data.loc[data[target] > 0, col], label=str(target) + ' > 0')

    # Label the plot
    plt.xlabel(col); plt.ylabel('Density'); plt.title('%s Distribution' % col)
    plt.legend()
    # Print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (col, corr))

In [None]:
plot_numerical_bylabel(train_sample, 'weight', 'action')

In [None]:
plot_numerical_bylabel(train_sample, 'resp', 'action')

In [None]:
plot_numerical_bylabel(train_sample, 'date', 'action')

In [None]:
plot_numerical_bylabel(train_sample, 'feature_0', 'action')

- feature_0 has almost the same density for target values greater than 0 and for target values less than or equal to 0.

In [None]:

sns.lmplot(x='resp', y='weight', data=train_sample.head(1000), hue='action', fit_reg=True)

- Bigger weights are commonly associated with resp values between +/- 0.05.
- Higher weight values are not only associated with a positive response.
- Weight values range between 0 and ~130.

In [None]:
np.corrcoef(train_sample['feature_0'], train_sample['resp']),np.corrcoef(np.where(train_sample['feature_0']>0,1,0), train_sample['resp']>0)

- Even if they have almost the same density, unfortunately there is no correlation between these 2 features, which is a pity.

## Multivariate analysis - Correlation Matrix

In [None]:
#___________________________
# Create a list of "droppable" columns
#_________________
# Drop unnecessary variables 
drop_col = [col for col in train_sample if col.startswith('resp') or col in ('date','ts_id')]
drop_col

In [None]:
# Find correlations with the target and sort
correlations = train_sample.drop(columns=(drop_col)).corr()['action'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

In [None]:
%%time
#______________________
# Pairplot Top 10 most correlated features 
cols_to_pairplot = pd.DataFrame(correlations.tail(10)).reset_index()['index'].to_list()
sns.pairplot(train_sample.loc[:,cols_to_pairplot], hue = "action")

- At this point, I started asking myself questions.
- We are not able to find distinct "regions" for each label (1 and 0) among these features, which are the most correlated with our target.
- The way target values are distributed across these pairs of features looks completely random, even if some regions are more populated than others.
- So my question is: how, and which, features contribute most to predictions, knowing that none of these seem to be good candidates, being only weakly correlated?
- One idea is to compare correlations between trading days.
- I'll take inspiration from Carl McBride's work to make my point.
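
One way to compare correlations between trading days is to compute the feature/target correlation separately for each date. The sketch below assumes a hypothetical helper `daily_target_correlation` and toy data; in the notebook it could be called as `daily_target_correlation(train_sample, 'feature_27')`.

```python
import pandas as pd

def daily_target_correlation(df, feature, target="action", date_col="date"):
    """Pearson correlation between `feature` and `target`, computed per trading day."""
    grouped = df.groupby(date_col)[[feature, target]]
    return grouped.apply(lambda day: day[feature].corr(day[target]))

# Toy data: the same feature is positively correlated with the target on day 0
# and negatively correlated on day 1
toy = pd.DataFrame({
    "date":      [0, 0, 0, 1, 1, 1],
    "feature_x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "action":    [0, 0, 1, 1, 0, 0],
})
per_day = daily_target_correlation(toy, "feature_x")
print(per_day)
```

If the sign or size of the correlation changes a lot from day to day, a single correlation over the whole sample will hide it.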


In [None]:
train_sample[train_sample['date']==499].head(3)

In [None]:
train_sample[train_sample['date']==125].head(3)

In [None]:
pd.concat([train_sample[train_sample['date']==125], train_sample[train_sample['date']==499]])

In [None]:
pd.concat([train_sample[train_sample['date']==125], train_sample[train_sample['date']==499]]).drop(columns=(drop_col)).head(5).corr(method='pearson').\
style.background_gradient(cmap='coolwarm', axis=None).set_precision(2)

- Which highly correlated features do these 2 days have in common?
## Find the pairs of features with a correlation > 0.99:

In [None]:
# code from: https://izziswift.com/list-highest-correlation-pairs-from-a-large-correlation-matrix-in-pandas/
def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

In [None]:
#_________________________________
corrFilter(pd.concat([train_sample[train_sample['date']==125], train_sample[train_sample['date']==499]]).drop(columns=(drop_col)), .99).to_frame()

- I need to do some more research on this subject: depending on the day of the trade, one doesn't get approximately the same features (compared with McBride's results).
- Let's build a baseline Model to get some more information about features

In [None]:
#_______________________________________________
# target Distribution
train_sample['action'].value_counts().plot.bar()

- Our dataset is well balanced, even though there is a slight difference between the number of rows for each label, only visible when zooming in :)

# Build baseline model

In [None]:
drop_col

In [None]:
drop_col = [col for col in train_sample if col.startswith('resp') or col in ('date','ts_id','action')]
drop_col

In [None]:
#______________________
# Setting variables 
X_train_sample = train_sample.drop(columns=(drop_col))
X_train_sample.head()

In [None]:
y_train_sample = train_sample['action']

In [None]:
y_train_sample[:2]

## Cross validation

In [None]:
from xgboost import XGBClassifier

In [None]:
xgbclass = XGBClassifier(subsample=.5,
                             learning_rate =  0.05,
                             n_estimators = 500,
                             missing = -999,
                             objective = 'binary:logistic'
                             #tree_method = 'gpu_hist'
                            )

In [None]:
from sklearn.metrics import f1_score, roc_auc_score, auc, precision_score, \
recall_score, roc_curve, multilabel_confusion_matrix, classification_report, \
confusion_matrix, precision_recall_fscore_support

In [None]:
from sklearn.metrics import make_scorer, f1_score, roc_auc_score
scoring = {
               'f1': make_scorer(f1_score, average='binary'),
               # Note: this scorer receives hard 0/1 predictions, not predicted probabilities
               'roc_auc': make_scorer(roc_auc_score)
               }

In [None]:
from sklearn.model_selection import cross_val_score, cross_validate

In [None]:

%%time
cross_validate(xgbclass, X_train_sample.fillna(-999), y_train_sample, 
               return_train_score=False, return_estimator=False, 
               scoring=scoring)


In [None]:
import numpy as np
np.asarray([0.53149275, 0.52729442, 0.53138409, 0.52497502, 0.53380748]).mean().round(2)

- An F1 score of 53%: not so bad for a "blind shot".
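
To put the 53% in context, it can help to compare it with a trivial predictor. The sketch below computes the F1 score of always predicting 1 on balanced labels. Both the toy labels and the hand-written `f1_binary` helper are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """Binary F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Balanced toy labels, similar in spirit to the 'action' distribution above
y_true = np.array([0, 1] * 50)
always_one = np.ones_like(y_true)
baseline_f1 = f1_binary(y_true, always_one)
print(round(baseline_f1, 3))
```

On balanced labels this constant predictor reaches an F1 of about 0.67, so a reference value of this kind is worth keeping next to the cross-validated score.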

## Split and fit

In [None]:
## Split the dataset into train and test sets with train_test_split
from sklearn.model_selection import train_test_split

In [None]:
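# Fill NaN with -999 to match the classifier's `missing=-999` setting (as done for cross-validation)
X_train, X_test, y_train, y_test = train_test_split(X_train_sample.fillna(-999), y_train_sample, test_size=0.2, 
                                                    random_state=42, shuffle=True, stratify = y_train_sample )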

In [None]:
%%time
#___________________________________
# We fit our sample data 
xgbclass.fit(X_train, y_train)

### Features Importance

In [None]:
# Pair each feature name with its importance score
importances = pd.DataFrame({'feature': list(X_train.columns), 'score': xgbclass.feature_importances_})
# Note: [:30] keeps the first 30 columns in column order (not the 30 most important) before sorting
importances[:30].sort_values(by='score', ascending=False).plot.barh(x='feature', figsize=(12,10))

- Surprisingly, only feature_27 (the most correlated feature in the correlation analysis) stands out as important for this model.

### Confusion Matrix

In [None]:
%%time
preds =  xgbclass.predict(X_test)

In [None]:
print('ROC AUC score: %.3f' 
      %roc_auc_score(y_test, xgbclass.predict(X_test)))

In [None]:
label = [0,1]
# `labels` is passed by keyword, as recent scikit-learn versions require
cf_matrix = confusion_matrix(y_test, preds, labels=label)
pd.DataFrame(cf_matrix/np.sum(cf_matrix))

## Save model

In [None]:
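import pickle
# Use a context manager so the file is closed once the model is written
with open('/kaggle/working/xgbclassifier_janestreet.pickle', 'wb') as model_file:
    pickle.dump(xgbclass, model_file)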

## Final model evaluation and submitting 
- The following lines show how to submit your work.
- Good luck

In [None]:
# create the environment
#import janestreet
#print('Creating competition environment...', end='')
#env = janestreet.make_env()
#print('Finished.')
#or
#import janestreet
#env = janestreet.make_env() # initialize the environment
#iter_test = env.iter_test() # an iterator which loops over the test set

In [None]:
#features = X_train_sample.columns.to_list()
#features

In [None]:
#X_train.info(verbose=True, null_counts=True)#

In [None]:
#X_train.describe()

In [None]:
#%%time
#___________________________________
# We fit our sample data 
#model.fit(X_train, y_train)
#model.predict(X_test)

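# Requires the environment created above (env = janestreet.make_env())
print('Creating submissions file...', end='')
rcount = 0
for (test_df, prediction_df) in env.iter_test():
    if test_df['weight'].item() != 0:
        # Use the training columns and the same -999 fill as in cross-validation
        X_test = test_df.loc[:, X_train_sample.columns].fillna(-999)
        prediction_df.action = xgbclass.predict(X_test)
    else:
        # Zero-weight trades do not contribute to the score, so pass on them
        prediction_df.action = 0
    # env.predict must be called on every iteration, including zero-weight rows
    env.predict(prediction_df)
    rcount += len(test_df.index)
print(f'Finished processing {rcount} rows.')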

#%%time
# Alternative version of the same loop (the test iterator can only be consumed once)
for (test_df, sample_prediction_df) in iter_test:
    #X_test = test_df.loc[:, test_df.drop(columns=(drop_col+['resp']))]
    if test_df['weight'].item() > 0:
        # Keep a DataFrame (no .values) so that fillna is available, and use its result
        X_test_set = test_df.drop(columns=(['date','ts_id'])).fillna(-999)
        preds = xgbclass.predict(X_test_set)
        sample_prediction_df.action = preds
    else:
        # Zero-weight trades do not contribute to the score, so pass on them
        sample_prediction_df.action = 0
    env.predict(sample_prediction_df)
    
    #pred = model.predict(X_test.values.reshape(1, -1))
    #sample_prediction_df.action = transformPred(pred)[0] #make your 0/1 prediction here
    #env.predict(sample_prediction_df)

for (test_df, sample_prediction_df) in iter_test:
    # Select the training columns; fillna returns a new frame, so its result is kept
    X_test_set = test_df.loc[:, X_train_sample.columns].fillna(-999)
    #X_test_set = test_df.drop(columns=(['date'])).values
    preds = xgbclass.predict(X_test_set)
    sample_prediction_df.action = preds
    env.predict(sample_prediction_df)

# Perform test and create submissions file

print('Creating submissions file...', end='')
rcount = 0
for (test_df, prediction_df) in env.iter_test():
    # Use the training columns and the same -999 fill as in cross-validation
    X_test = test_df.loc[:, X_train_sample.columns].fillna(-999)
    y_preds = xgbclass.predict(X_test)
    prediction_df.action = y_preds
    env.predict(prediction_df)
    rcount += len(test_df.index)
print(f'Finished processing {rcount} rows.')