<a id="0"></a>
# <center style="background-color:#656d73; color:white;">Jane Street Market Prediction</center>

<center><img src="https://www.janestreet.com/assets/logo_horizontal.png" width=70%></center>

### <center style="background-color:white; width:150px;">References</center>
Jane Street Market EDA: https://www.kaggle.com/blurredmachine/jane-street-market-eda-viz-prediction

Jane Street: EDA of day 0 and feature importance: https://www.kaggle.com/carlmcbrideellis/jane-street-eda-of-day-0-and-feature-importance

## <center style="background-color:#656d73; color:white;">Contents in the Notebook </center>
1. [Import Libraries 📚](#1)
2. [Importing Data ✍](#2)
3. [Understanding Data Features 📈](#3)
4. [Exploratory Data Analysis (EDA) 📊](#4)

<a id="1"></a>
## <center style="background-color:#656d73; color:white;">Import Libraries 📚</center>

In [None]:
import gc
import warnings

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import datatable as dt  # fast reader for large CSV files
from sklearn.ensemble import RandomForestRegressor
import eli5
from eli5.sklearn import PermutationImportance

# silence all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="2"></a>
## <center style="background-color:#656d73; color:white;">Importing Data ✍</center>

<center><img src="https://media.licdn.com/dms/image/C4E1BAQHbbXAVafylxA/company-background_10000/0?e=2159024400&v=beta&t=k-Znz1NioMxVcmnZULjORawONhIWxh3oj82qEkhTXrw" width=30%></center>

In [None]:
%%time
# read the large training file with datatable, then convert it to pandas

train_data_datatable = dt.fread('/kaggle/input/jane-street-market-prediction/train.csv')
train = train_data_datatable.to_pandas()
train.head()
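
Once the frame is in pandas, memory can become the bottleneck for a file this large. The sketch below shows one common way to reduce it, by storing `float64` columns as `float32`. The helper name `downcast_floats` and the toy frame are assumptions made for illustration and are not part of the original notebook; whether the lost precision is acceptable depends on the analysis.

```python
import numpy as np
import pandas as pd

def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every float64 column stored as float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: np.float32 for col in float_cols})

# Toy frame standing in for `train` (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_0": rng.normal(size=1_000),
    "resp": rng.normal(size=1_000),
    "ts_id": np.arange(1_000),
})
demo_small = downcast_floats(demo)

# Compare the memory footprint before and after the conversion
print(demo.memory_usage(deep=True).sum(), "->", demo_small.memory_usage(deep=True).sum())
```

On the real data this would be called as `train = downcast_floats(train)` right after `to_pandas()`; integer columns such as `ts_id` are left untouched.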

In [None]:
features = pd.read_csv('/kaggle/input/jane-street-market-prediction/features.csv',index_col=0)
features

In [None]:
test = pd.read_csv('/kaggle/input/jane-street-market-prediction/example_test.csv')
test.head()

<a id="3"></a>
## <center style="background-color:#656d73; color:white;">Understanding Data Features 📈</center>

In [None]:
# Show Data Types
print("Train data set dtypes: \n")
print(f"Total Cols: {len(train.columns)}")
print(f"{train.dtypes.value_counts()}")
print('-'*30)

print("Features data set dtypes: \n")
print(f"Total Cols: {len(features.columns)}")
print(f"{features.dtypes.value_counts()}")
print('-'*30)

print("Test data set dtypes: \n")
print(f"Total Cols: {len(test.columns)}")
print(f"{test.dtypes.value_counts()}")

<a id="4"></a>
## <center style="background-color:#656d73; color:white;">Exploratory Data Analysis 📊</center>

In [None]:
# first look: cumulative sum of resp over all trades
fig, ax = plt.subplots(figsize=(15, 5))
balance = train['resp'].cumsum()
balance.plot(ax=ax, lw=3)
ax.set_xlabel("Trade", fontsize=18)
ax.set_ylabel("Cumulative resp", fontsize=18)
del balance
gc.collect();

In [None]:
# drop the resp_1..resp_4 columns, which are not used in this analysis
train.drop(['resp_1', 'resp_2', 'resp_3', 'resp_4'], axis=1, inplace=True)
train.head()

In [None]:
# set ts_id as index (set_index returns a new frame, so assign the result)
train = train.set_index('ts_id')
train.head()

In [None]:
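# check the distribution of the target variable
# (histplot replaces distplot, which is deprecated since seaborn 0.11)
ax = sns.histplot(train['resp'], kde=True)
ax.set_xlabel('Histogram of the resp values')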

In [None]:
print("Skew of resp is:      %.2f" %train['resp'].skew() )
print("Kurtosis of resp is: %.2f"  %train['resp'].kurtosis() )

# This distribution has very long tails
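
To put the kurtosis figure in context, the sketch below compares a normal sample with a heavy-tailed one. It uses synthetic data only (a Student-t distribution with 5 degrees of freedom, chosen here purely for illustration), not the competition data. Note that pandas' `kurtosis()` reports excess kurtosis, so a normal distribution scores close to 0 and long tails push the value up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic samples: a normal reference and a heavy-tailed Student-t
normal_sample = pd.Series(rng.normal(size=100_000))
heavy_sample = pd.Series(rng.standard_t(df=5, size=100_000))

for name, s in [("normal", normal_sample), ("student-t (df=5)", heavy_sample)]:
    # skew() measures asymmetry, kurtosis() measures tail weight relative to a normal
    print(f"{name:>18}: skew={s.skew():.2f}, excess kurtosis={s.kurtosis():.2f}")
```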

In [None]:
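# check how the trades are distributed across dates
# (histplot replaces distplot, which is deprecated since seaborn 0.11)
sns.histplot(train['date'], kde=True)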

In [None]:
%%time
# correlation heatmap of the features on day 0 only (not the whole dataset)
day_0 = train.loc[train['date'] == 0]
plt.figure(figsize=(130,130))
sns.heatmap(day_0.loc[:,'feature_0':'feature_129'].corr(),annot=True,cmap='viridis',linewidths=.5)

In [None]:
%%time
def get_redundant_pairs(df):
    """Return the diagonal and lower-triangle column pairs of the correlation matrix."""
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n):
    """Return the n largest absolute pairwise correlations, counting each pair once."""
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(day_0, 50))
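
The same ranking can be obtained without the explicit double loop. The sketch below is one possible vectorized variant, not the notebook's method: it masks the correlation matrix with `np.triu` so that each pair appears once. The function name `top_abs_correlations` and the toy frame are assumptions for illustration. Unlike the loop version, it also drops pairs whose correlation is NaN (for example, constant columns).

```python
import numpy as np
import pandas as pd

def top_abs_correlations(df: pd.DataFrame, n: int) -> pd.Series:
    """Return the n largest absolute pairwise correlations, each pair once."""
    corr = df.corr().abs()
    # Boolean mask that is True strictly above the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # where() blanks the diagonal and lower triangle with NaN; dropna() removes them
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data: "b" is nearly a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "a": x,
    "b": x + 0.1 * rng.normal(size=500),
    "c": rng.normal(size=500),
})
top = top_abs_correlations(toy, 3)
print(top)
```

On the notebook's data the call would be `top_abs_correlations(day_0, 50)`.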

In [None]:
%%time
# convert the boolean tag flags to 0/1 integers
features = features * 1
# display the transposed frame (tags as rows) with a colour gradient
features.T.style.background_gradient(cmap='Oranges')

In [None]:
# number of tags attached to each feature
tag_sum = pd.DataFrame(features.T.sum(axis=0),columns=['Number of tags'])
tag_sum.T

In [None]:
X_train = day_0.loc[:,'feature_0':'feature_129']
X_train = X_train.fillna(X_train.mean())
y_train = day_0["resp"]

In [None]:
%%time
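# quick Permutation Importance using the Random Forest
# max_features=1.0 considers every feature at each split; this is what 'auto'
# meant for regressors before scikit-learn removed that option in version 1.3
regressor = RandomForestRegressor(max_features=1.0)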
regressor.fit(X_train, y_train)
perm_import = PermutationImportance(regressor, random_state=1).fit(X_train, y_train)
# visualize the results
eli5.show_weights(perm_import, top=50, feature_names = X_train.columns.tolist())
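
If `eli5` is not available in the environment, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`. The sketch below shows that alternative on synthetic data; the toy frame, the target construction, and the parameter values (`n_estimators=50`, `n_repeats=5`) are assumptions chosen to keep the example fast, not settings from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only feature_0 drives the target, the rest are noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)),
                      columns=[f"feature_{i}" for i in range(4)])
y_demo = 3.0 * X_demo["feature_0"] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Shuffle each column n_repeats times and record the average drop in R^2
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=0)

importances = pd.Series(result.importances_mean, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's variables the call would be `permutation_importance(regressor, X_train, y_train, random_state=1)`. As in the cell above, importance measured on the training data can overstate features the forest has overfitted to, so a held-out split is usually preferable.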

In [None]:
%%time
# missing values: count NaNs per column and keep the columns that have any
n_features = 50
nan_counts = train.isna().sum()
nan_val = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(nan_val)
fig, axs = plt.subplots(figsize=(10, 10))
sns.barplot(y = nan_val.index[0:n_features], 
            x = nan_val.values[0:n_features], 
            alpha = 0.8
           )
plt.title('Missing values of train dataset')
plt.xlabel('# of Missing values')
plt.show()

### [Back to the Top 👆](#0)

<a id="1000"></a>

<center><h2>Notebook Under Development</h2></center>
<img src="https://cdn1.iconfinder.com/data/icons/construction-220/64/43-512.png" width=100 height=100>
<center><h4>I hope it was helpful!!</h4></center>
