<a id="0"></a>
# <center style="background-color:#63809e; color:white;">Jane Street Market Prediction</center>

<center><img src="https://www.janestreet.com/assets/logo_horizontal.png" width=70%></center>

### <center style="background-color:yellow; width:150px;">Introduction</center>
In this competition, if one is able to generate a highly predictive model which selects the right trades to execute, they would also be playing an important role in sending the market signals that push prices closer to “fair” values. That is, a better model will mean the market will be more efficient going forward. However, developing good models will be challenging for many reasons, including a very low signal-to-noise ratio, potential redundancy, strong feature correlation, and difficulty of coming up with a proper mathematical formulation.
This is a Code Competition and you need to submit notebooks for evaluation.

### <center style="background-color:yellow; width:150px;">Deadlines</center>
* February 15, 2021 - Entry deadline. You must accept the competition rules before this date in order to compete.
* February 15, 2021 - Team Merger deadline. This is the last day participants may join or merge teams.
* February 22, 2021 - Final submission deadline.

After the final submission deadline, the leaderboard will be updated periodically as selected notebooks are re-run against new market data. The competition ends on August 23, 2021.


### <center style="background-color:yellow; width:150px;">Evaluation</center>
This competition is evaluated on a utility score. Each row in the test set represents a trading opportunity for which you will be predicting an action value, 1 to make the trade and 0 to pass on it. Each trade j has an associated weight and resp, which represents a return.

For each date $i$, we define:

$$p_i = \sum_j \left(weight_{ij} \cdot resp_{ij} \cdot action_{ij}\right)$$

$$t = \frac{\sum p_i}{\sqrt{\sum p_i^2}} \cdot \sqrt{\frac{250}{|i|}}$$

where $|i|$ is the number of unique dates in the test set. The utility is then defined as:

$$u = \min(\max(t, 0), 6) \sum p_i$$
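
The utility formula can be checked locally with a few lines of NumPy. The following is a minimal sketch, not the official scorer: the function name `utility_score` and the toy arrays are made up for illustration, and returning 0 when every `p_i` is zero is an assumption of this sketch.

```python
import numpy as np

def utility_score(date, weight, resp, action):
    """Sketch of u = min(max(t, 0), 6) * sum(p_i) from the evaluation formula."""
    date = np.asarray(date).ravel()
    # Per-row contribution: weight * resp * action
    contrib = (np.asarray(weight, dtype=float)
               * np.asarray(resp, dtype=float)
               * np.asarray(action, dtype=float)).ravel()

    # p_i: sum of the contributions within each unique date
    unique_dates, inverse = np.unique(date, return_inverse=True)
    p = np.bincount(inverse.ravel(), weights=contrib, minlength=len(unique_dates))

    sum_p = p.sum()
    sum_p2 = np.sum(p ** 2)
    if sum_p2 == 0:
        # Assumption of this sketch: no trades taken (or all zero) scores 0
        return 0.0

    # t is scaled by sqrt(250 / number of unique dates)
    t = sum_p / np.sqrt(sum_p2) * np.sqrt(250 / len(unique_dates))
    return float(min(max(t, 0.0), 6.0) * sum_p)

# Toy data (hypothetical values): two dates, four trading opportunities
toy_date = [0, 0, 1, 1]
toy_weight = [1.0, 2.0, 1.0, 1.0]
toy_resp = [0.1, -0.2, 0.3, 0.1]
toy_action = [1, 0, 1, 1]
print(utility_score(toy_date, toy_weight, toy_resp, toy_action))
```

Because `t` is clipped to the range [0, 6], a set of actions whose summed return is negative scores 0 rather than a negative utility.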




### <center style="background-color:yellow; width:150px;">Submission</center>
You must submit to this competition using the provided Python time-series API, which ensures that models do not peek forward in time. To use the API, follow this template in Kaggle Notebooks:

```python
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

for (test_df, sample_prediction_df) in iter_test:
    sample_prediction_df.action = 0 #make your 0/1 prediction here
    env.predict(sample_prediction_df)
```

### <center style="background-color:yellow; width:150px;">Note</center>
* Your notebook must use the time-series module to make predictions.
* The run-time limit for training is 4 hours each for CPU and GPU notebooks.
* The forecasting phase has an additional 10% time allowance.
* The maximum team size is 5.
* You may submit a maximum of 5 entries per day.
* You may select up to 2 final Submissions for judging.
* COMPETITION WEBSITE: [https://www.kaggle.com/c/jane-street-market-prediction](https://www.kaggle.com/c/jane-street-market-prediction)



<h4 style="color: red;">This notebook can be slow to load at times because it renders heavy Plotly graphs.</h4>

## <center style="background-color:#6abada; color:white;">Contents in the Notebook 👉</center>

[   About the competition](#0)
1. [Import Libraries 📚](#1)
2. [Importing Data ✍](#2)
3. [Understanding Data Features 📈](#3)
4. [Exploratory Data Analysis (EDA) 📊](#4)
5. [Data Cleaning 🧹](#5)
    * [Imputing Missing Data](#5a)
    * [Remove Outliers](#5b)
    * [Dropping Duplicates](#5c)
6. [Feature Engineering 👷](#6)    
    * [Dimensionality Reduction - PCA](#6a)
7. [Work in Progress 🚧](#1000)


<a id="1"></a>
## <center style="background-color:#6abada; color:white;">Import Libraries 📚</center>

In [None]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 140)

import os
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

from pandas_profiling import ProfileReport

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.graph_objs as go

import janestreet

from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings("ignore")

<a id="2"></a>
## <center style="background-color:#6abada; color:white;">Importing Data ✍</center>

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/Jane_Street_Sign.jpg/1200px-Jane_Street_Sign.jpg" width=50%></center>

In [None]:
# reading the paths of all the files present in the dataset
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# setting the paths to variables to access when required
TRAINING_PATH = "/kaggle/input/jane-street-market-prediction/train.csv"
FEATURES_PATH = "/kaggle/input/jane-street-market-prediction/features.csv"
TEST_PATH = "/kaggle/input/jane-street-market-prediction/example_test.csv"
SAMPLE_SUB_PATH = "/kaggle/input/jane-street-market-prediction/example_sample_submission.csv"

### <center style="background-color:yellow; width:150px;">Note</center>
1. The dataset in this competition is huge, so reading it with dask could be a good option. For this analysis we use pandas and read only a small subset of the rows to get a feel for the data.<br><br>
2. To read more about dask and how to use it, see this article <a href="https://towardsdatascience.com/dask-a-guide-to-process-large-datasets-using-parallelization-c5554889abdb">here.</a>
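
If dask is not available, plain pandas can also keep memory in check by streaming the file in chunks and downcasting floats. This is a swapped-in alternative to dask, not what this notebook does below. The helper name `read_csv_in_chunks`, the chunk size, and the tiny in-memory CSV are all made up for illustration.

```python
import io
import pandas as pd

def read_csv_in_chunks(source, chunksize=100_000, float_dtype="float32"):
    """Read a CSV chunk by chunk, downcasting float64 columns to save memory."""
    pieces = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Only float64 columns are downcast; integer columns are left untouched
        float_cols = chunk.select_dtypes(include="float64").columns
        chunk = chunk.astype({col: float_dtype for col in float_cols})
        pieces.append(chunk)
    return pd.concat(pieces, ignore_index=True)

# Tiny in-memory stand-in for train.csv (hypothetical values)
demo_csv = io.StringIO("date,weight,resp\n0,0.5,0.01\n0,1.5,-0.02\n1,0.0,0.03\n")
demo_df = read_csv_in_chunks(demo_csv, chunksize=2)
print(demo_df.dtypes)
```

Downcasting to `float32` roughly halves the memory used by float columns, at the cost of some precision.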


In [None]:
# reading data using pandas.

n_rows = 200000

train_df = pd.read_csv(TRAINING_PATH, nrows=n_rows)
features_df = pd.read_csv(FEATURES_PATH, nrows=n_rows)
test_df = pd.read_csv(TEST_PATH, nrows=n_rows)

train_df.head()

In [None]:
train_df.shape

<a id="3"></a>
## <center style="background-color:#6abada; color:white;">Understanding Data Features 📈</center>

### <center style="background-color:yellow; width:300px;">Data Types</center>

In [None]:
print("Train data set dtypes: \n")
print(f"Total Cols: {len(train_df.columns)}")
print(f"{train_df.dtypes.value_counts()}")
print('-'*30)

print("Features data set dtypes: \n")
print(f"Total Cols: {len(features_df.columns)}")
print(f"{features_df.dtypes.value_counts()}")
print('-'*30)

print("Test data set dtypes: \n")
print(f"Total Cols: {len(test_df.columns)}")
print(f"{test_df.dtypes.value_counts()}")

In [None]:
train_df.describe()

### <center style="background-color:yellow; width:250px;">Features Correlation</center>


In [None]:
correlations = train_df.corr(method='pearson')

In [None]:
fig, axs = plt.subplots(figsize=(16, 16))
sns.heatmap(correlations)
fig.savefig('correlation_map.png')

### <center style="background-color:yellow; width:300px;">Missing Values</center>
#### <center style="background-color:purple; color:white; width:200px;">Missing Valued Cols</center>

In [None]:
#  Missing Values
print('Train Nan Valued cols: %d' %train_df.isna().any().sum())
print('Features Nan Valued cols: %d' %features_df.isna().any().sum())
print('Test Nan Valued cols: %d' %test_df.isna().any().sum())

#### <center style="background-color:purple; color:white; width:200px;">Missing Values per Col</center>

In [None]:
n_features = 40
nan_val = train_df.isna().sum()[train_df.isna().sum() > 0].sort_values(ascending=False)
print(nan_val)


fig, axs = plt.subplots(figsize=(10, 10))

sns.barplot(y = nan_val.index[0:n_features], 
            x = nan_val.values[0:n_features], 
            alpha = 0.8
           )

plt.title(f'NaN values of train dataset (Top {n_features})')
plt.xlabel('NaN values')
fig.savefig(f'nan_values_top_{n_features}_features.png')
plt.show()

<a id="4"></a>
## <center style="background-color:#6abada; color:white;">Exploratory Data Analysis 📊</center>

### <center style="background-color:yellow; width:500px;">Understanding Distributions Using Plots</center>
#### <center style="background-color:purple; color:white; width:280px;">Weight and Resp Distribution Plots</center>

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(16, 6))
sns.distplot(train_df['resp'], ax=axs[0])
sns.distplot(train_df['weight'], ax=axs[1])
fig.savefig('resp_weight_distplot.png')

#### <center style="background-color:purple; color:white; width:200px;">Resp_i Distribution Plots</center>

In [None]:
fig, axs = plt.subplots(1, 4, figsize=(16, 6))
sns.distplot(train_df['resp_1'], ax=axs[0])
sns.distplot(train_df['resp_2'], ax=axs[1])
sns.distplot(train_df['resp_3'], ax=axs[2])
sns.distplot(train_df['resp_4'], ax=axs[3])
fig.savefig('resp_distplot.png')

#### <center style="background-color:purple; color:white; width:150px;">Resp Violin Plots</center>

In [None]:
fig, ax = plt.subplots(figsize=(16, 12))
sns.violinplot(data=train_df[["resp_1", "resp_2", "resp_3", "resp_4", "resp"]], 
               inner="points", 
               linewidth=1, 
               palette="Set3", 
               ax=ax)    
fig.savefig('resp_violinplot.png')

#### <center style="background-color:purple; color:white; width:250px;">Date vs ts_id Contours Plot</center>

In [None]:
fig = px.density_contour(train_df, x="date", y="ts_id")
fig.update_traces(contours_coloring="fill", contours_showlabels = True)
fig.show()

In [None]:
## TODO: uncomment the below if you want to visualize the distribution plots of resp features.


# hist_data = [train_df["resp"], train_df["resp_1"], train_df["resp_2"], train_df["resp_3"], train_df["resp_4"]]
# group_labels = ['resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4']

# # Create distplot with curve_type set to 'normal'
# fig = ff.create_distplot(hist_data, group_labels, curve_type='normal', show_hist=False)
# fig.show()

#### <center style="background-color:purple; color:white; width:250px;">Feature_i Histogram Plots</center>

In [None]:
date = 0
n_features = 130

cols = [f'feature_{i}' for i in range(1, n_features)]
hist = px.histogram(
    train_df[train_df["date"] == date], 
    x=cols, 
    animation_frame='variable', 
    range_y=[0, 600], 
    range_x=[-7, 7]
)

hist.show()

In [None]:
## TODO: uncomment the below if you want to visualize the scatter plot between weights and resp features.

# date = 0
# n_features = 5
# cols = [f'resp_{i}' for i in range(1, n_features)]
# cols.insert(0, "resp")

# sctr = px.scatter(train_df[train_df["date"] == date], 
#                      x=cols, 
#                      y="weight", 
#                      size="weight", 
#                      color="date",
#                      hover_name="weight", 
#                      animation_frame='variable',
#                      log_x=True, 
#                      size_max=40)

# sctr.show()

### <center style="background-color:yellow; width:250px;">Understanding Data Spread</center>

In [None]:
feature_cols = [c for c in train_df.columns if 'feature' in c]

for f in feature_cols:
    fig, axs = plt.subplots(1, 4, figsize=(16, 5))
    
    # plot 1
    sns.distplot(train_df[f], ax=axs[0])
    
    # plot 2
    sns.distplot(train_df.query('weight > 0')[f], ax=axs[1])
    
    # plot 3
    try:
        sns.distplot(train_df.query('weight > 0 and resp > 0')[f].dropna().apply(np.log1p), ax=axs[2])
        sns.distplot(train_df.query('weight > 0 and resp < 0')[f].dropna().apply(np.log1p), ax=axs[2])
    except Exception:
        # log1p is not finite for values <= -1, so skip this panel if plotting fails
        pass
    
    # plot 4
    train_df.plot(kind='scatter', x=f, y='resp', ax=axs[3])
    
    fig.suptitle(f, fontsize=15, y=1.1)
    axs[0].set_title('Feature Distribution (all weights)')
    axs[1].set_title('Feature Distribution (weights > 0)')
    axs[2].set_title('Log Transform')
    axs[3].set_title('Feature vs. Response')
    
    plt.tight_layout()
    plt.show()

### <center style="background-color:yellow; width:250px;">Understanding Outliers </center>

In [None]:
def plot_outlier_graph_set(df):    
    feature_cols = [c for c in df.columns if 'feature' in c]

    for f in feature_cols:
        fig, axs = plt.subplots(1, 4, figsize=(16, 5))

        # plot 1
        sns.boxplot(y=f, data=df, ax=axs[0])

        # plot 2
        sns.boxenplot(y=f, data=df, ax=axs[1])

        # plot 3
        sns.violinplot(y=f, data=df, ax=axs[2]) 

        # plot 4
        sns.stripplot(y=f, data=df, size=4, color=".3", linewidth=0, ax=axs[3])


        fig.suptitle(f, fontsize=15, y=1.1)
        axs[0].set_title('Box Plot')
        axs[1].set_title('Boxen Plot')
        axs[2].set_title('Violin Plot')
        axs[3].set_title('Strip Plot')

        plt.tight_layout()
        plt.show()

#### <center style="background-color:purple; color:white; width:150px;">Unclean Plots</center>

In [None]:
plot_outlier_graph_set(train_df)

<a id="5"></a>
## <center style="background-color:#6abada; color:white;">Data Cleaning 🧹</center>

<a id="5a"></a>
## <center style="background-color:yellow; width:300px;">a) Imputing Missing Data</center>
### <center style="background-color:yellow; width:250px;">Simple Imputer</center>

In [None]:
imputer = SimpleImputer(strategy='mean')
imputed_train_df = pd.DataFrame(imputer.fit_transform(train_df))

imputed_train_df.columns=train_df.columns
imputed_train_df.index=train_df.index

print(f"Is there any missing values left? {imputed_train_df.isna().sum().any()}")
imputed_train_df.head()

<a id="5b"></a>
## <center style="background-color:yellow; width:350px;">b) Remove Outliers</center>
### <center style="background-color:yellow; width:250px;">Z-score calculations</center>

In [None]:
# Calculating the Z score to calculate outliers

threshold = 4

z = np.abs(stats.zscore(imputed_train_df, nan_policy='omit'))
print(f"Z score array:\n{z}")
print('-'*30)

row_index = np.where(z > threshold)
print(f"Outlier Data Rows: {len(set(row_index[0]))}")
print('-'*30)

print(f"Outlier Rows Index:\n{row_index}")
print('-'*30)

print(f"Sample outlier row data:\n{z[row_index[0][0]]}")

In [None]:
clean_train_df = imputed_train_df[(z < threshold).all(axis=1)].reset_index(drop=True)
clean_train_df
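
The filter above keeps a row only if every column has an absolute z-score below the threshold, so a single extreme value in any one column removes the whole row. A minimal NumPy sketch of the same rule on toy data follows. The function name `zscore_row_mask` and the toy array are made up for illustration, and treating constant columns as z = 0 is an assumption of this sketch, not something the cell above does.

```python
import numpy as np

def zscore_row_mask(values, threshold=4.0):
    """Return a boolean mask of rows whose |z| is below threshold in every column."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # population std (ddof=0)

    # Assumption of this sketch: constant columns (std == 0) get z = 0
    safe_std = np.where(std == 0, 1.0, std)
    z = np.abs((values - mean) / safe_std)
    return (z < threshold).all(axis=1)

# Toy data: 99 ordinary rows and one row with an extreme value in column 0
toy = np.zeros((100, 2))
toy[:, 1] = 5.0      # constant column
toy[-1, 0] = 100.0   # the outlier
mask = zscore_row_mask(toy, threshold=4.0)
print(f"Rows kept: {mask.sum()} of {len(mask)}")
```

With around 130 feature columns, the "all columns" condition can discard many rows, so it is worth comparing `clean_train_df.shape` with `imputed_train_df.shape` after filtering.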

<a id="5c"></a>
## <center style="background-color:yellow; width:350px;">c) Dropping Duplicates</center>

In [None]:
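# drop duplicate rows, keeping the first occurrence of each, and
# reset the index so it stays aligned with frames built later
clean_train_df = clean_train_df.drop_duplicates().reset_index(drop=True)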
clean_train_df

<div><b>Note: The data does not have any duplicates as the size of the dataframe does not change.</b></div>

#### <center style="background-color:purple; color:white; width:150px;">Clean Plots</center>

In [None]:
plot_outlier_graph_set(clean_train_df)

<a id="6"></a>
## <center style="background-color:#6abada; color:white;">Feature Engineering 👷</center>

<a id="6a"></a>
### <center style="background-color:yellow; width:350px;">Principal Component Analysis</center>

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_rescaled = scaler.fit_transform(clean_train_df)

In [None]:

def plot_PCA_threshold(threshold, data_rescaled):
#     pca = PCA(n_components=threshold)
#     pca.fit(data_rescaled)
#     reduced = pca.transform(data_rescaled)

    pca = PCA().fit(data_rescaled)

    fig, ax = plt.subplots(figsize=(16, 6))
    y = np.cumsum(pca.explained_variance_ratio_)
    xi = np.arange(1, len(y)+1, step=1)

    # print(f"X vals:\n{xi}")
    # print('-'*30)
    # print(f"Y vals:\n{y}")
    # print('-'*30)

    plt.ylim(0.0,1.1)
    plt.plot(xi, y, marker='o', linestyle='--', color='b')

    plt.xlabel('Number of Components')
    plt.xticks(np.arange(0, len(y), step=5)) #change from 0-based array index to 1-based human-readable label
    plt.ylabel('Cumulative explained variance (fraction)')
    plt.title('The number of components needed to explain variance')

    plt.axhline(y=threshold, color='r', linestyle='-')
    plt.text(0.5, 0.85, f'{threshold*100}% cut-off threshold', color = 'red', fontsize=16)

    ax.grid(axis='x')
    fig.savefig(f"PCA_threshold_{threshold*100}p.png")
    plt.show()

In [None]:
thresholds = [0.90, 0.95, 0.97, 0.99]

for th in thresholds:
    plot_PCA_threshold(th, data_rescaled)

<center style="color:black; background-color:yellow;"><h4>Based on the above plots we can choose how many components to keep as the final reduced features after PCA. Here we choose 35 components based on the 95% threshold.</h4></center>
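
Instead of reading the component count off the plot, it can be computed from the cumulative explained variance. The sketch below works on any array of explained variance ratios, such as `pca.explained_variance_ratio_`. The function name `components_for_threshold` and the toy ratios are made up for illustration.

```python
import numpy as np

def components_for_threshold(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value >= threshold (0-based)
    idx = int(np.searchsorted(cumulative, threshold, side="left"))
    # Convert to a 1-based count and cap it at the number of components
    return min(idx + 1, len(cumulative))

# Toy ratios (hypothetical values), sorted in decreasing order as PCA returns them
toy_ratios = [0.5, 0.3, 0.15, 0.05]
for th in (0.4, 0.75, 0.9):
    print(th, components_for_threshold(toy_ratios, th))
```

As far as I know, scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then picks the number of components needed to explain that fraction of the variance, which would avoid hard-coding the count.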

In [None]:
chosen_threshold = 0.95
n_components=35

feature_cols = [c for c in train_df.columns if 'feature' in c]
new_cols=[]
for i in range(n_components):
    new_cols.append(f"new_feature_{i}")
    
print("Principal Component Analysis (PCA)")
print('-'*30)
print(f"New Columns:\n{new_cols}")
print('-'*30)


pca = PCA(n_components=n_components)
principalComponents = pca.fit_transform(clean_train_df[feature_cols])
principal_df = pd.DataFrame(data=principalComponents, columns=new_cols )

print(f"PCA Explained Variance Ratio:\n{pca.explained_variance_ratio_}")
print('-'*30)
print(f"PCA Singular Values:\n{pca.singular_values_}")
print('-'*30)
principal_df.head()

In [None]:
squzzed_train_df = pd.concat([principal_df, clean_train_df[['date', 'weight', 'resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4', 'ts_id']]], axis = 1)
squzzed_train_df.head()

In [None]:
squzzed_train_df.to_csv(f"PCA_reduced_{chosen_threshold*100}p_{n_rows}x{n_components}.csv", index=False)

In [None]:
# fig, axs = plt.subplots(figsize=(6, 6))
# sns.scatterplot(data=squzzed_train_df, 
#                 x=new_cols[0], 
#                 y=new_cols[1], 
#                 hue="weight", 
#                 size="weight")

# fig.savefig("PCA_f1_f2_scatterplot.png")

In [None]:
feature_cols = [c for c in squzzed_train_df.columns if 'feature' in c]

for f in feature_cols:
    fig, axs = plt.subplots(1, 4, figsize=(16, 5))
    
    # plot 1
    sns.distplot(squzzed_train_df[f], ax=axs[0])
    
   
    # plot 2
    try:
        sns.distplot(squzzed_train_df.query('weight > 0 and resp > 0')[f].dropna().apply(np.log1p), ax=axs[1])
        sns.distplot(squzzed_train_df.query('weight > 0 and resp < 0')[f].dropna().apply(np.log1p), ax=axs[1])
    except Exception:
        # log1p is not finite for values <= -1, so skip this panel if plotting fails
        pass
    
     # plot 3
    sns.violinplot(y=f, data=squzzed_train_df, palette="Set3", linewidth=0.5, inner="points", ax=axs[2]) 

    # plot 4
    squzzed_train_df.plot(kind='scatter', x=f, y='resp', ax=axs[3])
    
    fig.suptitle(f, fontsize=15, y=1.1)
    axs[0].set_title('Feature Distribution (all weights)')
    axs[1].set_title('Log Transform')
    axs[2].set_title('Feature Violin Plot')
    axs[3].set_title('Feature vs. Response')
    
    plt.tight_layout()
    plt.show()

In [None]:
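# pairplot builds its own figure, so save through the returned PairGrid
# rather than the stale `fig` left over from the previous cell
pair_grid = sns.pairplot(squzzed_train_df, 
                         vars=new_cols)

pair_grid.savefig('PCA_new_features_pairplot.png')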

### [Back to the Top 👆](#0)

<a id="1000"></a>

<center><h2>Notebook Under Development</h2></center>
<img src="https://cdn1.iconfinder.com/data/icons/construction-220/64/43-512.png" width=100 height=100>
<center><h4>I hope it was helpful!!</h4></center>
