# Exploratory Data Analysis

<img src='https://media-exp1.licdn.com/dms/image/C4E0BAQH6sW40Sn5dPQ/company-logo_200_200/0?e=2159024400&v=beta&t=S0dyG_8Ox_WZ7a86cFw9Uy-rMwvs6XDONzh_zO04Pz8'>
<h1><center>Jane Street Market Prediction - EDA</center></h1>
    
# 1. <a id='Introduction'>Introduction</a>

###  1.1 What is Jane Street?
[Jane Street](https://www.janestreet.com/) is a quantitative trading firm with a unique focus on technology and collaborative problem solving.

In [None]:
from IPython.display import HTML
HTML('<center><iframe width="560" height="315" src="https://www.janestreet.com/join-jane-street/get-to-know-us/?wvideo=097tvs7n47" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></center>')

###  1.2 What is the Jane Street Market Prediction Competition?
- In reality, trading for profit has always been a difficult problem to solve, even more so in today’s fast-moving and complex financial markets. Electronic trading allows for thousands of transactions to occur within a fraction of a second, resulting in nearly unlimited opportunities to potentially find and take advantage of price differences in real time.
- In a perfectly efficient market, buyers and sellers would have all the agency and information needed to make rational trading decisions. As a result, products would always remain at their “fair values” and never be undervalued or overpriced. However, financial markets are not perfectly efficient in the real world for a number of reasons.
- Even if a strategy is profitable now, it may not be in the future, and market volatility makes it impossible to predict the profitability of any given trade with certainty. As a result, it can be hard to distinguish good luck from having made a good trading decision. 

### 1.3 General Information about Electronic Trading.
- [Electronic Trading](https://capital.com/electronic-trading-definition) refers to a method of trading securities, financial derivatives or foreign exchange electronically. Both buyers and sellers use the internet to connect to a trading platform such as an exchange-based system or electronic communication network (ECN).

### 1.4 What do we need to predict?
- Our challenge will be to use the historical data, mathematical tools, and technological tools at our disposal to create a model that gets as close to certainty as possible. We will be presented with a number of potential trading opportunities, which our model must choose whether to accept or reject.
- So, if we generate a highly predictive model which selects the right trades to execute, we’ll also be playing an important role in sending the market signals that push prices closer to “fair” values. A better model will mean the market will be more efficient going forward.

### 1.5 Metric: Utility-Score
- In economics, a [utility function](https://www.investopedia.com/ask/answers/072915/what-utility-function-and-how-it-calculated.asp) is an important concept that measures preferences over a set of goods and services. Utility represents the satisfaction that consumers receive from choosing and consuming a product or service. Utility is measured in units called utils, but the benefit or satisfaction that consumers receive is abstract and difficult to pinpoint. As a result, economists measure utility in terms of revealed preferences by observing consumers' choices. From there, economists create an ordering of consumption baskets from least desired to most preferred.
- Back to how this metric is used here: each row in the test set represents a trading opportunity for which we will be predicting an `action` value, 1 to make the trade and 0 to pass on it. Each trade j has an associated `weight` and `resp`, which represents a return. For each date i, we define:

<center> $ \large p_{i} = \sum_{j} (\mathrm{weight}_{ij} \cdot \mathrm{resp}_{ij} \cdot \mathrm{action}_{ij}) $ </center>


<center> $ \large t = \frac{\sum_{i} p_{i}}{\sqrt{\sum_{i} p_{i}^{2}}} \cdot \sqrt{\frac{250}{|i|}} $ </center>

where |i| is the number of unique dates in the test set. The utility is then defined as:

<center> $ \large u = \min(\max(t, 0), 6) \sum_{i} p_{i} $ </center>
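
The three formulas above can be sketched in a few lines of pandas. This is a minimal illustration rather than the official scorer: the function name `utility_score`, the toy values, and the choice to return 0 when every $p_i$ is zero (where $t$ is undefined) are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def utility_score(date, weight, resp, action):
    """Sketch of the utility metric; the name and signature are hypothetical."""
    df = pd.DataFrame({
        "date": np.asarray(date),
        # per-trade contribution weight * resp * action
        "contrib": np.asarray(weight) * np.asarray(resp) * np.asarray(action),
    })
    p = df.groupby("date")["contrib"].sum().to_numpy()  # p_i, one value per date
    sum_p = p.sum()
    sum_p2 = (p ** 2).sum()
    if sum_p2 == 0:
        # Assumption: with no non-zero p_i, t is undefined, so score it as 0.
        return 0.0
    t = sum_p / np.sqrt(sum_p2) * np.sqrt(250 / len(p))  # len(p) plays the role of |i|
    return float(min(max(t, 0.0), 6.0) * sum_p)


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "date":   [0, 0, 1, 1],
    "weight": [1.0, 2.0, 1.0, 0.5],
    "resp":   [0.01, -0.02, 0.03, 0.02],
    "action": [1, 0, 1, 1],
})
score = utility_score(toy["date"], toy["weight"], toy["resp"], toy["action"])
print(score)
```

Here $p_0 = 0.01$ and $p_1 = 0.04$, so $t$ exceeds the cap of 6 and the utility is $6 \cdot 0.05 = 0.3$. The clipping means that beyond a certain consistency level, only the total return $\sum_i p_i$ raises the score.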

# 2. <a id='importing'>Importing the necessary libraries📗</a> 

In [None]:
import os
from os import listdir
import pandas as pd
import numpy as np
import glob
from typing import Dict
import matplotlib.pyplot as plt
import pandas_profiling as pdp
import json
%matplotlib inline
import shapely.geometry as sg
import shapely.ops as so
import zipfile
import cv2

#jane-street
import janestreet
env = janestreet.make_env() # initialize the environment
iter_test = env.iter_test() # an iterator which loops over the test set

#plotly
!pip install chart_studio
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

#seaborn
import seaborn as sns

#color
from colorama import Fore, Back, Style

#networkx
import networkx as nx

sns.set(style="whitegrid")

#image handling and progress bars
from PIL import Image
import tifffile as tiff
from tqdm.notebook import tqdm

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# Plot styling
plt.style.use('fivethirtyeight')

# 3. <a id='reading'>Reading the train.csv 📚</a>

In [None]:
# List files available
os.listdir("../input/jane-street-market-prediction")

In [None]:
train = pd.read_csv('../input/jane-street-market-prediction/train.csv')
features = pd.read_csv('../input/jane-street-market-prediction/features.csv')
ex_test = pd.read_csv('../input/jane-street-market-prediction/example_test.csv')
ex_sub = pd.read_csv('../input/jane-street-market-prediction/example_sample_submission.csv')
print(Fore.YELLOW + 'Training data shape: ',Style.RESET_ALL,train.shape)
print(Fore.YELLOW + 'Features data shape: ',Style.RESET_ALL,features.shape)
print(Fore.YELLOW + 'Test data shape: ',Style.RESET_ALL,ex_test.shape)
train.head()

In [None]:
features.head(5)

In [None]:
# Number of rows (trading opportunities) per date
train.groupby('date')['resp'].count().to_frame()

# 4. Basic Data Exploration

## General Information

In [None]:
# Null values and Data types
print(Fore.YELLOW + 'Train Set !!',Style.RESET_ALL)
print(train.info())
print('-------------')
print(Fore.BLUE + 'Test Set !!',Style.RESET_ALL)
print(ex_test.info())
print('-------------')
print(Fore.GREEN + 'Feature Set !!',Style.RESET_ALL)
print(features.info())

### Missing values

In [None]:
features.isna().sum()

In [None]:
train.isna().sum()

The output above shows that several columns in the train set contain missing values.
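
Raw counts are hard to compare across columns, so a small helper that also reports the share of missing rows can make the picture clearer. The sketch below uses a made-up toy frame; `missing_summary` and the column names `feature_a` and `feature_b` are hypothetical, and the same function could be called on `train`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(df),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame, invented for illustration only.
toy = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, np.nan],
    "feature_b": [0.5, 0.1, np.nan, 0.2],
    "resp": [0.01, 0.02, -0.01, 0.0],
})
summary = missing_summary(toy)
print(summary)
```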

In [None]:
!ls ../input/jane-street-market-prediction/janestreet

In [None]:
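# .count() on the date column counts rows with a non-null date, not distinct dates
print(Fore.YELLOW +"Rows (non-null dates) in train set: ",Style.RESET_ALL,train['date'].count())
print(Fore.BLUE +"Rows (non-null dates) in test set: ",Style.RESET_ALL,ex_test['date'].count())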

## Unique Dates (IDs)

In [None]:
print(Fore.YELLOW + "The train set has",Style.RESET_ALL,f"{train['date'].count()} rows,", Fore.BLUE + "covering", Style.RESET_ALL, f"{train['date'].nunique()} unique dates.")

In [None]:
print(Fore.YELLOW + "The test set has",Style.RESET_ALL,f"{ex_test['date'].count()} rows,", Fore.BLUE + "covering", Style.RESET_ALL, f"{ex_test['date'].nunique()} unique dates.")

In [None]:
train_dates = set(train['date'].unique())
test_dates = set(ex_test['date'].unique())

train_dates.intersection(test_dates)

The example test set contains `3` dates that also appear in the train set.
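
To see how much data sits behind the shared dates, the set intersection can be extended to count rows on each side. This is only a sketch on invented toy frames; `shared_date_rows` is a hypothetical helper, and on the real data it could be called as `shared_date_rows(train, ex_test)`.

```python
import pandas as pd


def shared_date_rows(left, right, col="date"):
    """Row counts per shared value of `col` in two frames (hypothetical helper)."""
    shared = sorted(set(left[col].unique()) & set(right[col].unique()))
    return pd.DataFrame({
        "rows_left": left[left[col].isin(shared)].groupby(col).size(),
        "rows_right": right[right[col].isin(shared)].groupby(col).size(),
    })


# Toy frames, invented for illustration only.
toy_train = pd.DataFrame({"date": [0, 0, 1, 2]})
toy_test = pd.DataFrame({"date": [0, 1, 1, 5]})
overlap = shared_date_rows(toy_train, toy_test)
print(overlap)
```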

In [None]:
columns = list(train.columns)
print(columns)