# Introduction

Volatility is one of the most commonly heard terms on the trading floor, and for good reason. In financial markets it describes how much prices fluctuate: high volatility is associated with market instability and large price swings, while low volatility is associated with calm markets. Option prices are directly linked to the volatility of the underlying product, so accurate volatility prediction is essential for options-trading firms such as Optiver.


<font color = 'blue'>
Content:
    
1. [Load and Check Data](#1)
1. [Variable Description](#2)  
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable](#4)
        * [Numerical Variable](#5)
1. [Basic Data Analysis](#6)
1. [Outlier Detection](#7)
1. [Missing Value](#8)
    * [Find Missing Value](#9)
    * [Fill Missing Value](#10)
1. [Visualization](#11)
    * [Correlation Between time_id -- stock_id -- target](#12)
    * [stock_id -- target](#13)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
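# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in matplotlib 3.6; use whichever is available
new_style = "seaborn-v0_8-whitegrid"
plt.style.use(new_style if new_style in plt.style.available else "seaborn-whitegrid")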

import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = "1"></a><br>
# Load and Check Data

In [None]:
train_df = pd.read_csv("/kaggle/input/optiver-realized-volatility-prediction/train.csv")
test_df = pd.read_csv("/kaggle/input/optiver-realized-volatility-prediction/test.csv")
test_stock_id = test_df["stock_id"]

In [None]:
train_df.columns

In [None]:
train_df.head()

In [None]:
train_df.describe()

<a id = "2"></a><br>
# Variable Description
1. stock_id: ID code for the stock
1. time_id: ID code for the time bucket in which the data was recorded
1. target: the value to predict, i.e. the realized volatility of the stock over the window that follows the given time_id
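
For context, the competition describes realized volatility as the square root of the sum of squared log returns over a time window. The sketch below illustrates that definition on a made-up price series; `realized_volatility` and `example_prices` are illustrative names and are not part of the dataset or of this notebook's pipeline.

```python
import numpy as np

def realized_volatility(prices):
    """Square root of the sum of squared log returns of a price series."""
    prices = np.asarray(prices, dtype=float)
    # log return between consecutive prices
    log_returns = np.diff(np.log(prices))
    return float(np.sqrt(np.sum(log_returns ** 2)))

# Made-up prices, only to show the calculation
example_prices = [100.0, 100.5, 100.2, 100.8]
rv = realized_volatility(example_prices)
print(rv)
```

A constant price series gives a realized volatility of zero, and larger price swings give larger values.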

In [None]:
train_df.info()

* float64(1): target
* int64(2): stock_id and time_id

<a id = "3"></a><br>
## Univariate Variable Analysis
* Categorical Variable: stock_id and time_id (integer-coded identifiers)
* Numerical Variable: target
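
One rough way to decide whether a column behaves like a category or like a continuous quantity is to compare its dtype with its number of distinct values. The helper below is a sketch under that assumption; `summarize_columns` and the toy frame are hypothetical and only reuse the column names of train.csv.

```python
import pandas as pd

def summarize_columns(df):
    """Return the dtype and the number of distinct values of each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })

# Toy frame with the same column names as train.csv (values are made up)
toy = pd.DataFrame({
    "stock_id": [0, 0, 1, 1],
    "time_id": [5, 11, 5, 11],
    "target": [0.004, 0.001, 0.002, 0.003],
})
summary = summarize_columns(toy)
print(summary)
```

Integer columns with few distinct values relative to the number of rows are usually identifiers, while a float column whose values are almost all distinct is usually a continuous measurement.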

<a id = "4"></a><br>
### Categorical Variable

In [None]:
def bar_plot(variable):
    """
        input: variable ex: "stock_id"
        output: bar plot & value count
    """
    # get feature
    var = train_df[variable]
    # count the number of samples for each value of the categorical variable
    varValue = var.value_counts()
    
    #visualize
    plt.figure(figsize = (60,40))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
category1 = ["stock_id"]
for c in category1:
    bar_plot(c)

In [None]:
category2 = ["target","stock_id","time_id"]
for c in category2:
    print("{} \n".format(train_df[c].value_counts()))

<a id = "5"></a><br>
### Numerical Variable

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(train_df[variable], bins = 50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
numericVar = ["stock_id", "time_id", "target"]
for n in numericVar:
    plot_hist(n)

<a id = "6"></a><br>
# Basic Data Analysis
* stock_id - target
* time_id - target

In [None]:
train_df[["time_id","target"]].groupby(["time_id"], as_index = False).mean().sort_values(by="target",ascending = False)

In [None]:
train_df[["stock_id","target"]].groupby(["stock_id"], as_index = False).mean().sort_values(by="target",ascending = False)
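
The group means above hide how much the target varies within each group and how many rows each group has. A minimal sketch of a richer per-stock summary follows, using a made-up toy frame in place of `train_df`; `per_stock` is an illustrative name.

```python
import pandas as pd

# Toy data standing in for train_df (values are made up)
toy = pd.DataFrame({
    "stock_id": [0, 0, 1, 1, 1],
    "target": [0.002, 0.004, 0.001, 0.001, 0.004],
})

# Mean, standard deviation and row count of target for each stock
per_stock = (
    toy.groupby("stock_id")["target"]
    .agg(["mean", "std", "count"])
    .sort_values("mean", ascending=False)
)
print(per_stock)
```

The same call with `"time_id"` as the grouping key would summarize the target per time bucket.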

<a id = "7"></a><br>
# Outlier Detection

In [None]:
def detect_outliers(df,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outlier and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
        
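    # count in how many features each row index was flagged
    outlier_indices = Counter(outlier_indices)
    # keep rows flagged in more than 2 features; with 3 features this means flagged in all of them
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]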
   
    return multiple_outliers

In [None]:
train_df.loc[detect_outliers(train_df,["time_id","stock_id","target"])]

In [None]:
# drop outliers
train_df = train_df.drop(detect_outliers(train_df,["time_id","stock_id","target"]), axis = 0).reset_index(drop = True)
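
Since stock_id and time_id are identifiers, the IQR rule is arguably most meaningful for target alone. The sketch below applies the same 1.5 × IQR rule to a single column and returns a boolean mask instead of dropping rows; `iqr_outlier_mask` and the toy series are hypothetical and not part of the original analysis.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask that is True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one obvious extreme
toy_target = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
mask = iqr_outlier_mask(toy_target)
print(toy_target[mask])
```

Returning a mask keeps the decision to drop, cap, or merely inspect the flagged rows separate from the detection step.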

<a id = "8"></a><br>
# Missing Value
* Find Missing Value
* Fill Missing Value

In [None]:
train_df_len = len(train_df)
train_df = pd.concat([train_df,test_df],axis = 0).reset_index(drop = True)
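
Storing the training length before concatenating makes it possible to separate the two parts again later. The sketch below shows that round trip on made-up toy frames; `toy_train`, `toy_test`, `train_part` and `test_part` are illustrative names.

```python
import pandas as pd

# Toy frames standing in for train_df and test_df (values are made up)
toy_train = pd.DataFrame({"stock_id": [0, 1], "time_id": [5, 5], "target": [0.004, 0.002]})
toy_test = pd.DataFrame({"stock_id": [0], "time_id": [4]})

n_train = len(toy_train)
combined = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)

# The first n_train rows came from the training frame, the rest from the test frame
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop(columns=["target"])
print(combined)
```

Rows that come from a frame without a `target` column receive NaN in that column after the concatenation.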

<a id = "9"></a><br>
## Find Missing Value

In [None]:
train_df.columns[train_df.isnull().any()]

In [None]:
train_df.isnull().sum()

<a id = "10"></a><br>
## Fill Missing Value
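* target has 3 missing values. These most likely belong to the rows appended from test_df, which carries no target, so they are inspected here rather than filled.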

In [None]:
train_df[train_df["target"].isnull()]
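
A compact report of missing counts and percentages can make it easier to judge how much each column is affected. The helper below is a sketch on a made-up toy frame; `missing_summary` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values, restricted to columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": 100 * n_missing / len(df),
    })
    return summary[summary["n_missing"] > 0]

# Toy frame with one missing target (values are made up)
toy = pd.DataFrame({
    "stock_id": [0, 0, 1, 1],
    "target": [0.004, np.nan, 0.002, 0.003],
})
report = missing_summary(toy)
print(report)
```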

<a id = 11></a><br>
# Visualization

<a id = 12></a><br>
## Correlation Between time_id -- stock_id -- target

In [None]:
list1 = ["time_id","stock_id","target"]
sns.heatmap(train_df[list1].corr(), annot = True, fmt = ".2f")
plt.show()

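stock_id shows essentially no linear correlation with target (-0.02). Because stock_id is an arbitrary identifier, this coefficient says little about how target differs between stocks.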
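
When one variable is an identifier rather than a quantity, a Pearson coefficient is hard to interpret. One alternative, offered here as an assumption rather than as part of the original analysis, is the correlation ratio (eta squared): the share of the total variance of target that is explained by differences between group means. `correlation_ratio` is a hypothetical helper.

```python
import pandas as pd

def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares divided by total sum of squares."""
    df = pd.DataFrame({"cat": list(categories), "val": list(values)})
    grand_mean = df["val"].mean()
    group = df.groupby("cat")["val"].agg(["mean", "count"])
    ss_between = (group["count"] * (group["mean"] - grand_mean) ** 2).sum()
    ss_total = ((df["val"] - grand_mean) ** 2).sum()
    # A constant column has no variance to explain
    return float(ss_between / ss_total) if ss_total > 0 else 0.0

# Made-up data: the group fully determines the value in the first case, not at all in the second
eta_full = correlation_ratio([0, 0, 1, 1], [1.0, 1.0, 3.0, 3.0])
eta_none = correlation_ratio([0, 0, 1, 1], [1.0, 3.0, 1.0, 3.0])
print(eta_full, eta_none)
```

Values near 1 indicate that the group explains most of the variation, values near 0 that it explains almost none.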

<a id = 13></a><br>
## stock_id -- target

In [None]:
# seaborn 0.9 renamed factorplot to catplot and the size parameter to height
g = sns.catplot(x = "stock_id", y = "target", data = train_df, kind = "bar", height = 30)
g.set_ylabels("stock/target")
plt.show()

* when stock_id is 28, 43 or 125, the stock/target value (the mean target of that stock) is lower than for most other stocks
* most of the stocks with the highest stock/target values appear in the first half of the stock_id range
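
Readings like these are hard to verify by eye on a chart with many bars. A minimal sketch of ranking the per-stock means directly follows, on made-up toy data; `mean_target`, `lowest` and `highest` are illustrative names.

```python
import pandas as pd

# Toy data standing in for train_df (values are made up)
toy = pd.DataFrame({
    "stock_id": [0, 0, 1, 1, 2, 2],
    "target": [0.001, 0.003, 0.004, 0.006, 0.001, 0.001],
})

mean_target = toy.groupby("stock_id")["target"].mean()
lowest = mean_target.nsmallest(2)   # stocks with the lowest mean target
highest = mean_target.nlargest(1)   # stock with the highest mean target
print(lowest)
print(highest)
```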