<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Data-I/O" data-toc-modified-id="Data-I/O-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data I/O</a></span><ul class="toc-item"><li><span><a href="#[a]" data-toc-modified-id="[a]-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>[a]</a></span></li><li><span><a href="#[b]" data-toc-modified-id="[b]-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>[b]</a></span></li></ul></li><li><span><a href="#Data-Cleaning/Processing" data-toc-modified-id="Data-Cleaning/Processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Cleaning/Processing</a></span><ul class="toc-item"><li><span><a href="#[a]" data-toc-modified-id="[a]-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>[a]</a></span></li><li><span><a href="#[b]" data-toc-modified-id="[b]-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>[b]</a></span></li><li><span><a href="#[c]" data-toc-modified-id="[c]-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>[c]</a></span></li><li><span><a href="#[d]" data-toc-modified-id="[d]-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>[d]</a></span></li><li><span><a href="#[e]" data-toc-modified-id="[e]-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>[e]</a></span></li></ul></li><li><span><a href="#Data-Exploration" data-toc-modified-id="Data-Exploration-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Exploration</a></span><ul class="toc-item"><li><span><a href="#[a]" data-toc-modified-id="[a]-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>[a]</a></span></li></ul></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Feature Engineering</a></span><ul class="toc-item"><li><span><a href="#[a]" data-toc-modified-id="[a]-5.1"><span 
class="toc-item-num">5.1&nbsp;&nbsp;</span>[a]</a></span></li></ul></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#[a]" data-toc-modified-id="[a]-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>[a]</a></span></li></ul></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Evaluation</a></span><ul class="toc-item"><li><span><a href="#[a]" data-toc-modified-id="[a]-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>[a]</a></span></li></ul></li></ul></div>

## Introduction

This notebook is a demo of your skills on a fairly common scientific Python workflow. The challenges act as a baseline assessment of how you apply those skills to real-world use cases. I don't expect a profitable system from this notebook, so breathe easy. The challenges cover the following areas:

- `Data I/O:` Importing and saving datasets in a structured way
- `Data Cleaning/Processing:` Refining datasets for use in scientific analysis
- `Data Exploration:` Learning about the dataset
- `Feature Engineering:` Finding useful predictors
- `Modeling:` Testing predictors
- `Evaluation:` Interpreting results

## Data I/O

[Raw Options Data Download Link](https://drive.google.com/file/d/1ZZtVkDrLo7LysEQCrEyKpsmiGFivyuPl/view?usp=sharing)

This is an `hourly options dataset` covering one month, stored as a `parquet` file. The dataset is dirty.

### [a]
`Download` the dataset to the `./data/raw/` folder.

### [b]
`Import` the dataset. If you have difficulty, try `dask.dataframe`, which lets you operate on data that is larger than memory. Below are some common package imports. Feel free to add your own as needed.

In [7]:
%load_ext watermark
%watermark

%load_ext autoreload
%autoreload 2

# import standard libs
import warnings
warnings.filterwarnings("ignore")
from IPython.display import display
from IPython.core.debugger import set_trace as bp
from pathlib import PurePath, Path
import sys
import time
import re
import os
import json

# get project dir
pp = PurePath(Path.cwd()).parts[:-1]
project_dir = PurePath(*pp)
print(f'\nproject directory: {project_dir}')
data_dir = project_dir / 'data'
script_dir = project_dir / 'src' 
sys.path.append(script_dir.as_posix())

# import python scientific stack
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from numba import jit
import math
import pandas as pd
pd.set_option('display.max_rows', 100)
pd.options.display.float_format = '{:,.4f}'.format

import dask.dataframe as dd
from dask.diagnostics import ProgressBar
pbar = ProgressBar(); pbar.register()

import multiprocessing as mp
from multiprocessing import cpu_count

# import visual tools
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
import seaborn as sns
blue, green, red, purple, gold, teal = sns.color_palette('colorblind', 6)


plt.style.use(['seaborn-talk','bmh'])
#mpl.rcParams['font.family'] = 'Bitstream Vera Sans'
#mpl.rcParams['font.size'] = 9.5
mpl.rcParams['font.weight'] = 'medium'
mpl.rcParams['figure.figsize'] = 10,7

print()
%watermark -p pandas,numpy,scipy,sklearn,dask,pyarrow,fastparquet,numba,matplotlib,seaborn

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
2018-04-11T16:40:28-06:00

CPython 3.6.4
IPython 6.2.1

compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
system     : Linux
release    : 4.4.0-116-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

project directory: /media/files/_Code/python_dataSci_skills_test

pandas 0.22.0
numpy 1.14.2
scipy 1.0.0
sklearn 0.19.1
dask 0.17.2
pyarrow 0.9.0
fastparquet 0.1.4
numba 0.36.2
matplotlib 2.2.2
seaborn 0.8.1


In [None]:
### Please begin here ###
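
As a starting point for the import step, here is a minimal sketch of a loader. The function name `load_raw_options` is made up for illustration, and the file name is left to you since it depends on how you saved the download. It assumes a parquet engine (`pyarrow` or `fastparquet`) is installed.

```python
from pathlib import Path

import pandas as pd


def load_raw_options(path, columns=None):
    """Read the raw parquet file into a pandas DataFrame.

    `path` is a placeholder: point it at wherever you saved the download.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"expected the raw parquet file at {path}")
    # Passing `columns` reads only those columns, which keeps memory use down.
    return pd.read_parquet(path, columns=columns)


# Larger-than-memory alternative (lazy; call .head() or .compute() to materialize):
# import dask.dataframe as dd
# ddf = dd.read_parquet("path/to/file.parquet")
```

Failing early with a clear message when the file is missing tends to be easier to debug than the engine-specific error you would otherwise get.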



## Data Cleaning/Processing

### [a]

`Clean` the raw options data. Use your preferred methodology to standardize the dataset. Please take care to comment or briefly describe what your code does. Make sure to track how long it takes your code to run.
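
One possible way to track run time is a small context manager around each cleaning step. This is only a sketch: `timer` and `basic_clean` are hypothetical helpers, the toy frame stands in for the real data, and the actual cleaning rules will depend on what you find in the dataset.

```python
import time
from contextlib import contextmanager

import pandas as pd


@contextmanager
def timer(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f}s")


def basic_clean(df):
    """Generic first pass: tidy column names, drop exact duplicates and empty rows."""
    out = df.copy()
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    out = out.drop_duplicates().dropna(how="all")
    return out.reset_index(drop=True)


# Toy frame standing in for the raw data (column names are made up).
raw = pd.DataFrame({" Bid ": [1.0, 1.0, None], "Ask": [1.2, 1.2, None]})

with timer("basic_clean"):
    clean = basic_clean(raw)
```

In a notebook, the `%%time` cell magic is a lighter alternative when you only need the timing of a whole cell.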

### [b]
`Add` the following columns:
- `spread`: bid-ask spread
- `midquote`: midpoint of the bid and ask prices
- `spread_pct`: bid-ask spread expressed as a percentage of price
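
The three columns above could be computed along these lines. This sketch assumes the quote columns are named `bid` and `ask` (the real names may differ) and takes "price" to mean the midquote, which is one common convention rather than something the task specifies.

```python
import numpy as np
import pandas as pd


def add_quote_columns(df, bid_col="bid", ask_col="ask"):
    """Return a copy of `df` with spread, midquote and spread_pct columns."""
    out = df.copy()
    out["spread"] = out[ask_col] - out[bid_col]
    out["midquote"] = (out[ask_col] + out[bid_col]) / 2
    # Assumption: "price" means the midquote. A zero midquote becomes NaN
    # so the division does not produce inf.
    out["spread_pct"] = out["spread"] / out["midquote"].replace(0, np.nan) * 100
    return out


# Toy quotes; the second row checks the zero-midquote guard.
quotes = pd.DataFrame({"bid": [1.0, 0.0], "ask": [1.5, 0.0]})
result = add_quote_columns(quotes)
```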
    
### [c]
`Add` the `calculated` intrinsic value of the options as a column called `intrinsic_value`.
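
For reference, intrinsic value is `max(S - K, 0)` for a call and `max(K - S, 0)` for a put, where `S` is the underlying price and `K` the strike. A minimal sketch follows; the column names `underlying_price`, `strike` and `option_type` are assumptions, as is the convention that call labels start with "c".

```python
import numpy as np
import pandas as pd


def add_intrinsic_value(df, spot_col="underlying_price", strike_col="strike",
                        type_col="option_type"):
    """Return a copy of `df` with an `intrinsic_value` column."""
    out = df.copy()
    # Assumption: call rows are labelled "c", "C", "call", ...; everything else is a put.
    is_call = out[type_col].str.lower().str.startswith("c")
    call_value = (out[spot_col] - out[strike_col]).clip(lower=0)
    put_value = (out[strike_col] - out[spot_col]).clip(lower=0)
    out["intrinsic_value"] = np.where(is_call, call_value, put_value)
    return out


# Toy rows covering ITM and OTM calls and puts.
demo_options = pd.DataFrame({
    "underlying_price": [100.0, 100.0, 100.0, 100.0],
    "strike": [90.0, 90.0, 110.0, 110.0],
    "option_type": ["call", "put", "call", "put"],
})
with_intrinsic = add_intrinsic_value(demo_options)
```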
    
When complete, `save` the dataset as a `parquet` file in the `./data/interim/` folder.

### [d]
`Add` an indicator column that labels each option as `In-the-money (ITM)`, `At-the-money (ATM)`, or `Out-of-the-money (OTM)`. Use a `5%` corridor to determine if an option is `ITM`. Feel free to use a numeric label to represent the 3 classes.
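
Here is one possible reading of the corridor rule, sketched with numeric labels. It treats an option as ATM when the underlying price is within 5% of the strike, and otherwise assigns ITM or OTM by option type. That interpretation, the column names, and the labels `1`/`0`/`-1` are all assumptions; adjust them if you read the corridor differently.

```python
import numpy as np
import pandas as pd


def add_moneyness(df, corridor=0.05, spot_col="underlying_price",
                  strike_col="strike", type_col="option_type"):
    """Label rows 1 (ITM), 0 (ATM) or -1 (OTM) in a `moneyness` column."""
    out = df.copy()
    is_call = out[type_col].str.lower().str.startswith("c")
    # Relative distance of spot from strike; positive values favour calls.
    rel = out[spot_col] / out[strike_col] - 1.0
    signed = np.where(is_call, rel, -rel)
    # Conditions are checked in order: the ATM band first, then ITM, else OTM.
    # Rows with missing prices fall through to -1, so handle them during cleaning.
    out["moneyness"] = np.select([np.abs(rel) <= corridor, signed > 0], [0, 1], default=-1)
    return out


# Toy rows: three calls followed by three puts on the same strikes.
demo_chain = pd.DataFrame({
    "underlying_price": [100.0] * 6,
    "strike": [80.0, 98.0, 120.0, 80.0, 98.0, 120.0],
    "option_type": ["call"] * 3 + ["put"] * 3,
})
labeled = add_moneyness(demo_chain)
```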

### [e]

`Save` cleaned dataset to the `./data/processed/` folder as both `parquet` and `csv`. 

What is the size difference between the csv file and the parquet file expressed as a ratio?
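
A sketch of the save-and-compare step, assuming a parquet engine is installed. The helper names and the default file stem `options_clean` are placeholders, not part of the task.

```python
from pathlib import Path

import pandas as pd


def save_processed(df, out_dir, stem="options_clean"):
    """Write `df` as both parquet and csv; return the two paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parquet_path = out_dir / f"{stem}.parquet"
    csv_path = out_dir / f"{stem}.csv"
    df.to_parquet(parquet_path)  # needs pyarrow or fastparquet
    df.to_csv(csv_path, index=False)
    return parquet_path, csv_path


def size_ratio(csv_path, parquet_path):
    """Return csv size divided by parquet size (both measured in bytes)."""
    return Path(csv_path).stat().st_size / Path(parquet_path).stat().st_size
```

With a cleaned frame `df`, usage would look like `size_ratio(*reversed(save_processed(df, "./data/processed")))`, or more readably by unpacking the two returned paths first.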

In [1]:
### Please begin here ###



## Data Exploration

### [a] 

Use your favorite methodology to explore and extract interesting facts about the dataset. This is a good section to make use of visuals.
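
If you want a quick first look before plotting, a per-column overview is one option. The helper `summarize` below is a hypothetical sketch and makes no assumptions about column names; the toy frame is only there to show the output shape.

```python
import pandas as pd


def summarize(df):
    """Per-column overview: dtype, share of missing values, distinct count."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean() * 100,
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })
    # Columns with the most missing data come first.
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame with one missing value in column "a".
toy = pd.DataFrame({"a": [1.0, None, 3.0, 3.0], "b": ["x", "y", "y", "y"]})
overview = summarize(toy)
```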

In [None]:
### Please begin here ###



## Feature Engineering

Our goal is to find a predictor of the `sign` of price returns for `ATM` `IWM` options.

### [a]

Use your favorite methodology to discover one or more features.
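
Whatever features you choose, you will need a target to predict. The sketch below builds per-contract returns and the sign of the next period's return. The column names `contract_id`, `timestamp` and `midquote`, and the toy prices and dates, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def add_return_target(df, id_col="contract_id", time_col="timestamp",
                      price_col="midquote"):
    """Add per-contract returns (`ret`) and the sign of the next return (`target_sign`)."""
    out = df.sort_values([id_col, time_col]).copy()
    # Return from the previous observation of the same contract.
    prev_price = out.groupby(id_col)[price_col].shift(1)
    out["ret"] = out[price_col] / prev_price - 1
    # Shift back by one so row t holds the sign of the return realised at t+1.
    # The last row of each contract has no next return and stays NaN.
    out["target_sign"] = np.sign(out.groupby(id_col)["ret"].shift(-1))
    return out


# Toy data: two contracts with made-up prices.
prices = pd.DataFrame({
    "contract_id": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2018-03-01 10:00", "2018-03-01 11:00", "2018-03-01 12:00",
        "2018-03-01 10:00", "2018-03-01 11:00",
    ]),
    "midquote": [1.0, 1.1, 0.99, 2.0, 1.0],
})
features = add_return_target(prices)
```

Grouping by contract before shifting keeps the first return of one contract from being computed against the last price of another.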

In [None]:
### Please begin here ###



## Modeling

### [a]

Use your favorite methodology to set up a model for testing. 
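
Because the data is a time series, one reasonable setup is to split chronologically so the test set lies strictly after the training set. This is a sketch of that idea, not a required approach; `chronological_split` and the `timestamp` column name are assumptions.

```python
import numpy as np
import pandas as pd


def chronological_split(df, time_col="timestamp", test_frac=0.25):
    """Split so every test row is later in time than every train row."""
    if not 0 < test_frac < 1:
        raise ValueError("test_frac must be between 0 and 1")
    times = df[time_col].drop_duplicates().sort_values()
    # The first timestamp that belongs to the test set.
    cutoff = times.iloc[int(len(times) * (1 - test_frac))]
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test


# Toy frame: 8 time steps with two rows each.
panel = pd.DataFrame({"timestamp": np.repeat(np.arange(8), 2), "x": np.arange(16)})
train, test = chronological_split(panel)
```

Any classifier can then be fit on `train` and scored on `test`. Splitting on unique timestamps rather than on row counts keeps rows from the same hour on the same side of the cutoff.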

In [None]:
### Please begin here ###



## Evaluation

### [a]

Evaluate your model results. Did it turn out how you expected? What would you do differently if given more time?
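
As one way to frame the evaluation, you could compare accuracy against the share of the most common class and inspect a confusion table. The helper below is a hypothetical sketch, and the toy labels are made up.

```python
import pandas as pd


def evaluate_signs(y_true, y_pred):
    """Return accuracy, majority-class share of y_true, and a confusion table."""
    actual = pd.Series(list(y_true), name="actual")
    predicted = pd.Series(list(y_pred), name="predicted")
    accuracy = float((actual == predicted).mean())
    # Accuracy of always predicting the most frequent class in y_true.
    baseline = float(actual.value_counts(normalize=True).max())
    confusion = pd.crosstab(actual, predicted)
    return accuracy, baseline, confusion


# Toy labels: +1 for an up move, -1 for a down move.
accuracy, baseline, confusion = evaluate_signs([1, 1, -1, 1], [1, -1, -1, 1])
```

A model whose accuracy does not beat the majority-class share has not learned much beyond the class imbalance.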

In [None]:
### Please begin here ###

