# **`DASK`, AN ALTERNATIVE TO `PANDAS`**
---
<img src="Imperial_logo.png" align = "left" width=200>
 <br><br>
 
- Copyright (c) Antoine Jacquier, 2024. All rights reserved

- Author: Jack Jacquier <a.jacquier@imperial.ac.uk>

- Platform: Tested on Windows 10 with Python 3.9

# S\&P 500 European options volume study: from `pandas` to `dask`

In [None]:
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
from datetime import timedelta

In [None]:
df = pd.read_csv("../Data files/SPXOptions2018.csv")
## Data from https://optionmetrics.com/, available via Imperial
df.head()

In [None]:
df.info()

In [None]:
df.memory_usage(index=True)

In [None]:
df.memory_usage(index=True).sum()

In [None]:
df["date"] = pd.to_datetime(df["date"], format='%Y%m%d')
df["exdate"] = pd.to_datetime(df["exdate"], format='%Y%m%d')
df["timeToExpDays"] = (df["exdate"] - df["date"]).dt.days ## expiry in days
df.head()

## Analysis of the volume distribution

In [None]:
t0 = time.time()
df_pandas_vol = df.groupby(["timeToExpDays"], dropna=False, observed=True).agg({"volume": "sum"})
dt_pandas_groupby = time.time() - t0
print("pandas time: ", dt_pandas_groupby)
df_pandas_vol.head()

In [None]:
perc = (100./df_pandas_vol.sum()).values
truncation_level = 200

fig, ax1 = plt.subplots(figsize=(8, 4))

ax1.set_xlabel("Time to expiration (in days)")
ax1.set_ylabel("Volume distribution (in %)", color='black')
ax1.tick_params(axis='y', labelcolor='black')
ax1.plot(df_pandas_vol.index[:truncation_level], perc*df_pandas_vol["volume"][:truncation_level], 'k.')


ax2 = ax1.twinx()
ax2.set_ylabel("Cumulative volume (in %)", color='blue')
ax2.tick_params(axis='y', labelcolor='blue')
ax2.plot(df_pandas_vol.index[:truncation_level], perc*np.cumsum(df_pandas_vol["volume"][:truncation_level]), 'b.')
plt.title("Volume distribution in 2018 (in %)")
fig.tight_layout()
plt.show()

In [None]:
#del df

## Working with a much larger DataFrame (as an example of "Big Data")
The cell below creates the large file. It may be skipped if the file already exists at `pathToFile`, in which case the file is imported directly in the following cell.

In [None]:
## Cell that creates a very large file by stacking ten copies of the previous one (with shifted dates)
## Note that the last line (saving it as .csv) takes a long time
pathToFile = "../largeDataFrame.csv"
frames = []
for i in range(1, 11):
    df_temp = df.copy()
    df_temp["date"] = df_temp["date"] - pd.Timedelta(days=1000*i)
    frames.append(df_temp)
    print(i, i * len(df))  ## cumulative number of rows so far

## Concatenate once, rather than inside the loop, to avoid repeated copies
df_large = pd.concat(frames, ignore_index=True)
del df_temp, frames
print("****************")
print("Saving to drive....")
df_large.to_csv(pathToFile, index=False)  ## index=False avoids an extra "Unnamed: 0" column on re-import
print("Saved to drive")

In [None]:
pathToFile = "../largeDataFrame.csv"
t0 = time.time()
df_large = pd.read_csv(pathToFile)
dt_pandas_import = time.time() - t0
print("pandas import time:", dt_pandas_import)

In [None]:
df_large.head()

In [None]:
df_large.info()

In [None]:
## Groupby action on the bigger DataFrame
t0 = time.time()
df_pandas_vol = df_large.groupby(["timeToExpDays"], dropna=False, observed=True).agg({"volume": "sum"})
dt_pandas_groupby = time.time() - t0
print("pandas time: ", dt_pandas_groupby)
df_pandas_vol.head()

In [None]:
df_pandas_vol.info()

# Introducing `dask`


Dask was created by Matthew Rocklin in December 2014.

https://examples.dask.org/dataframe.html


Examples of users:

- Walmart, to forecast the demand for 500,000,000 store-item combinations: https://www.nvidia.com/en-us/glossary/dask/
- Blue Yonder, to process terabytes of data on a daily basis: https://tech.blueyonder.com/dask-usage-at-blue-yonder/
- Capital One, for big data analytics: https://www.nvidia.com/en-us/glossary/dask/

In [None]:
import dask.dataframe as dd

In [None]:
t0 = time.time()
df_dask = dd.read_csv(pathToFile)
dt_dask_import = time.time() - t0
print("dask import time:", dt_dask_import)

df_dask.head()

In [None]:
t0 = time.time()
df_dask_volume = df_dask.groupby(["timeToExpDays"], dropna=False, observed=True).agg({"volume": "sum"})
dt_dask_groupby = time.time() - t0
print("dask time: ", dt_dask_groupby)

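Operations on a `dask` DataFrame are *lazy*: they are only computed when the result is actually needed, for example by `head()` or `compute()`. The time measured above is therefore that of building the task graph, not of performing the aggregation.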

In [None]:
df_dask_volume.info()

In [None]:
df_dask_volume.index

In [None]:
df_dask_volume.head()

In [None]:
#df_dask_volume.compute()

### Plotting?
`matplotlib` is the go-to library for plotting `pandas` DataFrames. However, it can be slow and cumbersome with very large datasets.

We instead use `hvplot` for `dask` DataFrames.

In [None]:
import hvplot.dask

In [None]:
df_dask_volume.hvplot.scatter(x='timeToExpDays', y='volume')

In [None]:
del df_dask_volume, df_dask

## Using `parquet` for partitioning
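Parquet is a popular columnar file format designed for efficient data storage and retrieval. `dask` writes one Parquet file per partition, and the columnar layout makes it possible to read back only the columns that are needed.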

In [None]:
df_dask_part = dd.read_csv(pathToFile, blocksize="25MB")  ## one partition per block of roughly 25 MB
print("Number of partitions:", df_dask_part.npartitions)

In [None]:
df_dask_part.to_parquet("../to/output", name_function=lambda i: f"data-{i}.parquet")
del df_dask_part

In [None]:
## name_function is only an option of to_parquet, so it is not passed when reading
df_parq = dd.read_parquet("../to/output/", engine="pyarrow", columns=["volume", "timeToExpDays"])

In [None]:
df_parq.info()

In [None]:
## Lazy operation: this only times the construction of the task graph
t0 = time.time()
df_dask_volume_parq = df_parq.groupby(["timeToExpDays"], dropna=False, observed=True).agg({"volume": "sum"})
dt_parq_groupby = time.time() - t0
print("parquet time: ", dt_parq_groupby)

In [None]:
df_dask_volume_parq.head()

In [None]:
df_dask_volume_parq.sort_values("timeToExpDays").head()

In [None]:
df_dask_volume_parq.hvplot.scatter(x='timeToExpDays', y='volume')

In [None]:
del df_dask_volume_parq, df_parq

### Partitioning with fixed number of partitions

In [None]:
from dask.dataframe import from_pandas

In [None]:
t0 = time.time()
df_from_pandas = from_pandas(df_large, npartitions=5)
dt_dask_part_import = time.time() - t0
print("dask partitioning import: ", dt_dask_part_import)

## Lazy operation: this only times the construction of the task graph
t0 = time.time()
df_dask_volume_from_pandas = df_from_pandas.groupby(["timeToExpDays"], dropna=False, observed=True).agg({"volume": "sum"})
dt_dask_part_groupby = time.time() - t0
print("dask partitioning time: ", dt_dask_part_groupby)

df_dask_volume_from_pandas.head()
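
Conceptually, a partitioned `groupby` sum aggregates each partition separately and then combines the partial results. The plain `pandas` sketch below mimics this on toy data (values made up for illustration); it illustrates the idea only and is not a description of how `dask` schedules the work internally.

```python
import pandas as pd

# Toy data, purely illustrative (not the options dataset)
pdf = pd.DataFrame({
    "timeToExpDays": [1, 2, 1, 3, 2, 1, 3, 2, 1, 3],
    "volume": [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
})

n_partitions = 5
rows_per_part = -(-len(pdf) // n_partitions)  # ceiling division

# Step 1: aggregate each partition on its own
partials = [
    pdf.iloc[start:start + rows_per_part].groupby("timeToExpDays")["volume"].sum()
    for start in range(0, len(pdf), rows_per_part)
]

# Step 2: combine the partial sums by summing again over the group key
combined = pd.concat(partials).groupby(level=0).sum()

# Compare with a single groupby over the whole DataFrame
direct = pdf.groupby("timeToExpDays")["volume"].sum()
print(combined.equals(direct))
```

This works because a sum of partial sums equals the overall sum; aggregations such as the median do not decompose this simply.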

# PS: Overview of `pandas` alternatives
Source: https://www.altexsoft.com/blog/pandas-library/

<img src="pandas_alternatives.png" align = "left" width=1000>
