# Dask - A Faster Alternative to Pandas

This notebook compares the performance of Dask and Pandas on large datasets.

We will compare the following operations:
- Reading Large Dataset
- Grouping By and Aggregation
- Merging Datasets
- Filtering Data
- Apply Function
- Distributed Computing

My Current Device Specs -

- OS - Windows 11
- Processor - 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz
- RAM - 32 GB
- Cores - 8
- GPU - NVIDIA GeForce RTX 3060


Dask Documentation - [LINK](https://docs.dask.org/en/stable/)

__Libraries Used:__

1. [Pandas](https://pandas.pydata.org/)
2. [Dask](https://www.dask.org/)
3. [Time](https://docs.python.org/3/library/time.html)
4. [Numpy](https://numpy.org/)

__Let's get started!__

First, let's install the libraries and import them.

In [109]:
%pip install -q pandas
%pip install -q dask
%pip install -q numpy

Note: you may need to restart the kernel to use updated packages.


In [110]:
import pandas as pd
import dask.dataframe as dd
import time
import numpy as np

### Reading Large Dataset

Let's generate and read a dataset with 10 million rows and 3 columns.

In [111]:
pd.DataFrame({
               'A': np.random.randint(0, 100, size=10000000),
               'B': np.random.randint(0, 100, size=10000000),
               'C': np.random.randint(0, 100, size=10000000),
              }).to_csv('dataset.csv', index=False)

In [112]:
start_time = time.time()
df = pd.read_csv('dataset.csv')
pandas_time = time.time() - start_time
print(f"Pandas: shape = {df.shape}, time = {pandas_time} seconds")


Pandas: shape = (10000000, 3), time = 0.9480597972869873 seconds


In [113]:
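# Read the same file using Dask.
# Note: dd.read_csv is lazy, so the timed call only builds a task graph;
# the data is actually loaded by .compute() in the print statement, outside the timer.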
start_time = time.time()
dask_df = dd.read_csv('dataset.csv')
dask_time = time.time() - start_time
print(f"Dask: shape = {dask_df.compute().shape}, time = {dask_time} seconds")

Dask: shape = (10000000, 3), time = 0.00696563720703125 seconds


In [114]:
df.head()

| | A | B | C |
|---|---|---|---|
| 0 | 49 | 52 | 73 |
| 1 | 32 | 64 | 97 |
| 2 | 85 | 26 | 5 |
| 3 | 97 | 76 | 18 |
| 4 | 36 | 17 | 68 |


In [115]:
dask_df.head()

| | A | B | C |
|---|---|---|---|
| 0 | 49 | 52 | 73 |
| 1 | 32 | 64 | 97 |
| 2 | 85 | 26 | 5 |
| 3 | 97 | 76 | 18 |
| 4 | 36 | 17 | 68 |


### Grouping By and Aggregation

In [116]:
# Time the groupby operation using Pandas
start_time = time.time()
pandas_grouped = df.groupby(['A', 'B']).agg({'C': 'sum'})
pandas_time = time.time() - start_time
print(f"Pandas: Time = {pandas_time} seconds")

Pandas: Time = 0.450380802154541 seconds


In [117]:
# Time the groupby operation using Dask
# Note: without .compute(), this measures task-graph construction only, not the aggregation
start_time = time.time()
dask_groupby = dask_df.groupby(['A', 'B']).agg({'C': 'sum'})
dask_time = time.time() - start_time
print(f"Dask: Time = {dask_time} seconds")

Dask: Time = 0.005944252014160156 seconds


### Merging Datasets

In [118]:
# Merge using Pandas
start_time = time.time()
merged_pandas = pd.merge(df, df)
pandas_time = time.time() - start_time
print(f"Pandas: Time = {pandas_time} seconds")

Pandas: Time = 6.819661378860474 seconds


In [119]:
# Merge using Dask
# Note: without .compute(), this measures task-graph construction only, not the merge
start_time = time.time()
merged_dask = dd.merge(dask_df, dask_df)
dask_time = time.time() - start_time
print(f"Dask: Time = {dask_time} seconds")

Dask: Time = 0.012986898422241211 seconds


### Filtering Data

In [120]:
# Filtering using Pandas
# Note: column A only holds values from 0 to 99, so this condition selects no rows
start_time = time.time()
selected_pandas = df[df['A'] > 5000000]
pandas_time = time.time() - start_time
print(f"Pandas: Time = {pandas_time} seconds")

Pandas: Time = 0.024980783462524414 seconds


In [121]:
# Filtering using Dask
# Note: without .compute(), this measures task-graph construction only, not the filtering
start_time = time.time()
selected_dask = dask_df[dask_df['A'] > 5000000]
dask_time = time.time() - start_time
print(f"Dask: Time = {dask_time} seconds")

Dask: Time = 0.0059452056884765625 seconds


### Apply Function

In [122]:
# Function to perform Apply on
def my_function(x):
    return x * 2

In [123]:
# Apply the function to column A using Pandas
start_time = time.time()
applied_pandas = df['A'].apply(my_function)
pandas_time = time.time() - start_time
print(f"Pandas: Time = {pandas_time} seconds")

Pandas: Time = 2.7721312046051025 seconds


In [124]:
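# Map the function over column A using Dask
# Note: without .compute(), this measures task-graph construction only, not the mapping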
start_time = time.time()
applied_dask = dask_df['A'].map(my_function)
dask_time = time.time() - start_time
print(f"Dask: Time = {dask_time} seconds")

Dask: Time = 0.001998424530029297 seconds


### Distributed Computing

In [125]:
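# Apply my_function to each partition of column A with map_partitions
# Note: this only builds a lazy task graph; no sum is computed and no distributed cluster is started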
start_time = time.time()
applied_dask = dask_df['A'].map_partitions(my_function)
dask_time = time.time() - start_time
print(f"Dask: Time = {dask_time} seconds")

Dask: Time = 0.008009672164916992 seconds
