## Dask:
Why is Dask better than Pandas?

Pandas is better suited to small-to-medium-sized datasets that fit into memory, while Dask is designed to handle larger-than-memory datasets through parallel and distributed computing.

Pandas is easier to use, but Dask's scalability makes it the better choice once a dataset no longer fits in memory.

In [2]:
#pip install "dask[complete]"

### Dask has three parallel collections: DataFrames, Bags, and Arrays

Its main features are:

Parallel operation on NumPy array and Pandas DataFrame objects.

Integration with other projects.

Distributed computing.

Fast operation, thanks to low overhead and minimal serialization.

Resilient execution on clusters with thousands of cores.

Real-time feedback and diagnostics.

## Demonstration:

In [5]:
# Importing pandas, numpy and datetime
import numpy as np
import pandas as pd
from datetime import datetime

In [6]:
# Importing dask dataframe
import dask
import dask.dataframe as dd

In [19]:
df = pd.read_csv('penguins2.csv')
df

Unnamed: 0,Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
0,Torgersen,39.1,18.7,181.0,3750.0,0
1,Torgersen,39.5,17.4,186.0,3800.0,0
2,Torgersen,40.3,18.0,195.0,3250.0,0
3,Torgersen,0.0,0.0,0.0,0.0,0
4,Torgersen,36.7,19.3,193.0,3450.0,0
...,...,...,...,...,...,...
339,Dream,55.8,19.8,207.0,4000.0,2
340,Dream,43.5,18.1,202.0,3400.0,2
341,Dream,49.6,18.2,193.0,3775.0,2
342,Dream,50.8,19.0,210.0,4100.0,2


In [7]:
# code to measure the time taken to do an operation
start_time = datetime.now() 
# insert the code here
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00
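
For operations this short, `datetime.now()` can round down to zero, as the output above shows. One possible alternative is a small context manager built on `time.perf_counter()`, which is a high-resolution clock intended for measuring intervals. The helper name `timed` and the `timings` dictionary are hypothetical and not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and optionally store the elapsed seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.6f} s")


# Usage: wrap any operation to be measured.
timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Collecting the results in a dictionary makes it easy to compare several operations side by side afterwards.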


In [8]:
# Get the file size
import os
file_size = os.path.getsize("/Users/Pranay/Downloads/penguins.csv")
print('File size in bytes is {}'.format(file_size))

File size in bytes is 9878


## Dask over Pandas:
Reading a file — Pandas & Dask:

In [10]:
# Reading a file with pandas (the sample file here has only a few hundred rows)
start_time = datetime.now() 
pandasfile = pd.read_csv('penguins2.csv')
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))


Time elapsed (hh:mm:ss.ms) 0:00:00.022563


In [11]:

# Reading the same file with dask
# Note: dd.read_csv is lazy, so this times building the task graph, not loading the full data
start_time = datetime.now() 
daskFile = dd.read_csv('penguins2.csv')
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.022976


## Appending two files — Pandas & Dask:

In [12]:
# let's read another file for appending pandas dataframe
start_time = datetime.now() 
pandasAppend= pd.read_csv('penguins2.csv')
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.012193


In [13]:
# Appending pandas dataframes
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
start_time = datetime.now() 
finalFile = pd.concat([pandasfile, pandasAppend])
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.013071


In [14]:
# let's read another file for appending dask dataframe
start_time = datetime.now() 
daskAppend = dd.read_csv('penguins2.csv')
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.010624


In [15]:
# Appending dask dataframes with dd.concat, the counterpart of pd.concat
start_time = datetime.now() 
daskFinalFile = dd.concat([daskFile, daskAppend])
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.039625


## Grouping the data — Pandas & Dask:

In [21]:
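# Grouping with pandas
# Note: groupby alone only builds a GroupBy object; an aggregation such as .mean() is needed to do real work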
start_time = datetime.now() 
finalFile.groupby('Island')['BodyMass']
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.008446


In [22]:
# Grouping with dask
# Note: this is lazy and has no aggregation, so almost nothing is measured here
start_time = datetime.now() 
daskFinalFile.groupby('Island')['BodyMass']
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00


## Merging the datasets — Pandas & Dask:

In [24]:
# Merging the datasets with pandas
start_time = datetime.now() 
mergedPandas= pd.merge(pandasfile, pandasAppend, on="BodyMass")
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.021897


In [25]:
# Merging the datasets with dask
start_time = datetime.now() 
mergedDask= dd.merge(daskFile, daskAppend, on="BodyMass")
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.043800


## Pandas over Dask:
## Sorting — Pandas & Dask:

In [26]:
# Sorting the data using pandas
start_time = datetime.now() 
sortedPandas=finalFile.sort_values(by='Island')
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.008225


In [27]:
# Sorting the data using dask
start_time = datetime.now() 
sortedDask=daskFinalFile.sort_values(by='Island')
time_elapsed = datetime.now() - start_time
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.097403


## Unique & notNA — Pandas & Dask:

In [35]:
# Getting not NA values using pandas
start_time = datetime.now() 
notnaPandas=finalFile.notna()
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00


In [40]:
# Getting unique values using pandas
start_time = datetime.now() 
uniquePandas=pd.unique(pandasfile['Island'])
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.008070


In [41]:
# Getting not NA values using dask
# notna() raised AttributeError in the Dask version used here; notnull() is the equivalent
start_time = datetime.now() 
notnaDask=daskFinalFile.notnull()
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

In [38]:
# Getting unique values using dask
# dask.dataframe has no module-level unique(); use the Series method (lazy until .compute())
start_time = datetime.now() 
uniqueDask=daskFinalFile['Island'].unique()
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

## Saving a Dataframe to a file — Pandas & Dask:

In [43]:
# saving a file using pandas
# to_csv is a DataFrame method, so save finalFile rather than the NumPy array returned by pd.unique
start_time = datetime.now() 
finalFile.to_csv('finalFile.csv')
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

In [32]:
# saving a file using dask (without single_file=True, Dask writes one file per partition)
start_time = datetime.now() 
daskFinalFile.to_csv('FinalFile.csv')
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.057822


In [33]:
# compute() materializes the dask dataframe as a pandas dataframe, which can then be saved with to_csv
start_time = datetime.now() 
computedFrame = daskFinalFile.compute()
time_elapsed = datetime.now() - start_time 
print('Time elapsed (hh:mm:ss.ms) {}'.format(time_elapsed))

Time elapsed (hh:mm:ss.ms) 0:00:00.041831


## Conclusion:

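Pandas and Dask work best together, because each covers the other's limitations: Pandas is fast and convenient for data that fits in memory, while Dask scales the same style of code to larger-than-memory data.

Used on its own, each library has drawbacks. Pandas is limited by available memory, and Dask adds scheduling overhead that can make it slower on small files such as the one used here.

Hence, we conclude that using Pandas together with Dask can save you a lot of time and resources.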

## Tip:

In [36]:
# The C parser engine is the fast option and is already the pandas default;
# pandas 1.4 and later also accept engine='pyarrow'
pd.read_csv('penguins2.csv', engine = 'c')

Unnamed: 0,Island,CulmenLength,CulmenDepth,FlipperLength,BodyMass,Species
0,Torgersen,39.1,18.7,181.0,3750.0,0
1,Torgersen,39.5,17.4,186.0,3800.0,0
2,Torgersen,40.3,18.0,195.0,3250.0,0
3,Torgersen,0.0,0.0,0.0,0.0,0
4,Torgersen,36.7,19.3,193.0,3450.0,0
...,...,...,...,...,...,...
339,Dream,55.8,19.8,207.0,4000.0,2
340,Dream,43.5,18.1,202.0,3400.0,2
341,Dream,49.6,18.2,193.0,3775.0,2
342,Dream,50.8,19.0,210.0,4100.0,2
