
Vanessa Wehbe

# Experiment: pandas vs Dask on the NYC Parking Tickets dataset


**Dataset:** [NYC Parking Tickets (kaggle.com)](https://www.kaggle.com/datasets/new-york-city/nyc-parking-tickets?resource=download)



In [8]:
!pip install memory-profiler psutil


Collecting memory-profiler
  Downloading memory_profiler-0.61.0-py3-none-any.whl.metadata (20 kB)
Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)
Installing collected packages: memory-profiler
Successfully installed memory-profiler-0.61.0


In [9]:
# Imports
import time
import pandas as pd
import dask.dataframe as dd
import os
from memory_profiler import memory_usage
import psutil

In [None]:
csv_path = "/content/Parking_Violations_Issued_-_Fiscal_Year_2017.csv"

## 1. Using pandas

In [10]:
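def get_memory_mb():
    """Return the resident set size (RSS) of this Python process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)

# Load the whole CSV eagerly. The RSS difference is an after-minus-before
# estimate of what the DataFrame holds, not the peak reached while parsing.
mem_before = get_memory_mb()
t0 = time.time()
df = pd.read_csv(csv_path, low_memory=False)
t1 = time.time()
mem_after = get_memory_mb()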

print(f"Pandas load time: {t1 - t0:.2f} seconds")
print(f"Memory used: {mem_after - mem_before:.2f} MB")

# Count tickets per vehicle make and keep the 10 most frequent makes
t0 = time.time()
result_pd = df.groupby("Vehicle Make").size().nlargest(10)
t1 = time.time()
print(result_pd)
print(f"Pandas groupby time: {t1 - t0:.2f} seconds")

Pandas load time: 41.55 seconds
Memory used: 1584.40 MB
Vehicle Make
FORD     738281
TOYOT    696746
HONDA    620879
NISSA    528741
CHEVR    412110
FRUEH    248388
ME/BE    224140
BMW      215784
DODGE    214393
JEEP     200299
dtype: int64
Pandas groupby time: 0.35 seconds
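
The cell above measures time and memory by hand around each step. The same pattern could be wrapped in a small helper; the sketch below is one hedged way to do it, not the method used in this notebook. The name `measure` is hypothetical, and it swaps in `tracemalloc` from the standard library, which reports the peak of allocations traced through Python's allocator hooks rather than process RSS, so its numbers are not directly comparable to `get_memory_mb()`. Tracing also slows the measured code.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure(label):
    """Record wall-clock time and peak traced memory for the enclosed block.

    Hypothetical helper: it uses tracemalloc (traced allocations), not RSS.
    """
    stats = {}
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield stats
    finally:
        stats["seconds"] = time.perf_counter() - start
        # get_traced_memory() returns (current_bytes, peak_bytes)
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        stats["peak_mb"] = peak_bytes / (1024 ** 2)
        print(f"{label}: {stats['seconds']:.2f} s, "
              f"peak traced memory {stats['peak_mb']:.2f} MB")


# Toy workload standing in for pd.read_csv(...) or ddf...compute()
with measure("toy allocation") as stats:
    data = [0] * 1_000_000
```

In the notebook, the body of the `with` block would be the `pd.read_csv(...)` call or the Dask `.compute()` call, so both libraries are measured by the same code path.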


## 2. Using Dask

In [11]:
import os
import time

import dask.dataframe as dd
import psutil

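# Read these columns as strings: Dask infers dtypes from a sample of the file,
# and the inferred type can disagree with values found in later partitions.
dtype_fixes = {"House Number": "object", "Time First Observed": "object"}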

mem_before = get_memory_mb()
t0 = time.time()
ddf = dd.read_csv(csv_path, assume_missing=True, dtype=dtype_fixes, blocksize="64MB")
t1 = time.time()
mem_after = get_memory_mb()

print(f"Dask load time (lazy): {t1 - t0:.2f} seconds")
print(f"Memory used during load: {mem_after - mem_before:.2f} MB")

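# read_csv above only built a lazy task graph; .compute() is what actually
# reads the file block by block and runs the groupby.
t0 = time.time()
result_dd = ddf.groupby("Vehicle Make").size().nlargest(10).compute()
t1 = time.time()
mem_after_compute = get_memory_mb()

print(result_dd)
print(f"Dask compute time: {t1 - t0:.2f} seconds")
# RSS after compute minus RSS before the lazy load: what is still held once the
# computation has finished, not the peak reached while blocks were processed.
print(f"Memory used during compute: {mem_after_compute - mem_before:.2f} MB")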

Dask load time (lazy): 0.04 seconds
Memory used during load: 0.03 MB
Vehicle Make
FORD     773305
TOYOT    729992
HONDA    650823
NISSA    553816
CHEVR    431913
FRUEH    260345
ME/BE    234807
BMW      226354
DODGE    224654
JEEP     210118
dtype: int64
Dask compute time: 27.96 seconds
Memory used during compute: 8.50 MB
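
The two top-10 tables list the same makes in the same order, but every Dask count is higher than the matching pandas count (for example FORD: 773305 vs 738281). The notebook does not explain the gap. A minimal sketch of a consistency check follows, with the counts copied by hand from the printed outputs above rather than recomputed; the variable names are illustrative.

```python
import pandas as pd

makes = ["FORD", "TOYOT", "HONDA", "NISSA", "CHEVR",
         "FRUEH", "ME/BE", "BMW", "DODGE", "JEEP"]

# Counts copied from the printed outputs above (not recomputed here)
pandas_counts = pd.Series(
    [738281, 696746, 620879, 528741, 412110,
     248388, 224140, 215784, 214393, 200299],
    index=makes,
)
dask_counts = pd.Series(
    [773305, 729992, 650823, 553816, 431913,
     260345, 234807, 226354, 224654, 210118],
    index=makes,
)

comparison = pd.DataFrame({"pandas": pandas_counts, "dask": dask_counts})
comparison["ratio"] = comparison["dask"] / comparison["pandas"]
print(comparison.round({"ratio": 4}))

# Do both results rank the makes identically?
same_ranking = (
    list(pandas_counts.sort_values(ascending=False).index)
    == list(dask_counts.sort_values(ascending=False).index)
)
print("Same ranking:", same_ranking)
```

The Dask counts come out roughly 5% higher for every make. One possible explanation, which is an assumption and not something the notebook shows, is that the two cells did not read the same file contents (for instance, an upload that had not finished when the pandas cell ran). Comparing `len(df)` with `len(ddf)` in the notebook itself would be a quick way to test that.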
