# FireDucks: A Modern Alternative to Pandas

## Introduction
FireDucks is a powerful alternative to pandas, designed to be a drop-in 
replacement that can seamlessly integrate with your existing pandas code.

## System Requirements
### Prerequisites
- Python (version >3.8, <=3.12)
- Linux environment (x86_64 architecture) or Windows with WSL


## FireDucks Advantages

#### ✨ Key Benefits:

1. **Massive Speedup**
   - Dramatically faster data processing
   - Optimized execution model

2. **100% Compatibility with Existing Pandas Code**
   - Works with all pandas operations
   - No need to learn new syntax

3. **Zero Code Change**
   - Direct replacement for pandas
   - No refactoring needed

4. **Effortless / Super Easy to Use**
   - Simple pip installation
   - Immediate integration


## Installation
    pip install fireducks


## Two Ways to Use FireDucks

### 1. Import Hook (For Existing Projects)
Perfect for existing pandas projects - no code changes needed! FireDucks will automatically replace all pandas imports with its own implementation.

#### For Terminal/Python Scripts:

    python3 -m fireducks.pandas your_script.py

#### For Jupyter Notebook:
    %load_ext fireducks.pandas
    import pandas as pd    

> **Note:** Use the import hook for existing programs. It also covers pandas imports in deeply nested modules and files, which you would otherwise have to find and edit one by one.
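
To see why a hook can reach deeply nested imports, recall that Python caches modules in `sys.modules`: once a name is registered there, every later `import` of that name returns the registered module. The sketch below shows only that general mechanism, using standard-library modules. The alias name `fast_json` is made up, and FireDucks's actual hook may be implemented differently.

```python
import json
import sys

# Register the stdlib json module under a second, made-up name.
sys.modules["fast_json"] = json

# Any later import of that name is served from sys.modules,
# no matter which file or package the import statement lives in.
import fast_json as aliased

print(aliased is json)
print(aliased.dumps({"make": "TESLA"}))
```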

### 2. Explicit Import (For New Projects)
For new projects, you can directly import FireDucks instead of pandas. This is the most straightforward approach when starting fresh.

    import fireducks.pandas as pd
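
If the same script must also run where FireDucks is not installed (for example on a platform outside the prerequisites above), one option is a guarded import. This is a minimal sketch; the helper name `load_dataframe_module` is hypothetical and not part of FireDucks.

```python
import importlib


def load_dataframe_module():
    """Return fireducks.pandas if importable, else pandas, else None."""
    for name in ("fireducks.pandas", "pandas"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate
    return None


pd = load_dataframe_module()
backend = pd.__name__ if pd is not None else "none installed"
print(f"DataFrame backend: {backend}")
```

The rest of the program keeps using `pd` as usual, so switching backends stays a one-line concern.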


## Execution Model: Eager vs Lazy
### Pandas (Eager Execution)
Pandas executes operations immediately when they are called:

<img src="pandasExecutionModel.png" alt="pandasExecutionModel">

In [1]:

    import pandas as pd
    df = pd.read_csv("Electric_Vehicle_Population_Data.csv")  # reads the file immediately
    df

|  | VIN (1-10) | County | City | State | Postal Code | Model Year | Make | Model | Electric Vehicle Type | Clean Alternative Fuel Vehicle (CAFV) Eligibility | Electric Range | Base MSRP | Legislative District | DOL Vehicle ID | Vehicle Location | Electric Utility | 2020 Census Tract |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1N4AZ0CP8D | King | Shoreline | WA | 98177.0 | 2013 | NISSAN | LEAF | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 75.0 | 0.0 | 32.0 | 125450447 | POINT (-122.36498 47.72238) | CITY OF SEATTLE - (WA)\|CITY OF TACOMA - (WA) | 5.303302e+10 |
| 1 | 5YJSA1E45K | King | Seattle | WA | 98112.0 | 2019 | TESLA | MODEL S | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 270.0 | 0.0 | 43.0 | 101662900 | POINT (-122.30207 47.64085) | CITY OF SEATTLE - (WA)\|CITY OF TACOMA - (WA) | 5.303301e+10 |
| 2 | WVGUNPE28M | Kitsap | Olalla | WA | 98359.0 | 2021 | VOLKSWAGEN | ID.4 | Battery Electric Vehicle (BEV) | Eligibility unknown as battery range has not b... | 0.0 | 0.0 | 26.0 | 272118717 | POINT (-122.54729 47.42602) | PUGET SOUND ENERGY INC | 5.303509e+10 |
| 3 | JTDKARFP6H | Thurston | Olympia | WA | 98501.0 | 2017 | TOYOTA | PRIUS PRIME | Plug-in Hybrid Electric Vehicle (PHEV) | Not eligible due to low battery range | 25.0 | 0.0 | 22.0 | 349372929 | POINT (-122.89166 47.03956) | PUGET SOUND ENERGY INC | 5.306701e+10 |
| 4 | 1FADP5CU9G | Thurston | Olympia | WA | 98506.0 | 2016 | FORD | C-MAX | Plug-in Hybrid Electric Vehicle (PHEV) | Not eligible due to low battery range | 19.0 | 0.0 | 22.0 | 171625653 | POINT (-122.87741 47.05997) | PUGET SOUND ENERGY INC | 5.306701e+10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 216767 | 1G1RB6E44D | Douglas | East Wenatchee | WA | 98802.0 | 2013 | CHEVROLET | VOLT | Plug-in Hybrid Electric Vehicle (PHEV) | Clean Alternative Fuel Vehicle Eligible | 38.0 | 0.0 | 12.0 | 122822822 | POINT (-120.29473 47.41515) | PUD NO 1 OF DOUGLAS COUNTY | 5.301795e+10 |
| 216768 | KNDCS3LF9R | Whatcom | Bellingham | WA | 98229.0 | 2024 | KIA | NIRO | Plug-in Hybrid Electric Vehicle (PHEV) | Clean Alternative Fuel Vehicle Eligible | 33.0 | 0.0 | 40.0 | 267143887 | POINT (-122.45486 48.7449) | PUGET SOUND ENERGY INC\|\|PUD NO 1 OF WHATCOM CO... | 5.307300e+10 |
| 216769 | 7SAYGAEE9R | King | Redmond | WA | 98052.0 | 2024 | TESLA | MODEL Y | Battery Electric Vehicle (BEV) | Eligibility unknown as battery range has not b... | 0.0 | 0.0 | 48.0 | 274988388 | POINT (-122.13158 47.67858) | PUGET SOUND ENERGY INC\|\|CITY OF TACOMA - (WA) | 5.303303e+10 |
| 216770 | 1G1RB6E49D | Pierce | Gig Harbor | WA | 98329.0 | 2013 | CHEVROLET | VOLT | Plug-in Hybrid Electric Vehicle (PHEV) | Clean Alternative Fuel Vehicle Eligible | 38.0 | 0.0 | 26.0 | 117353064 | POINT (-122.6658 47.38336) | BONNEVILLE POWER ADMINISTRATION\|\|CITY OF TACOM... | 5.305307e+10 |


In [2]:

    # Eager execution (pandas)
    import time

    t0 = time.time()
    df = pd.read_csv("Electric_Vehicle_Population_Data.csv")  # reads the file immediately
    t1 = time.time()
    eager_read_time = t1 - t0
    print(f"Read CSV took: {eager_read_time:.4f} seconds")

    t0 = time.time()
    df = df[df['Make'] == 'TESLA']['Electric Range']  # executes the operation immediately
    t1 = time.time()
    eager_operation_time = t1 - t0
    print(f"Operations took: {eager_operation_time:.4f} seconds")

    t0 = time.time()
    df.to_csv("write.csv")  # writes immediately
    t1 = time.time()
    eager_write_time = t1 - t0
    print(f"Write CSV took: {eager_write_time:.4f} seconds")

    eager_total = eager_read_time + eager_operation_time + eager_write_time
    print(f"Total eager execution time: {eager_total:.4f} seconds")

Output:

    Read CSV took: 0.5366 seconds
    Operations took: 0.0274 seconds
    Write CSV took: 0.0704 seconds
    Total eager execution time: 0.6344 seconds


### FireDucks (Lazy Execution)
FireDucks delays operations until results are actually needed:

<img src="fireducksExecutionModel.png" alt="fireducksExecutionModel">

In [3]:

    # Lazy execution (FireDucks)
    import fireducks.pandas as pd
    import time

    t0 = time.time()
    df = pd.read_csv("Electric_Vehicle_Population_Data.csv")  # only plans the read
    t1 = time.time()
    lazy_plan_read = t1 - t0
    print(f"Delayed read CSV took: {lazy_plan_read:.4f} seconds")

    t0 = time.time()
    df = df[df['Make'] == 'TESLA']['Electric Range']  # only plans the filter
    t1 = time.time()
    lazy_plan_operation = t1 - t0
    print(f"Delayed filter took: {lazy_plan_operation:.4f} seconds")

    t0 = time.time()
    df.to_csv("filtered.csv")  # now executes all planned actions
    t1 = time.time()
    lazy_execute_all = t1 - t0
    print(f"Executing all delayed operations took: {lazy_execute_all:.4f} seconds")

    lazy_total = lazy_plan_read + lazy_plan_operation + lazy_execute_all
    print(f"Total lazy execution time: {lazy_total:.4f} seconds")

Output:

    Delayed read CSV took: 0.0149 seconds
    Delayed filter took: 0.0003 seconds
    Executing all delayed operations took: 0.0826 seconds
    Total lazy execution time: 0.0978 seconds
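
The plan-now, run-later behaviour can be imitated in a few lines of plain Python. This is a toy sketch of deferred execution in general, not FireDucks's implementation; the class `LazyPipeline` and its methods are invented for illustration.

```python
class LazyPipeline:
    """Toy deferred pipeline: records steps, runs them only on demand."""

    def __init__(self, source, steps=()):
        self._source = source       # zero-argument callable that loads the data
        self._steps = tuple(steps)  # functions to apply, in order

    def apply(self, func):
        # Planning only: return a new pipeline with one more recorded step.
        return LazyPipeline(self._source, self._steps + (func,))

    def execute(self):
        # The real work happens here, when a result is actually needed.
        data = self._source()
        for step in self._steps:
            data = step(data)
        return data


calls = []  # records every time the data source is really read


def load_rows():
    calls.append("load")
    return [
        {"Make": "TESLA", "Electric Range": 270},
        {"Make": "NISSAN", "Electric Range": 75},
        {"Make": "TESLA", "Electric Range": 0},
    ]


plan = (
    LazyPipeline(load_rows)
    .apply(lambda rows: [r for r in rows if r["Make"] == "TESLA"])
    .apply(lambda rows: [r["Electric Range"] for r in rows])
)
loads_before = len(calls)  # building the plan has not read anything yet
result = plan.execute()
loads_after = len(calls)
print(loads_before, loads_after, result)
```

Building `plan` is cheap because nothing is read, which mirrors the very small "delayed" timings in the FireDucks cell.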


### Speedup Analysis: Eager vs Lazy

In [4]:

    speedup = eager_total / lazy_total
    print(f"\nSpeedup Analysis:")
    print(f"Eager Execution total time: {eager_total:.4f} seconds")
    print(f"Lazy Execution total time: {lazy_total:.4f} seconds")
    print(f"FireDucks is {speedup:.1f}x faster than regular pandas")

Output:

    Speedup Analysis:
    Eager Execution total time: 0.6344 seconds
    Lazy Execution total time: 0.0978 seconds
    FireDucks is 6.5x faster than regular pandas
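
Single-run wall-clock timings like these are sensitive to disk caching and background load. A slightly more robust approach is to repeat the measurement and keep the best run. The sketch below uses the standard-library `time.perf_counter`; the helper name `best_time` is hypothetical, and the workload is a stand-in rather than a real DataFrame pipeline.

```python
import time


def best_time(func, repeats=5):
    """Return the fastest wall-clock time of `repeats` calls to func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; replace it with the pipeline you want to measure.
elapsed = best_time(lambda: sum(range(100_000)))
print(f"Best of 5: {elapsed:.6f} seconds")

# Speedup recomputed from the totals reported above.
eager_total = 0.6344
lazy_total = 0.0978
speedup_text = f"{eager_total / lazy_total:.1f}x"
print(f"Speedup: {speedup_text}")
```

When timing FireDucks this way, make sure the measured function ends with something that forces execution, such as writing the result; otherwise only the planning step is timed.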


<img src="fireducksbenchmark(1).png" alt="benchmark">

> **NOTE:** Call `_evaluate()` on a result to bypass delayed execution and run the operation immediately. This is useful when you want to time each step separately.

In [5]:

    t0 = time.time()
    df = pd.read_csv("Electric_Vehicle_Population_Data.csv")._evaluate()  # forces immediate execution
    t1 = time.time()
    print(f"Read CSV (evaluated immediately) took: {t1-t0:.4f} seconds")

    t0 = time.time()
    df = df.sort_values(by="Model Year")._evaluate()  # forces immediate execution
    t1 = time.time()
    print(f"Sort (evaluated immediately) took: {t1-t0:.4f} seconds")

    t0 = time.time()
    df.to_csv("sorted.csv")
    t1 = time.time()
    print(f"Write CSV took: {t1-t0:.4f} seconds")

Output:

    Read CSV (evaluated immediately) took: 0.1256 seconds
    Sort (evaluated immediately) took: 0.0975 seconds
    Write CSV took: 0.1474 seconds
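
If the same benchmark code should run under both pandas and FireDucks, the `_evaluate()` call can be wrapped so it is skipped when the object does not provide it. This is a small sketch; `force` and `FakeLazyFrame` are hypothetical names, and the stand-in class exists only so the example runs without FireDucks installed.

```python
def force(obj):
    """Run pending work if obj exposes _evaluate(); otherwise return obj as is."""
    evaluate = getattr(obj, "_evaluate", None)
    return evaluate() if callable(evaluate) else obj


class FakeLazyFrame:
    """Stand-in for a lazy DataFrame, used only to exercise force()."""

    def __init__(self):
        self.evaluated = False

    def _evaluate(self):
        self.evaluated = True
        return self


evaluated_frame = force(FakeLazyFrame())  # has _evaluate(): it gets called
plain_list = force([1, 2, 3])             # no _evaluate(): returned unchanged
print(evaluated_frame.evaluated, plain_list)
```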


### 1. Converts Python Code to IR (Intermediate Representation)
FireDucks converts your Python DataFrame operations into an intermediate representation (IR), a specialized intermediate language that the compiler can analyze and optimize before anything runs.
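
To make the idea concrete, here is a toy tracer that records DataFrame-style calls as data instead of executing them. `Op`, `Tracer`, `filter_eq`, and `select` are invented for illustration; they are not FireDucks APIs and say nothing about what FireDucks's real IR looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One toy IR instruction: an operation name plus its arguments."""
    name: str
    args: tuple


class Tracer:
    """Records DataFrame-style calls as a list of Op objects."""

    def __init__(self, ops=()):
        self.ops = tuple(ops)

    def _emit(self, name, *args):
        # Nothing is executed; the call is only appended to the program.
        return Tracer(self.ops + (Op(name, args),))

    def read_csv(self, path):
        return self._emit("read_csv", path)

    def filter_eq(self, column, value):
        return self._emit("filter_eq", column, value)

    def select(self, *columns):
        return self._emit("select", *columns)


program = (
    Tracer()
    .read_csv("Electric_Vehicle_Population_Data.csv")
    .filter_eq("Make", "TESLA")
    .select("Electric Range")
)
for op in program.ops:
    print(op.name, op.args)
```

Because the program is now plain data, a later pass can inspect and rewrite it before any row is touched.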

### 2. Automatic DataFrame Optimization
The compiler analyzes your DataFrame operations and:
- Optimizes column selections
- Minimizes memory usage
- Reduces redundant operations
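
As an illustration of optimizing column selection, the sketch below works out which columns a plan actually needs and drops the rest up front. It is a toy model on lists of dictionaries, assuming plans made of equality filters followed by one final select; the function names and plan format are hypothetical.

```python
def required_columns(plan):
    """Collect the columns used by a plan of filters ending in one select."""
    needed = set()
    for name, args in reversed(plan):
        if name == "select":
            needed = set(args)   # only these columns reach the output
        elif name == "filter_eq":
            needed.add(args[0])  # the filter also reads its own column
    return needed


def run_plan(rows, plan, prune=False):
    """Execute the plan; with prune=True, drop unused columns first."""
    if prune:
        keep = required_columns(plan)
        rows = [{c: r[c] for c in keep} for r in rows]
    for name, args in plan:
        if name == "filter_eq":
            column, value = args
            rows = [r for r in rows if r[column] == value]
        elif name == "select":
            rows = [{c: r[c] for c in args} for r in rows]
    return rows


ROWS = [
    {"Make": "TESLA", "Model": "MODEL S", "City": "Seattle", "Electric Range": 270},
    {"Make": "NISSAN", "Model": "LEAF", "City": "Shoreline", "Electric Range": 75},
    {"Make": "TESLA", "Model": "MODEL Y", "City": "Redmond", "Electric Range": 0},
]
PLAN = [("filter_eq", ("Make", "TESLA")), ("select", ("Electric Range",))]

result_plain = run_plan(ROWS, PLAN)
result_pruned = run_plan(ROWS, PLAN, prune=True)
print(sorted(required_columns(PLAN)))
print(result_plain == result_pruned)
```

Both runs return the same rows, but the pruned run carries two columns through the filter instead of four.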

### 3. Expert-Level Optimizations
Automatically applies optimizations that typically require deep DataFrame knowledge:
- Reorders operations for efficiency
- Uses column-oriented processing
- Optimizes data access patterns
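
One example of reordering, chosen here for illustration because the list above does not say which reorderings FireDucks applies, is moving a filter ahead of a sort so that fewer rows are sorted. The toy sketch below uses a hypothetical plan format and helper names, and relies on Python's stable sort to keep the result identical.

```python
def push_filters_before_sorts(plan):
    """Swap each filter with a sort that directly precedes it."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            if plan[i][0] == "sort" and plan[i + 1][0] == "filter_eq":
                plan[i], plan[i + 1] = plan[i + 1], plan[i]
                changed = True
    return plan


def run_counting_sorted_rows(rows, plan):
    """Execute the plan and report how many rows went through a sort."""
    sorted_rows = 0
    for name, args in plan:
        if name == "sort":
            sorted_rows += len(rows)
            rows = sorted(rows, key=lambda r: r[args[0]])
        elif name == "filter_eq":
            rows = [r for r in rows if r[args[0]] == args[1]]
    return rows, sorted_rows


VEHICLES = [
    {"Make": "NISSAN", "Model Year": 2013},
    {"Make": "TESLA", "Model Year": 2019},
    {"Make": "VOLKSWAGEN", "Model Year": 2021},
    {"Make": "TOYOTA", "Model Year": 2017},
    {"Make": "TESLA", "Model Year": 2024},
]
original_plan = [("sort", ("Model Year",)), ("filter_eq", ("Make", "TESLA"))]
optimized_plan = push_filters_before_sorts(original_plan)

original_result, original_cost = run_counting_sorted_rows(VEHICLES, original_plan)
optimized_result, optimized_cost = run_counting_sorted_rows(VEHICLES, optimized_plan)
print(original_result == optimized_result, original_cost, optimized_cost)
```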

### 4. Consistent Results
While making all these optimizations, FireDucks ensures:
- Same output as regular pandas
- No change in data accuracy
- Maintains data integrity


## How FireDucks Achieves Better Performance (What Happens During Lazy Execution)


### 1. Compiler Optimization
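FireDucks uses a compiler that optimizes your code before running it. The example below uses plain pandas to show, by hand, the kind of rewrite the compiler applies automatically: selecting only the needed columns before filtering.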

In [6]:

    import pandas as pd
    import time

    df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

    # Unoptimized version (what we would normally write)
    t0 = time.time()
    unoptimized = df[df['Make'] == 'TESLA']['Electric Range']
    t1 = time.time()
    print(f"Unoptimized took: {t1-t0:.4f} seconds")

    # Optimized version
    t0 = time.time()
    filtered_df = df[['Make', 'Electric Range']]  # select only the needed columns
    optimized = filtered_df[filtered_df['Make'] == 'TESLA']['Electric Range']
    t1 = time.time()
    print(f"Optimized took: {t1-t0:.4f} seconds")

Output:

    Unoptimized took: 0.0212 seconds
    Optimized took: 0.0167 seconds


> **NOTE:** The benefit of this optimization is more apparent with larger datasets.

## More on Optimization

### FireDucks Multi-threaded Backend

* Uses Apache Arrow for data storage (pandas 2.0 also supports Arrow, as an optional backend)
* Splits work across multiple CPU cores
* Adds extra parallelization on top of Arrow's capabilities
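
The split, compute, and combine pattern behind multi-core processing can be sketched with the standard library. This only shows the shape of the idea: FireDucks does this work in its native backend, while pure-Python threads like these do not speed up CPU-bound code. The helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split values into at most n_chunks contiguous slices."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, workers=4):
    """Sum each chunk in a worker thread, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunked(values, workers)))
    return sum(partials)


# Electric Range values taken from the sample rows shown earlier.
electric_range = [75, 270, 0, 25, 19, 38, 33, 0, 38]
total = parallel_sum(electric_range)
print(total, total == sum(electric_range))
```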

### Apache Arrow Integration

Apache Arrow is a universal, language-independent columnar data format:

* Traditional DataFrame tools are usually language-specific (pandas for Python, `data.frame` for R)
* Arrow works across multiple programming languages (Python, R, Java, C++, Rust, etc.)
* Systems that share the Arrow format can exchange data without conversion
* Its columnar layout allows more efficient memory usage

<!-- <img src="apacheArrowDiagram.png" alt="apache arrow"> -->
