# Pandas, Dask and Polars

* Pandas <br>
General-purpose data analysis library, suited to small and medium datasets that fit in memory

* Polars <br>
Fast, multi-threaded DataFrame library, suited to speed-critical processing of large files


* Dask <br>
Parallel and distributed computation, suited to large datasets and out-of-core processing



## Installation

**pip install pandas** <br>
**pip install polars** <br>
**pip install "dask[complete]"**

## Basic Syntax Comparison

**CSV example (`sample_data.csv`):**

name,age,salary <br>
Alice,30,70000<br>
Bob,25,50000<br>
Carol,27,60000<br>
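
To follow along, the sample file can be created from Python. This is a minimal sketch that uses only the standard library and writes the three rows shown above to `sample_data.csv` in the current working directory (the location is an assumption; the cells below only require that the file is readable under that name).

```python
import csv
from pathlib import Path

# Header plus the three data rows from the CSV example
rows = [
    ("name", "age", "salary"),
    ("Alice", 30, 70000),
    ("Bob", 25, 50000),
    ("Carol", 27, 60000),
]

csv_path = Path("sample_data.csv")
with csv_path.open("w", newline="") as f:
    # lineterminator="\n" keeps the file free of Windows-style line endings
    csv.writer(f, lineterminator="\n").writerows(rows)

print(csv_path.read_text())
```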

In [1]:
import pandas as pd

pd_df = pd.read_csv('sample_data.csv')

print(pd_df['salary'].mean())

60000.0


In [2]:
import dask.dataframe as dd

dd_df = dd.read_csv('sample_data.csv')
print(dd_df['salary'].mean().compute())

# .compute() is needed to trigger execution; without it Dask only builds a lazy result

60000.0


In [3]:
import polars as pl

pl_df = pl.read_csv('sample_data.csv')
print(pl_df.select(pl.col('salary').mean()))

shape: (1, 1)
┌─────────┐
│ salary  │
│ ---     │
│ f64     │
╞═════════╡
│ 60000.0 │
└─────────┘


## Major Differences

| Feature            | **Pandas**            | **Polars**                             | **Dask**                              |
| ------------------ | --------------------- | -------------------------------------- | ------------------------------------- |
| Execution model    | Eager                 | Lazy + Eager                           | Lazy                                  |
| Speed              | Moderate              | Very fast (multi-threaded, Rust-based) | Scales well across cores/machines     |
| Memory usage       | High (in-memory only) | Low (zero-copy, efficient memory use)  | Out-of-core (disk + memory)           |
| Parallelism        | No                    | Yes (built-in)                         | Yes (distributed optional)            |
| Syntax familiarity | Very user-friendly    | Slightly different                     | Mostly Pandas-like                    |
| Use case           | Up to \~1M–10M rows   | 10M+ rows, speed-critical apps         | 1GB+ files, multi-GB to TB-scale data |
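
The "out-of-core" entry in the table means the data is processed in pieces rather than loaded all at once. Dask splits a dataframe into partitions and handles this automatically; the sketch below imitates the idea by hand with the `chunksize` parameter of `pd.read_csv`, only to illustrate the concept. The chunk size of 2 and the in-memory CSV text are assumptions chosen so that the three-row sample produces more than one chunk.

```python
import io

import pandas as pd

# Same rows as sample_data.csv, kept in memory so the sketch is self-contained
csv_text = "name,age,salary\nAlice,30,70000\nBob,25,50000\nCarol,27,60000\n"

total = 0
count = 0
# chunksize=2 makes read_csv return an iterator of DataFrames with at most 2 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["salary"].sum()  # running sum, one chunk in memory at a time
    count += len(chunk)

mean_salary = total / count
print(mean_salary)
```

Only the running sum and row count are kept between chunks, so peak memory depends on the chunk size rather than on the file size.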


**Lazy**: computations are recorded as a plan and not executed until a result is requested<br>
**Eager**: each computation is executed immediately
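
The difference can be sketched in plain Python, without any of the three libraries. `LazyQuery` below is a hypothetical toy class written for this illustration (it is not part of Pandas, Polars, or Dask): it only records filter steps and runs them when `collect()` is called, whereas the eager version evaluates each step as soon as it is written.

```python
rows = [
    {"name": "Alice", "age": 30, "salary": 70000},
    {"name": "Bob", "age": 25, "salary": 50000},
    {"name": "Carol", "age": 27, "salary": 60000},
]

# Eager: each step runs immediately and produces a concrete result
eager_filtered = [r for r in rows if r["age"] > 25]
eager_mean = sum(r["salary"] for r in eager_filtered) / len(eager_filtered)


class LazyQuery:
    """Hypothetical toy class: records steps and runs them only on collect()."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        # Only record the step; nothing is evaluated yet
        return LazyQuery(self.source, self.steps + (predicate,))

    def collect(self):
        # Execution happens here: apply every recorded filter in order
        result = list(self.source)
        for predicate in self.steps:
            result = [r for r in result if predicate(r)]
        return result


calls = []  # records each time the predicate is actually evaluated


def older_than_25(row):
    calls.append(row["name"])
    return row["age"] > 25


plan = LazyQuery(rows).filter(older_than_25)
calls_before_collect = len(calls)   # building the plan evaluated nothing

lazy_filtered = plan.collect()      # the predicate runs once per row here
lazy_mean = sum(r["salary"] for r in lazy_filtered) / len(lazy_filtered)

print(calls_before_collect, len(calls))
print(eager_mean, lazy_mean)
```

Because a lazy library sees the whole plan before running it, it has the chance to reorder or combine steps, which the toy class above does not attempt.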

In [4]:
import pandas as pd

pd_df = pd.read_csv('sample_data.csv')     # File is read immediately
filtered = pd_df[pd_df['age'] > 25]    # Filtering happens now
mean_salary = filtered['salary'].mean()  # Computation happens now
print(mean_salary)

65000.0


In [5]:
# Lazy execution with Polars
import polars as pl

pl_df = pl.scan_csv('sample_data.csv')           # Does NOT read the file yet
result = pl_df.filter(pl.col('age') > 25)        # Builds the query plan
result = result.select(pl.col('salary').mean())  # Still no execution
final = result.collect()                         # Executes everything here
# A lazy Polars query runs on .collect(); a lazy Dask computation runs on .compute()
print(final)

shape: (1, 1)
┌─────────┐
│ salary  │
│ ---     │
│ f64     │
╞═════════╡
│ 65000.0 │
└─────────┘


## Filtering Rows

In [6]:
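# Pandas
n = pd_df[pd_df['age'] > 25]
print(n)

# Polars: pl_df is still the LazyFrame from scan_csv above,
# so this prints the query plan; call p.collect() to get the rows
p = pl_df.filter(pl.col('age') > 25)
print(p)

# Dask: this only builds a lazy plan, so print shows the DataFrame structure
d = dd_df[dd_df['age'] > 25]
print(d)

# Call .compute() to execute the Dask filter and get the rows
result = dd_df[dd_df['age'] > 25].compute()
print(result)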

    name  age  salary
0  Alice   30   70000
2  Carol   27   60000
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [(col("age")) > (25)]
FROM
  Csv SCAN [sample_data.csv] [id: 3829098743232]
  PROJECT */3 COLUMNS
Dask DataFrame Structure:
                 name    age salary
npartitions=1                      
               string  int64  int64
                  ...    ...    ...
Dask Name: getitem, 5 expressions
Expr=Filter(frame=ArrowStringConversion(frame=FromMapProjectable(99f3ce3)), predicate=ArrowStringConversion(frame=FromMapProjectable(99f3ce3))['age'] > 25)
    name  age  salary
0  Alice   30   70000
2  Carol   27   60000


## Group By and Aggregation

In [7]:
# Pandas
n = pd_df.groupby('name')['salary'].mean()
print(n)

# Polars (the order of the groups in the result is not guaranteed)
pl_df = pl.read_csv('sample_data.csv')  # eager mode: the file is loaded immediately
p = pl_df.group_by('name').agg(pl.col('salary').mean())
print(p)

# Dask
d = dd_df.groupby('name')['salary'].mean().compute()
print(d)

name
Alice    70000.0
Bob      50000.0
Carol    60000.0
Name: salary, dtype: float64
shape: (3, 2)
┌───────┬─────────┐
│ name  ┆ salary  │
│ ---   ┆ ---     │
│ str   ┆ f64     │
╞═══════╪═════════╡
│ Alice ┆ 70000.0 │
│ Carol ┆ 60000.0 │
│ Bob   ┆ 50000.0 │
└───────┴─────────┘
name
Alice    70000.0
Bob      50000.0
Carol    60000.0
Name: salary, dtype: float64


## When to use what

| Use Case                                  | Recommended Library    |
| ----------------------------------------- | ---------------------- |
| Simple data analysis (<500MB)             | Pandas                 |
| Performance-critical, multi-core systems  | Polars                 |
| Large datasets that don’t fit in memory   | Dask                   |
| Chunked, streaming-style batch processing | Dask or Polars Lazy    |
| Familiar syntax and learning path         | Pandas → Dask → Polars |
