# Pandas, Dask and Polars

* Pandas <br>
General-purpose data analysis; best suited to small and medium datasets that fit in memory

* Polars <br>
Fast, multi-threaded DataFrame library written in Rust; suited to speed-critical processing of large files

* Dask <br>
Parallel and distributed computation; suited to large datasets and out-of-core processing



### Installation

**pip install pandas <br>**
**pip install polars <br>**
**pip install "dask[complete]"**

### Basic Syntax Comparison

**CSV example (`sample_data.csv`):**

name,age,salary <br>
Alice,30,70000<br>
Bob,25,50000<br>
Carol,27,60000<br>
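
To follow along, the three rows above need to exist on disk as `sample_data.csv`. A minimal sketch using only the standard library is shown here; writing to the current working directory is an assumption, so adjust the path as needed.

```python
import csv
from pathlib import Path
from statistics import mean

# The same three rows shown in the CSV example above.
rows = [
    {"name": "Alice", "age": 30, "salary": 70000},
    {"name": "Bob", "age": 25, "salary": 50000},
    {"name": "Carol", "age": 27, "salary": 60000},
]

# Assumption: the file lives in the current working directory.
csv_path = Path("sample_data.csv")
with csv_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age", "salary"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back as a quick sanity check before loading it elsewhere.
with csv_path.open(newline="") as f:
    salaries = [int(r["salary"]) for r in csv.DictReader(f)]

print(mean(salaries))
```

The mean printed here should agree with the values the three libraries report in the cells that follow.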

In [1]:
import pandas as pd

pd_df = pd.read_csv('sample_data.csv')

print(pd_df['salary'].mean())

60000.0


In [2]:
import dask.dataframe as dd

dd_df = dd.read_csv('sample_data.csv')
print(dd_df['salary'].mean().compute())

# compute is needed to trigger execution

60000.0


In [3]:
import polars as pl

pl_df = pl.read_csv('sample_data.csv')
print(pl_df.select(pl.col('salary').mean()))

shape: (1, 1)
┌─────────┐
│ salary  │
│ ---     │
│ f64     │
╞═════════╡
│ 60000.0 │
└─────────┘


## Major Differences

| Feature            | **Pandas**            | **Polars**                             | **Dask**                              |
| ------------------ | --------------------- | -------------------------------------- | ------------------------------------- |
| Execution model    | Eager                 | Lazy + Eager                           | Lazy                                  |
| Speed              | Moderate              | Very fast (multi-threaded, Rust-based) | Scales well across cores/machines     |
| Memory usage       | High (in-memory only) | Low (zero-copy, efficient memory use)  | Out-of-core (disk + memory)           |
| Parallelism        | Mostly single-threaded | Yes (built-in)                         | Yes (distributed optional)            |
| Syntax familiarity | Very user-friendly    | Slightly different                     | Mostly Pandas-like                    |
| Use case           | Up to \~1M–10M rows   | 10M+ rows, speed-critical apps         | 1GB+ files, multi-GB to TB-scale data |


**Lazy**: computations are not executed until a result is requested<br>
**Eager**: computations are executed immediately

In [4]:
import pandas as pd

pd_df = pd.read_csv('sample_data.csv')     # File is read immediately
filtered = pd_df[pd_df['age'] > 25]    # Filtering happens now
mean_salary = filtered['salary'].mean()  # Computation happens now
print(mean_salary)

65000.0


In [5]:
#lazy
import polars as pl

pl_df = pl.scan_csv('sample_data.csv')           # Does NOT read file yet
result = pl_df.filter(pl.col('age') > 25) # Builds query plan
result = result.select(pl.col('salary').mean())  # Still no execution
final = result.collect()               # Executes everything here
# .collect() must be called for polars and .compute() must be called for dask
print(final)

shape: (1, 1)
┌─────────┐
│ salary  │
│ ---     │
│ f64     │
╞═════════╡
│ 65000.0 │
└─────────┘


## Filtering Rows

In [6]:
#Pandas
n = pd_df[pd_df['age'] > 25]
print(n)

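#Polars
# pl_df is still the LazyFrame created by scan_csv above, so this prints the query plan; call .collect() to get rows
p = pl_df.filter(pl.col('age') > 25)
print(p)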

#Dask
d = dd_df[dd_df['age'] > 25]
print(d)

# Dask only builds a lazy plan for the filter; call .compute() to execute it
result = dd_df[dd_df['age'] > 25].compute()
print(result)

    name  age  salary
0  Alice   30   70000
2  Carol   27   60000
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [(col("age")) > (25)]
FROM
  Csv SCAN [sample_data.csv] [id: 3829098743232]
  PROJECT */3 COLUMNS
Dask DataFrame Structure:
                 name    age salary
npartitions=1                      
               string  int64  int64
                  ...    ...    ...
Dask Name: getitem, 5 expressions
Expr=Filter(frame=ArrowStringConversion(frame=FromMapProjectable(99f3ce3)), predicate=ArrowStringConversion(frame=FromMapProjectable(99f3ce3))['age'] > 25)
    name  age  salary
0  Alice   30   70000
2  Carol   27   60000


## Group By and Aggregation

In [7]:
# Pandas
n = pd_df.groupby('name')['salary'].mean()
print(n)

# Polars
pl_df = pl.read_csv('sample_data.csv')  # eager mode
p = pl_df.group_by('name').agg(pl.col('salary').mean())
print(p)

# Dask
d = dd_df.groupby('name')['salary'].mean().compute()
print(d)

name
Alice    70000.0
Bob      50000.0
Carol    60000.0
Name: salary, dtype: float64
shape: (3, 2)
┌───────┬─────────┐
│ name  ┆ salary  │
│ ---   ┆ ---     │
│ str   ┆ f64     │
╞═══════╪═════════╡
│ Alice ┆ 70000.0 │
│ Carol ┆ 60000.0 │
│ Bob   ┆ 50000.0 │
└───────┴─────────┘
name
Alice    70000.0
Bob      50000.0
Carol    60000.0
Name: salary, dtype: float64


## When to use what

| Use Case                                  | Recommended Library    |
| ----------------------------------------- | ---------------------- |
| Simple data analysis (<500MB)             | Pandas                 |
| Performance-critical, multi-core systems  | Polars                 |
| Large datasets that don’t fit in memory   | Dask                   |
| Real-time streaming-like batch processing | Dask or Polars Lazy    |
| Familiar syntax and learning path         | Pandas → Dask → Polars |


In [1]:
import pandas as pd
import dask.dataframe as dd
import polars as pl
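
The cells that follow read `sample_employees.csv`, which is not listed in these notes. The sketch below recreates it from the rows visible in the printed outputs; the column order and the file location are assumptions inferred from those outputs.

```python
import pandas as pd

# Rows reconstructed from the outputs printed later in this notebook
# (assumption: the original file contains exactly these seven rows).
employees = pd.DataFrame(
    {
        "department": ["HR", "HR", "IT", "IT", "Finance", "Finance", "IT"],
        "employee": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace"],
        "salary": [50000, 52000, 60000, 61000, 58000, 59000, 62000],
        "age": [25, 30, 28, 35, 40, 38, 27],
    }
)

# index=False keeps the pandas row index out of the CSV.
employees.to_csv("sample_employees.csv", index=False)

# Sanity check: per-department means should match the group-by output below.
check = pd.read_csv("sample_employees.csv")
print(check.groupby("department")["salary"].mean())
```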

In [2]:
# read csv

# pandas 
pd_df = pd.read_csv("sample_employees.csv")

# dask
dd_df = dd.read_csv("sample_employees.csv")

# Polars
pl_df = pl.read_csv("sample_employees.csv")

In [4]:
# group by and mean

# pandas
print(pd_df.groupby("department")["salary"].mean())

# dask
print(dd_df.groupby("department")["salary"].mean().compute())

# polars
print(pl_df.group_by("department").agg(pl.col("salary").mean()))

department
Finance    58500.0
HR         51000.0
IT         61000.0
Name: salary, dtype: float64
department
Finance    58500.0
HR         51000.0
IT         61000.0
Name: salary, dtype: float64
shape: (3, 2)
┌────────────┬─────────┐
│ department ┆ salary  │
│ ---        ┆ ---     │
│ str        ┆ f64     │
╞════════════╪═════════╡
│ HR         ┆ 51000.0 │
│ IT         ┆ 61000.0 │
│ Finance    ┆ 58500.0 │
└────────────┴─────────┘


In [5]:
# filtering rows

# pandas
print(pd_df[pd_df["salary"] > 60000])

# dask
print(dd_df[dd_df["salary"] > 60000].compute())

# polars
print(pl_df.filter(pl.col("salary") > 60000))

  department employee  salary  age
3         IT    David   61000   35
6         IT    Grace   62000   27
  department employee  salary  age
3         IT    David   61000   35
6         IT    Grace   62000   27
shape: (2, 4)
┌────────────┬──────────┬────────┬─────┐
│ department ┆ employee ┆ salary ┆ age │
│ ---        ┆ ---      ┆ ---    ┆ --- │
│ str        ┆ str      ┆ i64    ┆ i64 │
╞════════════╪══════════╪════════╪═════╡
│ IT         ┆ David    ┆ 61000  ┆ 35  │
│ IT         ┆ Grace    ┆ 62000  ┆ 27  │
└────────────┴──────────┴────────┴─────┘


In [6]:
# adding new column : salary in thousands

# pandas
pd_df["salary_k"] = pd_df["salary"] / 1000
print(pd_df.head())

# dask
dd_df = dd_df.assign(salary_k = dd_df["salary"] / 1000)
print(dd_df.compute().head())

# Polars
pl_df = pl_df.with_columns((pl.col("salary") / 1000).alias("salary_k"))
print(pl_df.head())

  department employee  salary  age  salary_k
0         HR    Alice   50000   25      50.0
1         HR      Bob   52000   30      52.0
2         IT  Charlie   60000   28      60.0
3         IT    David   61000   35      61.0
4    Finance      Eve   58000   40      58.0
  department employee  salary  age  salary_k
0         HR    Alice   50000   25      50.0
1         HR      Bob   52000   30      52.0
2         IT  Charlie   60000   28      60.0
3         IT    David   61000   35      61.0
4    Finance      Eve   58000   40      58.0
shape: (5, 5)
┌────────────┬──────────┬────────┬─────┬──────────┐
│ department ┆ employee ┆ salary ┆ age ┆ salary_k │
│ ---        ┆ ---      ┆ ---    ┆ --- ┆ ---      │
│ str        ┆ str      ┆ i64    ┆ i64 ┆ f64      │
╞════════════╪══════════╪════════╪═════╪══════════╡
│ HR         ┆ Alice    ┆ 50000  ┆ 25  ┆ 50.0     │
│ HR         ┆ Bob      ┆ 52000  ┆ 30  ┆ 52.0     │
│ IT         ┆ Charlie  ┆ 60000  ┆ 28  ┆ 60.0     │
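│ IT         ┆ David    ┆ 61000  ┆ 35  ┆ 61.0     │
│ Finance    ┆ Eve      ┆ 58000  ┆ 40  ┆ 58.0     │
└────────────┴──────────┴────────┴─────┴──────────┘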
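
One difference worth noting in the cell above: the pandas line modifies `pd_df` in place, while the Dask and Polars lines return new objects. pandas also offers the non-mutating style through `assign`, sketched here on a two-row frame made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame (values taken from the first two employee rows).
df = pd.DataFrame({"employee": ["Alice", "Bob"], "salary": [50000, 52000]})

# assign() returns a new DataFrame and leaves the original untouched.
with_k = df.assign(salary_k=df["salary"] / 1000)

print(list(df.columns))
print(list(with_k.columns))
```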

In [7]:
# sorting by age

# pandas
print(pd_df.sort_values("age"))

# dask
print(dd_df.sort_values("age").compute())

# polars
print(pl_df.sort("age"))

  department employee  salary  age  salary_k
0         HR    Alice   50000   25      50.0
6         IT    Grace   62000   27      62.0
2         IT  Charlie   60000   28      60.0
1         HR      Bob   52000   30      52.0
3         IT    David   61000   35      61.0
5    Finance    Frank   59000   38      59.0
4    Finance      Eve   58000   40      58.0
  department employee  salary  age  salary_k
0         HR    Alice   50000   25      50.0
6         IT    Grace   62000   27      62.0
2         IT  Charlie   60000   28      60.0
1         HR      Bob   52000   30      52.0
3         IT    David   61000   35      61.0
5    Finance    Frank   59000   38      59.0
4    Finance      Eve   58000   40      58.0
shape: (7, 5)
┌────────────┬──────────┬────────┬─────┬──────────┐
│ department ┆ employee ┆ salary ┆ age ┆ salary_k │
│ ---        ┆ ---      ┆ ---    ┆ --- ┆ ---      │
│ str        ┆ str      ┆ i64    ┆ i64 ┆ f64      │
╞════════════╪══════════╪════════╪═════╪══════════╡
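│ HR         ┆ Alice    ┆ 50000  ┆ 25  ┆ 50.0     │
│ IT         ┆ Grace    ┆ 62000  ┆ 27  ┆ 62.0     │
│ IT         ┆ Charlie  ┆ 60000  ┆ 28  ┆ 60.0     │
│ HR         ┆ Bob      ┆ 52000  ┆ 30  ┆ 52.0     │
│ IT         ┆ David    ┆ 61000  ┆ 35  ┆ 61.0     │
│ Finance    ┆ Frank    ┆ 59000  ┆ 38  ┆ 59.0     │
│ Finance    ┆ Eve      ┆ 58000  ┆ 40  ┆ 58.0     │
└────────────┴──────────┴────────┴─────┴──────────┘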

In [13]:
# count of Employees per Department

# pandas 
print(pd_df["department"].value_counts())
print(pd_df.groupby("department")['employee'].count())

# dask 
print(dd_df["department"].value_counts().compute())

# polars
print(pl_df.group_by("department").agg(pl.len()))

department
IT         3
HR         2
Finance    2
Name: count, dtype: int64
department
Finance    2
HR         2
IT         3
Name: employee, dtype: int64
department
Finance    2
HR         2
IT         3
Name: count, dtype: int64[pyarrow]
shape: (3, 2)
┌────────────┬─────┐
│ department ┆ len │
│ ---        ┆ --- │
│ str        ┆ u32 │
╞════════════╪═════╡
│ IT         ┆ 3   │
│ Finance    ┆ 2   │
│ HR         ┆ 2   │
└────────────┴─────┘


### Join/Merge
``` python 
# Assumes each *_main and *_bonus frame has an "employee" column to join on

# pandas
pd_merged = pd.merge(pd_main, pd_bonus, on="employee", how="left")

# dask
dd_merged = dd.merge(dd_main, dd_bonus, on="employee", how="left").compute()

# polars
pl_merged = pl_main.join(pl_bonus, on="employee", how="left")

```

### Pivot Table/Melt


In [27]:
# Pandas
print(pd.pivot_table(pd_df, values="salary", index="department", aggfunc="mean"))
print(pd_df.melt(id_vars=["employee"], value_vars=["salary", "age"]))

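# Dask
# dd.pivot_table exists, but it requires a `columns` argument that is a categorical column with known categories,
# so an index-only pivot like the pandas call above is usually written as a groupby; dd_df.melt(...) mirrors pandas melt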

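# Polars
# Note: this pivot creates one column per distinct salary (first employee per cell); it is not the per-department mean computed by pandas above
print(pl_df.pivot("salary", index="department", values="employee", aggregate_function="first"))
print(pl_df.unpivot(index="employee", on=["salary", "age"]))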


             salary
department         
Finance     58500.0
HR          51000.0
IT          61000.0
   employee variable  value
0     Alice   salary  50000
1       Bob   salary  52000
2   Charlie   salary  60000
3     David   salary  61000
4       Eve   salary  58000
5     Frank   salary  59000
6     Grace   salary  62000
7     Alice      age     25
8       Bob      age     30
9   Charlie      age     28
10    David      age     35
11      Eve      age     40
12    Frank      age     38
13    Grace      age     27
shape: (3, 8)
┌────────────┬───────┬───────┬─────────┬───────┬───────┬───────┬───────┐
│ department ┆ 50000 ┆ 52000 ┆ 60000   ┆ 61000 ┆ 58000 ┆ 59000 ┆ 62000 │
│ ---        ┆ ---   ┆ ---   ┆ ---     ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ str        ┆ str   ┆ str   ┆ str     ┆ str   ┆ str   ┆ str   ┆ str   │
╞════════════╪═══════╪═══════╪═════════╪═══════╪═══════╪═══════╪═══════╡
│ HR         ┆ Alice ┆ Bob   ┆ null    ┆ null  ┆ null  ┆ null  ┆ null  │
│ IT         ┆ null  ┆ null 