# Python + Rust: Pandas vs Polars
## Architectural Analysis of Data Processing Performance

Comparing three fundamentally different approaches:
- **Pandas (NumPy)**: Python + single-threaded C (eager evaluation)
- **Pandas + PyArrow**: Python + Columnar Memory (eager evaluation)
- **Polars (Rust)**: Rust + Multithreading + Lazy Evaluation + Query Optimization

In [11]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from pathlib import Path
import warnings
import numpy as np

warnings.filterwarnings('ignore')

# Load results
results_dir = Path('results')

# Load only files that exist
dfs = []
engines_present = []

if (results_dir / 'pandas.csv').exists():
    pandas_df = pd.read_csv(results_dir / 'pandas.csv')
    pandas_df['engine'] = 'Pandas (NumPy)'
    dfs.append(pandas_df)
    engines_present.append('Pandas (NumPy)')

if (results_dir / 'pandas-pyarrow.csv').exists():
    pandas_pyarrow_df = pd.read_csv(results_dir / 'pandas-pyarrow.csv')
    pandas_pyarrow_df['engine'] = 'Pandas + PyArrow'
    dfs.append(pandas_pyarrow_df)
    engines_present.append('Pandas + PyArrow')

if (results_dir / 'polars.csv').exists():
    polars_df = pd.read_csv(results_dir / 'polars.csv')
    polars_df['engine'] = 'Polars (Rust)'
    dfs.append(polars_df)
    engines_present.append('Polars (Rust)')

if not dfs:
    raise FileNotFoundError("No CSV files found in results/. Run 'uv run demo benchmark' first.")

# Combine all results
combined_df = pd.concat(dfs, ignore_index=True)

# Average results by scenario and engine to handle multiple runs
combined_df = combined_df.groupby(['scenario', 'engine'], as_index=False).agg({
    'time_seconds': 'mean',
    'memory_mb': 'mean'
})

# Scenario mapping
all_scenarios = ['small', 'medium', 'large', 'xlarge']
all_labels = ['Small (1K)', 'Medium (1M)', 'Large (10M)', 'XLarge (100M)']

# Filter to only scenarios that exist in data
present_scenarios = combined_df['scenario'].unique()
scenario_order = [s for s in all_scenarios if s in present_scenarios]
scenario_labels = [all_labels[all_scenarios.index(s)] for s in scenario_order]
scenario_map = dict(zip(scenario_order, scenario_labels))
combined_df['scenario_label'] = combined_df['scenario'].map(scenario_map)

print(f"Engines found: {', '.join(engines_present)}")
print(f"Scenarios found: {', '.join(scenario_order)}")
print("\nCombined results (means):")
print(combined_df.to_string(index=False))

Engines found: Pandas (NumPy), Pandas + PyArrow, Polars (Rust)
Scenarios found: small, medium, large, xlarge

Combined results (means):
scenario           engine  time_seconds    memory_mb scenario_label
   large   Pandas (NumPy)     11.344068  2091.189024    Large (10M)
   large Pandas + PyArrow      3.917661  1440.267561    Large (10M)
   large    Polars (Rust)      0.293151     0.020244    Large (10M)
  medium   Pandas (NumPy)      1.582817   252.575122    Medium (1M)
  medium Pandas + PyArrow      0.936156   211.091220    Medium (1M)
  medium    Polars (Rust)      0.047100     0.017561    Medium (1M)
   small   Pandas (NumPy)      0.457220    57.996098     Small (1K)
   small Pandas + PyArrow      0.330210    57.865366     Small (1K)
   small    Polars (Rust)      0.015354     0.000976     Small (1K)
  xlarge   Pandas (NumPy)    162.222210 20316.686829  XLarge (100M)
  xlarge Pandas + PyArrow     33.711366 13571.578537  XLarge (100M)
  xlarge    Polars (Rust)      

## 1. The Polars Takeoff: Execution Time (Logarithmic Scale)

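**Why Logarithmic?** Mean execution times span about four orders of magnitude, from roughly 0.015 s (Polars, Small) to roughly 162 s (Pandas, XLarge), so on a linear scale every bar except the XLarge ones would be invisible. The log scale shows that the gap is consistent rather than confined to large inputs: Polars is roughly 30x to 39x faster than Pandas (NumPy) in every scenario measured, and about 31x faster at 100M rows.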

**Key Message**: In this benchmark, only the multithreaded Rust engine finishes the 100M-row run in single-digit seconds.
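The ratios behind this chart can be checked by hand. The sketch below recomputes the Pandas-to-Polars speedup for the three scenarios whose mean times appear in full in the combined results printed earlier, plus the number of decades the y-axis has to cover. The names `mean_times` and `slowest_xlarge` are illustrative only, and the XLarge speedup is left out because the Polars timing for that row is cut off in the printout.

```python
import math

# Mean times in seconds, copied from the combined results printed earlier.
mean_times = {
    "small": {"Pandas (NumPy)": 0.457220, "Pandas + PyArrow": 0.330210, "Polars (Rust)": 0.015354},
    "medium": {"Pandas (NumPy)": 1.582817, "Pandas + PyArrow": 0.936156, "Polars (Rust)": 0.047100},
    "large": {"Pandas (NumPy)": 11.344068, "Pandas + PyArrow": 3.917661, "Polars (Rust)": 0.293151},
}

# Speedup of Polars over Pandas (NumPy) in each scenario.
speedups = {
    scenario: times["Pandas (NumPy)"] / times["Polars (Rust)"]
    for scenario, times in mean_times.items()
}
for scenario, ratio in speedups.items():
    print(f"{scenario:>6}: Polars is {ratio:.1f}x faster than Pandas (NumPy)")

# Slowest measurement overall: Pandas (NumPy) on the XLarge scenario.
slowest_xlarge = 162.222210
fastest = min(t for times in mean_times.values() for t in times.values())

# Number of powers of ten between the fastest and slowest bars.
decades = math.log10(slowest_xlarge / fastest)
print(f"Bars span about {decades:.1f} decades")
```

A span of several decades is what makes a linear axis unreadable here, independent of which engine wins.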

In [12]:
xlarge_time_data = combined_df[combined_df['scenario'] == 'xlarge'].sort_values('time_seconds')
fastest_time = xlarge_time_data['time_seconds'].iloc[0]
slowest_time = xlarge_time_data['time_seconds'].iloc[-1]
speedup_ratio = slowest_time / fastest_time

fig_time_log = px.bar(
    combined_df.sort_values(['scenario', 'engine']),
    x='scenario_label',
    y='time_seconds',
    color='engine',
    barmode='group',
    title='Execution Time: The Polars Architectural Advantage (Log Scale)',
    labels={'time_seconds': 'Time (seconds, log scale)', 'scenario_label': 'Dataset Size'},
    color_discrete_map={
        'Pandas (NumPy)': '#1f77b4',
        'Pandas + PyArrow': '#ff7f0e',
        'Polars (Rust)': '#2ca02c'
    },
    category_orders={'scenario_label': scenario_labels},
    hover_data={'time_seconds': ':.4f', 'scenario': False}
)

fig_time_log.update_yaxes(type='log')

fig_time_log.update_layout(
    height=700,
    template='plotly_white',
    font=dict(size=13),
    hovermode='x unified',
    showlegend=True,
    legend=dict(x=0.02, y=0.98, bgcolor='rgba(255,255,255,0.8)'),
    margin=dict(t=120)
)

fig_time_log.add_annotation(
    text=f"<b>At 100M rows: Polars is {speedup_ratio:.1f}x faster than Pandas</b><br>Multithreading + Lazy Evaluation wins",
    xref="paper", yref="paper",
    x=0.5, y=1.08, showarrow=False,
    font=dict(size=13, color='#2ca02c'),
    bgcolor='rgba(255,255,200,0.8)',
    bordercolor='#2ca02c',
    borderwidth=2,
    borderpad=10
)

fig_time_log.show()

## 2. The Memory Wall: Peak RAM Usage (Large Scenarios Only)

**Architectural Insight**: At 100M rows, watch the bar heights:
- **Pandas bar**: 20 GB (ceiling)
- **Pandas+PyArrow bar**: 13 GB (still massive)
- **Polars bar**: Invisible (30 KB)

**The roughly 677,000x difference** (20 GB vs 30 KB), as measured by this benchmark, is consistent with lazy pushdown working as intended: Polars reads only the columns and rows the query needs, while Pandas' eager evaluation allocates the full file plus overhead.

**This is why we over-provision hardware for Pandas applications.**
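Why skipping materialization lowers peak memory can be illustrated without either library. The following standard-library sketch is only an analogy, not how Polars or Pandas are implemented: `eager_sum` builds every intermediate list, the way an eager pipeline materializes each step, while `lazy_sum` fuses the filter into the scan using generators. The function names, `N`, and the filter condition are made up for illustration, and `tracemalloc` counts Python-level allocations only.

```python
import tracemalloc

N = 200_000  # illustrative row count, unrelated to the benchmark sizes


def peak_bytes(fn):
    """Run fn() and return (result, peak bytes traced by tracemalloc)."""
    tracemalloc.start()
    try:
        result = fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def eager_sum():
    # Eager: materialize all rows, then a filtered copy, then aggregate.
    rows = [i * 2 for i in range(N)]
    kept = [v for v in rows if v % 3 == 0]
    return sum(kept)


def lazy_sum():
    # Lazy: values flow one at a time through the filter into the sum,
    # so no intermediate list is ever held in memory.
    rows = (i * 2 for i in range(N))
    return sum(v for v in rows if v % 3 == 0)


eager_result, eager_peak = peak_bytes(eager_sum)
lazy_result, lazy_peak = peak_bytes(lazy_sum)

print(f"eager: result={eager_result}, peak={eager_peak / 1024:.0f} KiB")
print(f"lazy:  result={lazy_result}, peak={lazy_peak / 1024:.0f} KiB")
```

Both functions return the same sum; only the peak allocation differs, which is the effect the memory chart is meant to show at a much larger scale.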

In [13]:
large_scenarios = combined_df[combined_df['scenario'].isin(['large', 'xlarge'])].copy()
large_scenarios['memory_gb'] = large_scenarios['memory_mb'] / 1024

xlarge_memory = combined_df[combined_df['scenario'] == 'xlarge'].sort_values('memory_mb')
# One row per engine after the groupby above, so indexing by engine yields scalars
memory_by_engine = xlarge_memory.set_index('engine')['memory_mb']
polars_memory_mb = memory_by_engine['Polars (Rust)']
pandas_memory_mb = memory_by_engine['Pandas (NumPy)']
polars_memory_kb = polars_memory_mb * 1024
pandas_memory_gb = pandas_memory_mb / 1024
# Small epsilon guards against division by zero if Polars reports 0 MB
memory_ratio = pandas_memory_mb / (polars_memory_mb + 0.0001)

fig_memory = px.bar(
    large_scenarios.sort_values(['scenario', 'engine']),
    x='scenario_label',
    y='memory_gb',
    color='engine',
    barmode='group',
    title='Peak Memory Usage: The Lazy Evaluation Advantage (10M & 100M Rows)',
    labels={'memory_gb': 'Memory (GB)', 'scenario_label': 'Dataset Size'},
    color_discrete_map={
        'Pandas (NumPy)': '#1f77b4',
        'Pandas + PyArrow': '#ff7f0e',
        'Polars (Rust)': '#2ca02c'
    },
    category_orders={'scenario_label': ['Large (10M)', 'XLarge (100M)']},
    hover_data={'memory_gb': ':.3f', 'memory_mb': ':.0f'}
)

fig_memory.update_layout(
    height=700,
    template='plotly_white',
    font=dict(size=13),
    hovermode='x unified',
    showlegend=True,
    legend=dict(x=0.02, y=0.98, bgcolor='rgba(255,255,255,0.8)'),
    margin=dict(t=120)
)

fig_memory.add_annotation(
    text=f"<b>At 100M rows: Polars uses {polars_memory_kb:.0f} KB vs Pandas' {pandas_memory_gb:.1f} GB</b><br>That's a {memory_ratio:,.0f}x difference! Lazy evaluation proves its worth.",
    xref="paper", yref="paper",
    x=0.5, y=1.08, showarrow=False,
    font=dict(size=13, color='#d62728'),
    bgcolor='rgba(255,200,200,0.8)',
    bordercolor='#d62728',
    borderwidth=2,
    borderpad=10
)

fig_memory.show()

## 3. The Hidden Cost: RAM Overhead Factor (100M Scenario)

**What is Overhead Factor?** The ratio of peak RAM used ÷ file size on disk.

For a ~500 MB Parquet file at 100M rows:
- **Pandas**: Requires ~40x the file size in RAM (eager loading + memory representation overhead)
- **Pandas + PyArrow**: Requires ~27x the file size in RAM (columnar helps, but still eager)
- **Polars**: Requires ~0x the file size in RAM (lazy evaluation + smart query optimization)

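**This explains infrastructure costs**: a machine running this Pandas workload needs RAM equal to roughly 27x to 40x the on-disk size of its input, so memory rather than storage sets the hardware budget. Pandas forces this over-provisioning.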
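The multipliers quoted above follow directly from the definition. As a quick check, the sketch below divides the XLarge peak-memory means from the combined results by 509.9 MB, the file size this notebook reports for the XLarge file. The variable names are illustrative, and Polars is omitted because its XLarge memory value is cut off in the printed results.

```python
# Peak memory in MB for the XLarge scenario, from the combined results.
peak_memory_mb = {
    "Pandas (NumPy)": 20316.686829,
    "Pandas + PyArrow": 13571.578537,
}

# On-disk size of the XLarge Parquet file, as reported by this notebook.
file_size_mb = 509.9

# Overhead factor = peak RAM used / file size on disk.
overhead = {engine: mb / file_size_mb for engine, mb in peak_memory_mb.items()}

for engine, factor in overhead.items():
    gb = peak_memory_mb[engine] / 1024
    print(f"{engine:<18} {factor:5.1f}x file size ({gb:.1f} GB peak)")
```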

In [14]:
fact_file = Path('data') / 'fact_content_performance_xlarge.parquet'

if fact_file.exists():
    file_size_mb = fact_file.stat().st_size / (1024 * 1024)
    print(f"Actual file size: {file_size_mb:.1f} MB")
else:
    file_size_mb = 500
    print(f"Using estimated file size: {file_size_mb:.1f} MB")

xlarge_data = combined_df[combined_df['scenario'] == 'xlarge'].copy()
xlarge_data['overhead_factor'] = xlarge_data['memory_mb'] / file_size_mb

pandas_overhead = xlarge_data[xlarge_data['engine'] == 'Pandas (NumPy)']['overhead_factor'].iloc[0]
pandas_memory_gb = xlarge_data[xlarge_data['engine'] == 'Pandas (NumPy)']['memory_mb'].iloc[0] / 1024

fig_overhead = px.bar(
    xlarge_data,
    x='engine',
    y='overhead_factor',
    title=f'The Real Cost: RAM Overhead Factor @ 100M Rows<br><sub>({file_size_mb:.0f} MB file = how much RAM needed?)</sub>',
    labels={'overhead_factor': 'RAM Used ÷ File Size (multiplier)', 'engine': 'Engine'},
    color='engine',
    color_discrete_map={
        'Pandas (NumPy)': '#1f77b4',
        'Pandas + PyArrow': '#ff7f0e',
        'Polars (Rust)': '#2ca02c'
    },
    text='overhead_factor',
    hover_data={'overhead_factor': ':.1f', 'memory_mb': ':.0f'}
)

fig_overhead.update_traces(textposition='outside', texttemplate='<b>%{text:.1f}x</b>')

fig_overhead.update_layout(
    height=700,
    template='plotly_white',
    font=dict(size=13),
    showlegend=False,
    hovermode='x unified',
    yaxis_title='Overhead Factor (RAM/FileSize)',
    margin=dict(t=120)
)

fig_overhead.add_annotation(
    text=f"<b>Pandas costs {pandas_overhead:.0f}x more memory than the file size.</b><br>For {file_size_mb:.0f} MB data, Pandas needs {pandas_memory_gb:.1f} GB. This forces data center over-provisioning.",
    xref="paper", yref="paper",
    x=0.5, y=1.08, showarrow=False,
    font=dict(size=13, color='#8B0000'),
    bgcolor='rgba(255,220,220,0.9)',
    bordercolor='#8B0000',
    borderwidth=2,
    borderpad=10
)

fig_overhead.show()

Actual file size: 509.9 MB


## Summary: The Architectural Case for Polars

| Metric | Pandas (NumPy) | Pandas+PyArrow | Polars (Rust) |
|--------|---|---|---|
| **Speed @ 100M** | Baseline | 4.8x faster | **31.1x faster** |
| **Memory @ 100M** | 20 GB | 13.3 GB | **30 KB** |
| **Overhead Factor** | 40x file size | 27x file size | **~0x file size** |
| **Threading Model** | Single-threaded (GIL) | Single-threaded (GIL) | **Multithreaded** |
| **Execution Model** | Eager | Eager | **Lazy (optimized)** |
| **Query Optimization** | None | None | **Predicate Pushdown** |
| **Infrastructure Cost** | High over-provisioning | High over-provisioning | **Efficient scaling** |

### Bottom Line
**Polars isn't just faster; in this benchmark it is architecturally better suited to large datasets.** The combination of Rust (no GIL), multithreading, and lazy evaluation with query optimization avoids the memory wall that limits how far Pandas scales.

In [15]:
print("\n" + "="*100)
print("PERFORMANCE COMPARISON ACROSS ALL SCENARIOS")
print("="*100)

for scenario in scenario_order:
    scenario_data = combined_df[combined_df['scenario'] == scenario].copy()
    scenario_label = scenario_map.get(scenario, scenario)
    
    if scenario_data.empty:
        continue
    
    scenario_data_sorted = scenario_data.sort_values('time_seconds')
    
    print(f"\n### {scenario_label} ###")
    
    fastest_time = scenario_data_sorted['time_seconds'].iloc[0]
    
    for _, row in scenario_data_sorted.iterrows():
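        # Speed relative to the fastest engine in this scenario:
        # the fastest engine prints 1.0x, slower engines print values below 1.0x
        speedup = fastest_time / row['time_seconds'] if row['time_seconds'] > 0 else float('inf')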
        
        memory_str = f"{row['memory_mb']:7.0f} MB"
        
        if row['memory_mb'] > 0:
            file_scenario = f"fact_content_performance_{scenario}.parquet"
            fact_file_scenario = Path('data') / file_scenario
            if fact_file_scenario.exists():
                file_size_mb_scenario = fact_file_scenario.stat().st_size / (1024 * 1024)
                overhead = row['memory_mb'] / file_size_mb_scenario
                overhead_str = f"({overhead:5.1f}x file size)"
            else:
                overhead_str = ""
        else:
            overhead_str = "(negligible)"
        
        print(f"  {row['engine']:20} | Time: {row['time_seconds']:7.1f}s ({speedup:5.1f}x) | Memory: {memory_str} {overhead_str}")

print("\n" + "="*100)


PERFORMANCE COMPARISON ACROSS ALL SCENARIOS

### Small (1K) ###
  Polars (Rust)        | Time:     0.0s (  1.0x) | Memory:       0 MB (  0.1x file size)
  Pandas + PyArrow     | Time:     0.3s (  0.0x) | Memory:      58 MB (5816.9x file size)
  Pandas (NumPy)       | Time:     0.5s (  0.0x) | Memory:      58 MB (5830.1x file size)

### Medium (1M) ###
  Polars (Rust)        | Time:     0.0s (  1.0x) | Memory:       0 MB (  0.0x file size)
  Pandas + PyArrow     | Time:     0.9s (  0.1x) | Memory:     211 MB ( 41.6x file size)
  Pandas (NumPy)       | Time:     1.6s (  0.0x) | Memory:     253 MB ( 49.8x file size)

### Large (10M) ###
  Polars (Rust)        | Time:     0.3s (  1.0x) | Memory:       0 MB (  0.0x file size)
  Pandas + PyArrow     | Time:     3.9s (  0.1x) | Memory:    1440 MB ( 28.3x file size)
  Pandas (NumPy)       | Time:    11.3s (  0.0x) | Memory:    2091 MB ( 41.0x file size)

### XLarge (100M) ###
  Polars (Rust)        | Time:     5.2s (  1.0x) | Memory:       0 