# Comparative Performance Analysis of Pandas, NumPy, and Polars

This notebook contains execution time, memory usage, and CPU utilization tests comparing three popular Python data processing libraries: **Pandas**, **NumPy**, and **Polars**.

The goal is to understand which library is most efficient at handling large datasets and common data operations.

*Test environment:* 
- **OS**: Windows 11 
- **CPU**: AMD Ryzen 7 5800H with Radeon Graphics
- **RAM**: 16 GB
- **Python Version**: 3.13.5
- **Dedicated GPU**: NVIDIA GeForce RTX 3050

---

*Author:* Sujal Rijal  
*Date:* July 2025


## Environment Setup

Below are the versions of libraries and Python used for testing.


In [3]:
import sys
import pandas as pd
import numpy as np
import polars as pl

print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Polars version: {pl.__version__}")


Python version: 3.13.5 (tags/v3.13.5:6cb20a2, Jun 11 2025, 16:15:46) [MSC v.1943 64 bit (AMD64)]
Pandas version: 2.3.0
NumPy version: 2.3.1
Polars version: 1.31.0


In [None]:
# Test Data


# large data
def generate_test_data(rows=1_000_000):
    np.random.seed(42)
    data = {
        'id': np.arange(rows),
        'category': np.random.choice(['A', 'B', 'C', 'D'], rows),
        'value': np.random.randn(rows),
        'flag': np.random.choice([True, False], rows)
    }
    return data

data = generate_test_data()


# Small sample data
data_example = {
    'category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
    'value': [10, -5, 20, 0, 15, -3, 8]
}



## **EXECUTION TIME**

In [None]:

import time

def pandas_test(data):
    df = pd.DataFrame(data)

    filtered = df[df['value'] > 0]

    grouped = df.groupby('category')['value'].agg(['sum', 'mean', 'count'])

    sorted_df = df.sort_values(['category', 'value'])
    return filtered, grouped, sorted_df


def polars_test(data):
    df = pl.DataFrame(data)
    filtered = df.filter(pl.col('value') > 0)
    grouped = df.group_by('category').agg([
        pl.col('value').sum().alias('sum'),
        pl.col('value').mean().alias('mean'),
        pl.col('value').count().alias('count')
    ])
    sorted_df = df.sort(['category', 'value'])
    return filtered, grouped, sorted_df


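def numpy_test(data):
    # Convert the columns to arrays (the lists in the small sample become arrays here)
    values = np.array(data['value'])
    categories = np.array(data['category'])

    # Filter: keep positive values
    filtered = values[values > 0]

    # Group-by: one boolean mask per category, aggregated in a Python loop
    unique_cats = np.unique(categories)
    sums = {}
    means = {}
    counts = {}
    for cat in unique_cats:
        cat_mask = categories == cat
        sums[cat] = values[cat_mask].sum()
        means[cat] = values[cat_mask].mean()
        counts[cat] = np.sum(cat_mask)
    # Note: unlike pandas_test and polars_test, no sort is performed here
    return filtered, sums, means, counts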

print("(Small Data)")

start = time.time()
pandas_res = pandas_test(data_example)
end = time.time()
print(f"Pandas test took: {end - start:.4f} seconds")

start = time.time()
numpy_res = numpy_test(data_example)
end = time.time()
print(f"NumPy test took: {end - start:.4f} seconds")

start = time.time()
polars_res = polars_test(data_example)
end = time.time()
print(f"Polars test took: {end - start:.4f} seconds")

print("\n(Large Data)")
start = time.time()
pandas_res = pandas_test(data)
end = time.time()
print(f"Pandas test took: {end - start:.4f} seconds")

start = time.time()
numpy_res = numpy_test(data)
end = time.time()
print(f"NumPy test took: {end - start:.4f} seconds")

start = time.time()
polars_res = polars_test(data)
end = time.time()
print(f"Polars test took: {end - start:.4f} seconds")

(Small Data)
Pandas test took: 0.0088 seconds
NumPy test took: 0.0005 seconds
Polars test took: 0.0052 seconds

(Large Data)
Pandas test took: 0.8853 seconds
NumPy test took: 0.3162 seconds
Polars test took: 0.5737 seconds


## Analysis of Results

---

### **Small Data (7 rows):**
- **NumPy: 0.0005 seconds** – Fastest
- **Polars: 0.0052 seconds** – Middle
- **Pandas: 0.0088 seconds** – Slowest

---

### **Large Data (1,000,000 rows):**
- **NumPy: 0.3162 seconds** – Still the fastest
- **Polars: 0.5737 seconds** – Good performance with DataFrame functionality
- **Pandas: 0.8853 seconds** – Slowest again, but still performs reasonably well
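
Each figure above comes from a single run timed with `time.time()`, so the small-data numbers in particular can be swayed by one-off costs and background noise. A minimal sketch of a repeat-and-take-the-best timer follows. The helper name `best_of`, the stand-in workload, and the default of five repeats are assumptions for illustration and were not part of the original tests.

```python
import time

def best_of(func, *args, repeats=5):
    """Run func(*args) `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock intended for benchmarking
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload so the sketch runs on its own; in the notebook this
# would be pandas_test, polars_test or numpy_test.
def sample_workload(n):
    return sum(i * i for i in range(n))

elapsed = best_of(sample_workload, 100_000)
print(f"Best of 5: {elapsed:.4f} seconds")
```

Taking the minimum rather than the mean is one common convention, since it is the run least disturbed by other activity on the machine.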

---

### **Key Observations:**

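1. **NumPy is fastest in both runs**
   - Pure numerical operations on arrays
   - No DataFrame overhead
   - Backed by highly optimized C implementations
   - `numpy_test` filters and aggregates but does not sort, so it does less work than the Pandas and Polars tests

2. **Polars vs Pandas performance gap**
   - Polars takes ~35% less time than Pandas on the large dataset (0.5737 s vs 0.8853 s)
   - This is likely due to Polars' efficient, multi-threaded design

3. **Scaling behavior (small run → large run)**
   - **NumPy**: ~632x slower (0.0005 → 0.3162)
   - **Polars**: ~110x slower (0.0052 → 0.5737)
   - **Pandas**: ~100x slower (0.0088 → 0.8853)
   - The small-data timings are dominated by fixed per-call overhead, so these ratios say more about start-up cost than about per-row scaling

4. **Performance compared with CPU usage trends**
   - NumPy: Fastest and lowest CPU usage
   - Polars: Higher CPU usage and faster than Pandas
   - Pandas: Slowest, with moderate CPU usage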
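
The percentages and ratios quoted in these observations can be recomputed from the recorded timings. The snippet below is plain arithmetic on the numbers printed earlier. The dictionary names are just labels chosen for this sketch.

```python
# Timings in seconds, copied from the output above
small = {"Pandas": 0.0088, "NumPy": 0.0005, "Polars": 0.0052}
large = {"Pandas": 0.8853, "NumPy": 0.3162, "Polars": 0.5737}

# Fraction of Pandas' time that Polars saves on the large dataset
time_saved = 1 - large["Polars"] / large["Pandas"]
# How much longer Pandas takes relative to Polars
extra_time = large["Pandas"] / large["Polars"] - 1
print(f"Polars takes {time_saved:.1%} less time than Pandas")
print(f"Pandas takes {extra_time:.1%} more time than Polars")

# Slowdown factor from the small run to the large run
slowdown = {lib: large[lib] / small[lib] for lib in small}
for lib, factor in slowdown.items():
    print(f"{lib}: ~{factor:.0f}x slower")
```

Note that "about 35% less time" and "about 54% more time" describe the same gap from two directions, so it helps to state which baseline a percentage refers to.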

---

### Conclusion:
For pure numerical tasks, **NumPy was the fastest option** in these tests. But if you're working with DataFrames, **Polars** was noticeably faster than **Pandas** here, which makes it a strong modern alternative.
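
One caveat when reading this conclusion: `numpy_test` does not include a sort, while the Pandas and Polars tests both sort by `category` and then `value`. A rough NumPy counterpart could use `np.lexsort`, as sketched below on the small sample data. The function name `numpy_sort` is hypothetical, and this step was not part of the timed runs.

```python
import numpy as np

def numpy_sort(data):
    """Sort rows by category, then by value, mirroring sort_values(['category', 'value'])."""
    categories = np.array(data['category'])
    values = np.array(data['value'])
    # np.lexsort treats the LAST key in the tuple as the primary sort key
    order = np.lexsort((values, categories))
    return categories[order], values[order]

data_example = {
    'category': ['A', 'B', 'A', 'C', 'B', 'C', 'A'],
    'value': [10, -5, 20, 0, 15, -3, 8]
}

sorted_categories, sorted_values = numpy_sort(data_example)
print(sorted_categories)
print(sorted_values)
```

Adding a step like this to `numpy_test` would make the three tests closer to like-for-like, at the cost of a longer NumPy run time.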


## **Memory Usage Check**

In [None]:
from memory_profiler import memory_usage

def test_memory(func, data, lib_name):
    
    mem_usage = memory_usage((func, (data,)), interval=0.1)
    baseline = mem_usage[0]  
    peak = max(mem_usage)    
    memory_used = peak - baseline
    print(f"Memory used by {lib_name}: {memory_used:.4f} MiB")

print("(Small Data)")
test_memory(pandas_test, data_example, "Pandas")
test_memory(polars_test, data_example, "Polars")
test_memory(numpy_test, data_example, "NumPy")

print("\n(Large Data)")
test_memory(pandas_test, data, "Pandas")
test_memory(polars_test, data, "Polars")
test_memory(numpy_test, data, "NumPy")

(Small Data)
Memory used by Pandas: 0.0000 MiB
Memory used by Polars: 0.0000 MiB
Memory used by NumPy: 0.0039 MiB

(Large Data)
Memory used by Pandas: 97.0703 MiB
Memory used by Polars: 77.3633 MiB
Memory used by NumPy: 19.0781 MiB


I was honestly surprised by these results, especially since I haven't used Polars before. Now I am really eager to dive into it and explore its capabilities.

**Small Data (7 rows):**
- **Pandas & Polars: 0.0000 MiB** - The dataset is so tiny that any memory allocation is below what the profiler can detect (sampling every 0.1 seconds can miss quick allocations)
- **NumPy: 0.0039 MiB** - Shows a small measurable increase, likely from array conversions and dictionary operations
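
Because `memory_usage` samples the process every 0.1 seconds, a function that finishes in a few milliseconds can start and end between two samples. One possible cross-check is Python's built-in `tracemalloc`, which records the peak of allocations made through Python's allocator instead of sampling. This is only a sketch: the helper `peak_traced_mib` and the stand-in workload are hypothetical, and memory allocated by native code outside Python's allocator (which may include much of what Polars does) would not show up.

```python
import tracemalloc

def peak_traced_mib(func, *args):
    """Return the peak memory traced by tracemalloc while func(*args) runs, in MiB."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)

# Stand-in workload: briefly allocate about 10 MiB
def sample_workload(n_bytes):
    buffer = bytearray(n_bytes)
    return len(buffer)

peak = peak_traced_mib(sample_workload, 10 * 1024 * 1024)
print(f"Peak traced memory: {peak:.2f} MiB")
```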

**Large Data (1,000,000 rows):**
- **Pandas: 97.0703 MiB** - Highest memory usage, which is expected because:
  - Filtering and sorting each produce a new DataFrame
  - Intermediate objects are created during the operations
  - DataFrame structures carry extra overhead on top of the raw arrays

- **Polars: 77.3633 MiB** - Middle ground, which makes sense because:
  - More memory-efficient than Pandas in this test
  - Stores data in a columnar, Arrow-based format
  - The test uses the eager API, so lazy evaluation does not come into play here, and full filtered and sorted copies are still returned

- **NumPy: 19.0781 MiB** - Lowest memory usage, which is plausible because:
  - Works directly with arrays (more memory-efficient than DataFrames)
  - No overhead from DataFrame structures
  - Simple dictionary storage for grouped results
  - No sorted copy is created, since `numpy_test` does not sort
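
As a rough sanity check on these figures, the size of the raw input columns can be computed from the array dtypes. The sketch below rebuilds the same columns as `generate_test_data` and sums their `nbytes`. The explicit `dtype=np.int64` on `id` is an assumption made so the result does not depend on the platform's default integer size.

```python
import numpy as np

rows = 1_000_000
np.random.seed(42)
columns = {
    'id': np.arange(rows, dtype=np.int64),                      # 8 bytes per row
    'category': np.random.choice(['A', 'B', 'C', 'D'], rows),   # '<U1' strings: 4 bytes per row
    'value': np.random.randn(rows),                             # float64: 8 bytes per row
    'flag': np.random.choice([True, False], rows),              # bool: 1 byte per row
}

for name, array in columns.items():
    print(f"{name:<8} {str(array.dtype):<8} {array.nbytes / 2**20:6.2f} MiB")

total_mib = sum(array.nbytes for array in columns.values()) / 2**20
print(f"total    {total_mib:.2f} MiB")
```

The raw columns come to roughly 20 MiB, so every full copy of the dataset made while filtering or sorting adds memory on that order. That is at least consistent with the DataFrame libraries peaking several times higher than the NumPy test.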

So the memory usage ranking here is NumPy < Polars < Pandas, exactly what I'd expect, but seeing Polars use about 20% less memory than Pandas was a pleasant surprise. It shows why NumPy is great for pure numerical tasks, while Polars is becoming an exciting alternative to Pandas for larger, more complex data handling.



## **CPU Usage Check**

In [None]:
import psutil
import time
import threading

def test_cpu(func, data, lib_name):
    cpu_percentages = []
    monitoring = [True]
    
    def monitor_cpu():
        
        psutil.cpu_percent(interval=None)  
        while monitoring[0]:
            cpu = psutil.cpu_percent(interval=0.1)
            if cpu > 0:  
                cpu_percentages.append(cpu)
    

    monitor_thread = threading.Thread(target=monitor_cpu)
    monitor_thread.daemon = True  
    monitor_thread.start()
    
    
    time.sleep(0.1)
    
    
    start_time = time.time()
    func(data)
    end_time = time.time()
    
    
    monitoring[0] = False
    monitor_thread.join(timeout=1)  
    
    
    if cpu_percentages:
        avg_cpu = sum(cpu_percentages) / len(cpu_percentages)
        max_cpu = max(cpu_percentages)
    else:
        avg_cpu = 0
        max_cpu = 0
    
    elapsed = end_time - start_time
    
    print(f"CPU usage by {lib_name}: Avg {avg_cpu:.2f}%, Max {max_cpu:.2f}% during {elapsed:.4f} seconds")

# Example usage:
print("(Small Data)")
test_cpu(pandas_test, data_example, "Pandas")
test_cpu(polars_test, data_example, "Polars")
test_cpu(numpy_test, data_example, "NumPy")

print("\n(Large Data)")
test_cpu(pandas_test, data, "Pandas")
test_cpu(polars_test, data, "Polars")
test_cpu(numpy_test, data, "NumPy")

(Small Data)
CPU usage by Pandas: Avg 22.60%, Max 33.00% during 0.0055 seconds
CPU usage by Polars: Avg 28.00%, Max 51.60% during 0.0019 seconds
CPU usage by NumPy: Avg 12.80%, Max 12.80% during 0.0006 seconds

(Large Data)
CPU usage by Pandas: Avg 19.93%, Max 45.50% during 1.0512 seconds
CPU usage by Polars: Avg 29.18%, Max 69.80% during 0.5457 seconds
CPU usage by NumPy: Avg 13.44%, Max 16.50% during 0.3126 seconds


## Result Analysis

**Small Data Results:**
- **Pandas**: Moderate CPU usage (22.60% avg, 33% max) with the slowest execution (0.0055s)
- **Polars**: Higher CPU usage (28% avg, 51.60% max) with faster execution than Pandas (0.0019s)
- **NumPy**: Lowest CPU usage (12.80% avg/max) and the fastest execution (0.0006s)

These runs finish well within a single 0.1-second sampling interval, and `psutil.cpu_percent` measures system-wide load, so the small-data percentages should be read with caution.

**Large Data Results:**
- **Pandas**: Moderate CPU usage (19.93% avg, 45.50% max) but slowest execution (1.0512s)
- **Polars**: Highest CPU usage (29.18% avg, 69.80% max) with faster execution than Pandas (0.5457s)
- **NumPy**: Lowest CPU usage (13.44% avg, 16.50% max) and the fastest execution (0.3126s)

**Explanation**

1. **Polars shows high CPU usage**
   - It's aggressively optimized and uses multi-threading
   - Higher CPU utilization can mean more work done per unit time
   - The high max CPU (69.80%) suggests it is spreading work across the available cores

2. **NumPy has low CPU usage**
   - It's doing simpler operations (arrays vs DataFrames)
   - Less overhead from complex data structures
   - It skips the sort step, so it has less work to do overall

3. **Pandas is in the middle**
   - It's single-threaded for most operations
   - Has more overhead than NumPy but less optimization than Polars

The key insight: **Polars trades higher CPU usage for a shorter execution time than Pandas**, which is a reasonable trade-off for a performance-oriented library. The results suggest that Polars is using the available cores to deliver results faster.
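
`psutil.cpu_percent` reports system-wide CPU load, so the percentages above include whatever else the machine was doing at the time. A complementary, process-level view is to compare CPU time with wall-clock time: a ratio well above 1 suggests work spread across several cores. The helper below is a hypothetical sketch using only the standard library, not the method used for the results above.

```python
import time

def cpu_to_wall_ratio(func, *args):
    """Return (wall_seconds, cpu_seconds, cpu/wall ratio) for one call of func(*args)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process, summed over its threads
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else 0.0
    return wall, cpu, ratio

# Stand-in single-threaded workload; in the notebook this would be one of the test functions
def sample_workload(n):
    return sum(i * i for i in range(n))

wall, cpu, ratio = cpu_to_wall_ratio(sample_workload, 2_000_000)
print(f"wall {wall:.4f}s, cpu {cpu:.4f}s, ratio {ratio:.2f}")
```

For very short runs the CPU clock may be too coarse to give a meaningful ratio, so this kind of check is better suited to the large dataset.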


### **My Experience Using Pandas and NumPy**

In my data analytics projects, I have frequently worked with **Pandas** for tasks like **data cleaning**, **analysis**, and creating **summaries** from large datasets.

I mostly used **NumPy** to work with **arrays**, perform **fast numerical operations**, and improve the efficiency of the **data-heavy parts** of my code.

One of my projects was a **GitHub-based data analytics dashboard**, where I examined **product data**, displayed **trends**, and developed a **Django REST API** to provide **insights**.


## **Conclusion**

From the tests performed:

- **Polars** ran faster and used less memory than Pandas on the large dataset, at the cost of higher CPU usage.
- **Pandas** is highly versatile and well-supported but was the slowest and most memory-hungry option on large data.
- **NumPy** was the fastest and lightest in every test, but it lacks high-level data manipulation features and its test did not include the sort step.

These insights can help in choosing the right library depending on project requirements.
