# Performance Optimization with NumPy and Pandas

Performance optimization in **NumPy** and **Pandas** is about making data operations faster and more memory-efficient.

Key strategies:

## **In NumPy**
- **Vectorization**: Use built-in array operations instead of Python loops.
- **Broadcasting**: Apply operations between arrays of different shapes without explicit loops.
- **Preallocation**: Create arrays at their final size (for example with `np.empty` or `np.zeros`) instead of growing them with `np.append`, which copies the whole array on every call.
- **Efficient functions**: Use NumPy's universal functions (`ufuncs`) like `np.add`, `np.sqrt`.
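
As a rough illustration of the broadcasting, preallocation, and ufunc points above, here is a minimal sketch. The array contents and the variable names (`data`, `col_means`, `squares`, `roots`) are made up for the example.

```python
import numpy as np

# Broadcasting: subtract each column's mean from a (3, 2) matrix.
# The (2,) vector of means is stretched across the rows without a loop.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
col_means = data.mean(axis=0)   # shape (2,)
centered = data - col_means     # shape (3, 2)

# Preallocation: reserve the output once, then fill it in place.
n = 5
squares = np.empty(n, dtype=np.int64)
for i in range(n):
    squares[i] = i * i

# Ufunc with an explicit output buffer: the result is written into `roots`
# instead of a freshly allocated array.
roots = np.empty(n, dtype=np.float64)
np.sqrt(squares, out=roots)

print(centered)
print(roots)
```

The loop in the preallocation step is only there to show the fill pattern; in practice `np.arange(n) ** 2` would replace it.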

## **In Pandas**
- **Use vectorized methods**: Avoid `.apply()` with Python functions; use Pandas/NumPy functions instead.
- **Categorical data**: Convert repeated strings to `category` dtype to save memory.
- **Efficient indexing**: Set the lookup column as the index (`set_index`) so repeated label lookups with `.loc` do not have to scan the whole column.
- **Chunking**: Process large files in smaller parts instead of loading all at once.
- **Avoid loops**: Pandas is optimized for column-wise operations.
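
The chunking idea can be sketched as follows. The CSV content is a small hypothetical stand-in held in memory (via `io.StringIO`) so the example runs on its own; with a real file you would pass its path to `pd.read_csv` instead, and `chunksize` would be far larger than 2.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
rows = [("New York", 100), ("Chicago", 200), ("New York", 300),
        ("Chicago", 400), ("Los Angeles", 500)]
csv_text = "city,sales\n" + "\n".join(f"{city},{amount}" for city, amount in rows)

# Read two rows at a time and keep only a small per-chunk summary.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partials.append(chunk.groupby("city")["sales"].sum())

# Combine the per-chunk sums into one total per city.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

Only one chunk and the running summaries are in memory at any time, which is what makes this approach workable for files larger than RAM.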

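On large datasets these techniques often yield speedups of **10x to 100x**; in the timing run below, the vectorized square is roughly 100x faster than the Python loop (about 3.35 s versus 0.03 s).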


In [10]:
import numpy as np
import pandas as pd
import time

In [11]:
# ---------------- NumPy Optimization ----------------
# Example: squaring the integers 0 through 9,999,999 (np.arange starts at 0)

N = 10_000_000
array = np.arange(N)

In [12]:
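# Slow method: Python loop (list comprehension over the array)
# time.perf_counter() is a monotonic, high-resolution clock meant for timing
start_time = time.perf_counter()
result_loop = [x**2 for x in array]
print("Loop time (Python list):", time.perf_counter() - start_time, "seconds")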

Loop time (Python list): 3.3511130809783936 seconds


In [13]:
# Fast method: NumPy vectorized operation
# time.perf_counter() is a monotonic, high-resolution clock meant for timing
start_time = time.perf_counter()
result_vectorized = array**2
print("Vectorized time (NumPy):", time.perf_counter() - start_time, "seconds")

Vectorized time (NumPy): 0.03218555450439453 seconds
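
A single wall-clock measurement like the ones above can be skewed by whatever else the machine is doing. One possible way to get steadier numbers is the standard-library `timeit` module, sketched here on a smaller array (an assumption made so the loop finishes quickly). The timings will differ from machine to machine.

```python
import timeit
import numpy as np

# Smaller than N above so the Python loop finishes quickly.
array = np.arange(100_000, dtype=np.int64)

# Run each variant three times and keep the best (least disturbed) run.
loop_time = min(timeit.repeat(lambda: [x**2 for x in array], number=1, repeat=3))
vec_time = min(timeit.repeat(lambda: array**2, number=1, repeat=3))

# Confirm both approaches compute the same values.
same = np.array_equal(np.array([x**2 for x in array]), array**2)

print(f"Loop:       {loop_time:.6f} s")
print(f"Vectorized: {vec_time:.6f} s")
print(f"Speedup:    {loop_time / vec_time:.1f}x (results equal: {same})")
```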


In [14]:
# ---------------- Pandas Optimization ----------------
# Sample DataFrame
df = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Chicago"] * 100_000,
    "sales": np.random.randint(100, 500, size=300_000)
})

In [15]:
# Convert 'city' to category for memory savings
df["city"] = df["city"].astype("category")

print("\nMemory usage after converting 'city' to category:")
print(df.memory_usage(deep=True))


Memory usage after converting 'city' to category:
Index        128
city      300305
sales    1200000
dtype: int64
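
The output above reports memory only after the conversion. To see the saving itself, a sketch like the following measures the same column both ways. The variable names are illustrative, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Same repeated city names as in the DataFrame above.
cities = pd.Series(["New York", "Los Angeles", "Chicago"] * 100_000)

bytes_as_strings = cities.memory_usage(deep=True)
bytes_as_category = cities.astype("category").memory_usage(deep=True)

print("As strings: ", bytes_as_strings, "bytes")
print("As category:", bytes_as_category, "bytes")
print(f"Reduction:   {bytes_as_strings / bytes_as_category:.1f}x")
```

The saving comes from storing each distinct city name once and keeping a small integer code per row. A column whose values are mostly unique would gain little or nothing from this conversion.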


In [16]:
# Vectorized calculation of sales tax (10%)
df["sales_tax"] = df["sales"] * 0.10
print("\nFirst rows with sales tax:\n", df.head())


First rows with sales tax:
           city  sales  sales_tax
0     New York    232       23.2
1  Los Angeles    135       13.5
2      Chicago    479       47.9
3     New York    119       11.9
4  Los Angeles    455       45.5
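
For comparison, the same sales tax can be computed with `.apply()` and a Python function, which is the pattern the Pandas tips advise against. This sketch uses a small made-up `sales` column and checks that both approaches agree; on a large column the vectorized form is typically much faster because it avoids one Python function call per row.

```python
import numpy as np
import pandas as pd

# Small illustrative column; the values are made up for the example.
df = pd.DataFrame({"sales": np.arange(100, 500)})

# Slow pattern: the lambda is called once per row from Python.
tax_apply = df["sales"].apply(lambda s: s * 0.10)

# Fast pattern: one multiplication over the whole column.
tax_vectorized = df["sales"] * 0.10

print("Results match:", np.allclose(tax_apply, tax_vectorized))
```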


# Real-World Analogy: Cooking in Bulk vs. One at a Time

- **Loop method**: Like making sandwiches one at a time: get the bread, add the fillings, wrap it, then start over for the next one.
- **Vectorized method**: Like setting up an assembly line: lay out all the bread slices, add fillings to all of them, then wrap every sandwich in one pass.

Vectorization and optimized methods help process large amounts of "ingredients" (data) far faster.
