# Performance Optimization in Large Datasets

### What is Performance Optimization in Large Datasets?

Performance optimization for large datasets refers to **techniques and strategies used to efficiently handle, process, and visualize data** that is too big to work with naively. When datasets contain **millions of rows or hundreds of columns**, standard operations like plotting, aggregating, or computing statistics can become **slow or memory-intensive, or can exhaust available RAM and crash the process**.

Optimization involves **reducing computational overhead**, **using memory efficiently**, and **leveraging tools that handle large-scale data** effectively. This is critical in **data analysis, AI/ML, and real-time applications**, where responsiveness and speed are essential.

Key strategies include:

- **Data Sampling:** Working with a representative subset instead of the full dataset.
- **Vectorized Operations:** Using libraries like **NumPy** and **Pandas** to replace Python loops.
- **Efficient Data Types:** Using `category` for string columns with few distinct values, and smaller integer/float types for numeric columns, to save memory.
- **Chunk Processing:** Reading or processing data in small portions instead of loading everything at once.
- **Optimized Visualization:** Plotting only the necessary data points, or using libraries suited to large data such as **Datashader** or **Plotly**.

### Why is this Important?

- Large datasets are common in **finance, e-commerce, IoT, and AI/ML** projects.
- Efficient processing saves **time and system resources**.
- Optimized code improves **scalability**: the same logic keeps working as datasets grow.
- Helps in **real-time analytics** where data is constantly updating.

**Examples**

1. Using Efficient Data Types

In [None]:
import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("data/Superstore.csv", encoding='latin-1')

# Check memory usage (info() prints its own report and returns None)
df.info(memory_usage="deep")

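# Optimize datatypes
# Text columns with few distinct values compress well as categoricals
df["Category"] = df["Category"].astype("category")
df["Sub-Category"] = df["Sub-Category"].astype("category")
# Identifiers such as Order ID typically have many distinct values, so they stay as text
df["Order ID"] = df["Order ID"].astype("string")
# int16 holds -32,768 to 32,767; float32 keeps about 7 significant digits
df["Quantity"] = df["Quantity"].astype("int16")
df["Sales"] = df["Sales"].astype("float32")

# Compare with the report above; info() returns None, so it is not wrapped in print()
df.info(memory_usage="deep")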
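Picking each dtype by hand does not scale to wide tables. The sketch below shows one way to automate the idea. The helper name `shrink_dtypes`, its `max_unique_ratio` threshold, and the synthetic data are all assumptions made for illustration; they are not part of the Superstore example. Float downcasting is left to `pd.to_numeric`, which keeps the original dtype when the values cannot be represented closely enough.

```python
import numpy as np
import pandas as pd


def shrink_dtypes(frame: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of `frame` with smaller dtypes where the values allow it.

    Hypothetical helper: the 0.5 threshold is an arbitrary illustrative choice.
    """
    out = frame.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Smallest signed integer type that holds every value
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            # May stay float64 if float32 would lose too much precision
            out[col] = pd.to_numeric(series, downcast="float")
        elif pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
            # Only convert text columns whose values repeat often
            if series.nunique() / max(len(series), 1) <= max_unique_ratio:
                out[col] = series.astype("category")
    return out


# Synthetic stand-in for a few Superstore-like columns (made-up values)
rng = np.random.default_rng(0)
n = 10_000
demo = pd.DataFrame({
    "Category": rng.choice(["Furniture", "Technology", "Office Supplies"], size=n),
    "Quantity": rng.integers(1, 15, size=n),
    "Sales": rng.uniform(1, 5000, size=n),
})

small = shrink_dtypes(demo)
before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before:,} bytes -> {after:,} bytes")
```

Returning a copy keeps the original frame untouched, which makes the before/after comparison straightforward.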

2. Vectorized Operations instead of Loops

In [None]:
# Slow loop approach
profit_ratio = []
for sale, profit in zip(df["Sales"], df["Profit"]):
    profit_ratio.append(profit / sale)

# Optimized vectorized approach
profit_ratio_vec = df["Profit"] / df["Sales"]
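To see the difference rather than take it on faith, the two approaches can be timed side by side. This is a rough sketch on synthetic data (the column values and the row count are made up), and the timings it prints will vary by machine, so no specific speedup is claimed here.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data; Sales is kept strictly positive to avoid division by zero
rng = np.random.default_rng(1)
n = 200_000
demo = pd.DataFrame({
    "Sales": rng.uniform(1, 1000, size=n),
    "Profit": rng.normal(20, 50, size=n),
})

# Python-level loop: one division per iteration
start = time.perf_counter()
loop_ratio = [profit / sale for sale, profit in zip(demo["Sales"], demo["Profit"])]
loop_seconds = time.perf_counter() - start

# Vectorized: a single operation over the whole column
start = time.perf_counter()
vec_ratio = demo["Profit"] / demo["Sales"]
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.4f}s")
print("same result:", np.allclose(loop_ratio, vec_ratio))
```

For more reliable numbers in a notebook, the `%timeit` magic repeats each statement several times and reports an average.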

3. Processing in Chunks

In [None]:
chunksize = 100000
total_sales = 0

for chunk in pd.read_csv("large_dataset.csv", chunksize=chunksize):
    total_sales += chunk["Sales"].sum()

print("Total Sales:", total_sales)
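A running total is the simplest case. Grouped results can be built the same way: aggregate each chunk, then combine the partial results. The sketch below assumes a CSV with `Category` and `Sales` columns and uses an in-memory buffer with made-up values in place of a real file, so it runs without `large_dataset.csv`.

```python
import io

import numpy as np
import pandas as pd

# Build a small synthetic CSV in memory as a stand-in for a large file
rng = np.random.default_rng(2)
n = 1_000
demo = pd.DataFrame({
    "Category": rng.choice(["Furniture", "Technology", "Office Supplies"], size=n),
    "Sales": rng.uniform(1, 1000, size=n).round(2),
})
csv_buffer = io.StringIO(demo.to_csv(index=False))

# Aggregate each chunk separately; usecols skips columns that are not needed
partials = []
for chunk in pd.read_csv(csv_buffer, usecols=["Category", "Sales"], chunksize=250):
    partials.append(chunk.groupby("Category")["Sales"].sum())

# Combine the per-chunk sums into one total per category
sales_by_category = pd.concat(partials).groupby(level=0).sum()
print(sales_by_category)
```

This works because sums can be combined across chunks. Statistics such as the mean need a sum and a count carried per chunk, and a median cannot be combined this way at all.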

4. Sampling for Fast Visualization

In [None]:
import matplotlib.pyplot as plt

# Plot a reproducible 5% random sample instead of every row
sample_df = df.sample(frac=0.05, random_state=42)

plt.scatter(sample_df["Sales"], sample_df["Profit"], alpha=0.5)
plt.xlabel("Sales")
plt.ylabel("Profit")
plt.title("Sales vs Profit (Sampled Data)")
plt.show()
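A plain random sample can under-represent small groups. If the plot is meant to compare categories, one option is to sample the same fraction within each group. The sketch below uses `groupby(...).sample(...)` (available in pandas 1.1 and later) on synthetic data with deliberately unbalanced, made-up category shares.

```python
import numpy as np
import pandas as pd

# Synthetic data where one category is rare (about 2% of rows)
rng = np.random.default_rng(3)
n = 20_000
demo = pd.DataFrame({
    "Category": rng.choice(
        ["Furniture", "Technology", "Office Supplies"],
        size=n,
        p=[0.02, 0.18, 0.80],
    ),
    "Sales": rng.uniform(1, 1000, size=n),
    "Profit": rng.normal(20, 50, size=n),
})

# Take 5% from every category so each group keeps its share of the sample
stratified = demo.groupby("Category").sample(frac=0.05, random_state=42)

full_share = demo["Category"].value_counts(normalize=True).sort_index()
sample_share = stratified["Category"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full": full_share, "sample": sample_share}).round(3))
```

The resulting `stratified` frame can be passed to `plt.scatter` in the same way as `sample_df`.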

5. Using Aggregation for Large Data

In [None]:
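# observed=True limits the result to categories that actually occur in the data,
# which matters once "Category" has been converted to the category dtype
sales_by_category = df.groupby("Category", observed=True)["Sales"].sum().reset_index()
print(sales_by_category)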
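When several summary statistics are needed, computing them in one `groupby` call avoids grouping the data repeatedly. The sketch below uses pandas named aggregation on a tiny hand-made frame. The output column names (`total_sales`, `mean_profit`, `orders`) and the values are illustrative assumptions.

```python
import pandas as pd

# Tiny hand-made frame so the result is easy to check by eye
demo = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Furniture", "Technology", "Office Supplies"],
    "Sales": [100.0, 250.0, 50.0, 150.0, 20.0],
    "Profit": [10.0, 50.0, -5.0, 30.0, 4.0],
})
demo["Category"] = demo["Category"].astype("category")

# One grouping, three statistics: new_column=(source_column, function)
summary = (
    demo.groupby("Category", observed=True)
    .agg(
        total_sales=("Sales", "sum"),
        mean_profit=("Profit", "mean"),
        orders=("Sales", "count"),
    )
    .reset_index()
)
print(summary)
```

The summary has one row per category, so it stays small enough to plot or export no matter how many rows the source data has.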

### Exercises

Q1. Load a large CSV in chunks and calculate total Profit.

In [None]:
chunksize = 100000
total_profit = 0
for chunk in pd.read_csv("large_dataset.csv", chunksize=chunksize):
    total_profit += chunk["Profit"].sum()
print("Total Profit:", total_profit)

Q2. Convert string columns to `category` and check memory reduction.

In [None]:
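# Measure on a freshly loaded df; columns that are already categorical show no change
before = df.memory_usage(deep=True).sum()
df["Category"] = df["Category"].astype("category")
df["Sub-Category"] = df["Sub-Category"].astype("category")
after = df.memory_usage(deep=True).sum()
print(f"Memory usage: {before:,} -> {after:,} bytes")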

Q3. Sample 10% of data and plot Sales vs Discount.

In [None]:
sample_df = df.sample(frac=0.1, random_state=42)
import matplotlib.pyplot as plt
plt.scatter(sample_df["Sales"], sample_df["Discount"], alpha=0.5)
plt.xlabel("Sales")
plt.ylabel("Discount")
plt.title("Sales vs Discount (Sampled)")
plt.show()

Q4. Use vectorized operations to compute Profit Ratio (Profit/Sales).

In [None]:
profit_ratio = df["Profit"] / df["Sales"]
print(profit_ratio.head())

Q5. Aggregate total Quantity by Region and Category efficiently.

In [None]:
# observed=True skips Region/Category combinations that never occur in the data
agg_data = df.groupby(["Region", "Category"], observed=True)["Quantity"].sum().reset_index()
print(agg_data)

### Summary

Performance optimization keeps **data analysis and visualization fast, efficient, and scalable** as datasets grow. Techniques like **data sampling, vectorized operations, optimized data types, chunk processing, and aggregation** make it practical to work with millions of rows without exhausting memory. This matters in **AI/ML pipelines, real-time analytics, and business intelligence dashboards**, where both **speed and memory efficiency** count. Sampling and aggregation let analysts **explore trends quickly**, while vectorized operations and compact types reduce computational overhead. Together, these techniques let us **work confidently with large-scale datasets**, extract insights faster, and build scalable data-driven solutions.