# 🧪 Big Data Visualization Failure Demo
This notebook shows what happens when you plot every point of a large dataset (10 million rows) directly as a scatter plot, then introduces two alternatives that scale better: random sampling and hexbin aggregation.

## ⚠️ Step 1: Generate a Huge Dataset (10 Million Points)

In [None]:
import pandas as pd
import numpy as np

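# Simulate 10 million points drawn from a standard normal distribution.
# A fixed seed keeps the results reproducible across runs.
rng = np.random.default_rng(0)
n = 10_000_000
df = pd.DataFrame({
    "x": rng.normal(0, 1, size=n),
    "y": rng.normal(0, 1, size=n)
})
df.shape  # (10000000, 2)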
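
A quick back-of-the-envelope check helps separate storage cost from rendering cost. The sketch below is not part of the original notebook: it uses a smaller stand-in frame (`demo`, with `n_demo` rows, both names chosen here for illustration) and assumes both columns are `float64`, which is what NumPy's normal sampler returns.

```python
import numpy as np
import pandas as pd

# Smaller stand-in frame so the check runs quickly (hypothetical names and size).
n_demo = 100_000
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.normal(0, 1, size=n_demo),
    "y": rng.normal(0, 1, size=n_demo),
})

# Two float64 columns at 8 bytes per value, index excluded.
expected_bytes = n_demo * 2 * 8
actual_bytes = int(demo.memory_usage(index=False).sum())
print(f"stand-in frame: {actual_bytes:,} bytes (expected {expected_bytes:,})")

# Same arithmetic scaled up to the notebook's 10 million rows.
full_mb = 10_000_000 * 2 * 8 / 1e6
print(f"full frame: about {full_mb:.0f} MB")
```

At 8 bytes per value, the full frame comes to roughly 160 MB, which fits in memory on most machines. That suggests any trouble in the next step comes from drawing 10 million markers rather than from holding the data.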

## ❌ Step 2: Try Plotting All Points (Expect Lag or Crash)

In [None]:
import matplotlib.pyplot as plt

# WARNING: This may freeze the notebook or crash your kernel
plt.scatter(df["x"], df["y"], alpha=0.1, s=1)
plt.title("10 Million Points (Do Not Do This!)")
plt.show()
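
Before running the cell above on the full dataset, it can help to measure how draw time grows on smaller inputs. The following is a minimal sketch, not part of the original notebook: `time_scatter` is a hypothetical helper, the sizes are arbitrary, and it renders to an off-screen Agg canvas so nothing is displayed. Timings will vary by machine and Matplotlib version.

```python
import time

import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure


def time_scatter(n_points, seed=0):
    """Return the seconds needed to render a scatter of n_points off-screen."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, size=n_points)
    y = rng.normal(0, 1, size=n_points)

    fig = Figure(figsize=(6, 4))
    FigureCanvasAgg(fig)  # attach an off-screen canvas to the figure
    ax = fig.add_subplot(111)
    ax.scatter(x, y, alpha=0.1, s=1)

    start = time.perf_counter()
    fig.canvas.draw()  # force the actual rasterization
    return time.perf_counter() - start


# Deliberately small sizes; the notebook's full run uses 10_000_000.
timings = {n: time_scatter(n) for n in (1_000, 10_000, 100_000)}
for n_points, seconds in timings.items():
    print(f"{n_points:>9,} points: {seconds:.3f} s")
```

If the time grows roughly in proportion to the point count on your machine, extrapolating to 10 million points gives a rough idea of what the full plot would cost.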

## ✅ Step 3: Sample 0.1% of the Data for Plotting

In [None]:
# 0.1% of 10,000,000 rows is 10,000 points; random_state makes the draw repeatable
df_sample = df.sample(frac=0.001, random_state=0)
plt.scatter(df_sample["x"], df_sample["y"], alpha=0.3, s=5)
plt.title("Sampled: 0.1% of 10 Million Points")
plt.show()
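
One way to check that a uniform random sample keeps the overall shape of the data is to compare summary statistics between the sample and the full set. This is a minimal sketch on a stand-in frame (`demo`, 1 million rows, a name and size chosen here for illustration) rather than the notebook's `df`:

```python
import numpy as np
import pandas as pd

# Stand-in data (hypothetical size); the notebook's df has 10_000_000 rows.
rng = np.random.default_rng(42)
n_demo = 1_000_000
demo = pd.DataFrame({
    "x": rng.normal(0, 1, size=n_demo),
    "y": rng.normal(0, 1, size=n_demo),
})

# Same sampling fraction as the notebook: 0.1% of the rows.
demo_sample = demo.sample(frac=0.001, random_state=42)

# Mean and standard deviation per column, full data vs. sample.
full_stats = demo.agg(["mean", "std"])
sample_stats = demo_sample.agg(["mean", "std"])
max_gap = (full_stats - sample_stats).abs().to_numpy().max()

print(full_stats)
print(sample_stats)
print(f"largest absolute difference: {max_gap:.4f}")
```

Close agreement in mean and spread is only a coarse check. Rare features such as outliers or small clusters can still be missed by a 0.1% sample, which is one reason to also look at aggregation.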

## ✅ Step 4: Use Hexbin Aggregation

In [None]:
plt.hexbin(df["x"], df["y"], gridsize=50, cmap='Blues')
plt.colorbar(label='count in bin')
plt.title("Hexbin Aggregation (All Data)")
plt.show()
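
With normally distributed data, most points fall into a few central bins, so a linear color scale can wash out the tails. `hexbin` accepts `bins="log"` to color by log-scaled counts and `mincnt` to hide empty bins. The sketch below runs on smaller stand-in arrays; the size and variable names are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data (hypothetical size); swap in df["x"] and df["y"] for the full set.
rng = np.random.default_rng(0)
n_demo = 200_000
x = rng.normal(0, 1, size=n_demo)
y = rng.normal(0, 1, size=n_demo)

fig, ax = plt.subplots()
# bins="log" colors by log-scaled counts; mincnt=1 leaves empty hexagons blank.
hb = ax.hexbin(x, y, gridsize=50, cmap="Blues", bins="log", mincnt=1)
fig.colorbar(hb, ax=ax, label="count in bin (log scale)")
ax.set_title("Hexbin Aggregation (Log-Scaled Counts)")
```

Call `plt.show()` afterwards, as in the cells above, to display the figure.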

## 💬 Discussion
- Why does plotting everything fail?
- How does sampling preserve the structure?
- What other scalable visual techniques exist?
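
As one possible answer to the last question, a fixed-grid 2D histogram gives a similar result to hexbin: the number of drawn cells depends on the grid, not on the number of points. The sketch below uses `np.histogram2d` and `imshow` on stand-in data; the 200 x 200 grid and the array size are arbitrary choices for illustration, not values from the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data (hypothetical size); swap in df["x"] and df["y"] for the full set.
rng = np.random.default_rng(0)
n_demo = 500_000
x = rng.normal(0, 1, size=n_demo)
y = rng.normal(0, 1, size=n_demo)

# Count points per cell on a fixed 200 x 200 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=200)

fig, ax = plt.subplots()
image = ax.imshow(
    counts.T,  # histogram2d indexes x first; imshow expects rows to follow y
    origin="lower",
    extent=[x_edges[0], x_edges[-1], y_edges[0], y_edges[-1]],
    cmap="Blues",
    aspect="auto",
)
fig.colorbar(image, ax=ax, label="count in bin")
ax.set_title("2D Histogram (200 x 200 Grid)")
```

Whatever the input size, the image drawn here is always 200 x 200 cells, so only the counting step scales with the data.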