# Lab 1.2.5: Profiling Exercise

**Module:** 1.2 - Python for AI/ML  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Profile Python code using cProfile and line_profiler
- [ ] Identify performance bottlenecks
- [ ] Optimize code using vectorization
- [ ] Achieve 10x+ speedup on real code

---

## 📚 Prerequisites

- Completed: Labs 1.2.1-1.2.4
- Knowledge of: NumPy, broadcasting, basic algorithms

### Required Packages
- Python 3.9+
- NumPy >= 1.21
- cProfile (stdlib)
- tracemalloc (stdlib) or psutil (optional, for memory tracking)

### Optional Packages
- line_profiler (for line-by-line profiling)
  - Note: On DGX Spark (ARM64), use the version from the NGC container or install it via conda

---

## 🌍 Real-World Context

**Why profile code?**

"Premature optimization is the root of all evil" - Donald Knuth

But *informed* optimization is essential:
- Training a model in 2 hours vs 2 days
- Serving 1000 req/s vs 10 req/s
- Fitting a model on a laptop vs needing a cluster

**The profiling workflow:**
1. Write correct code first
2. Measure to find bottlenecks
3. Optimize only the slow parts
4. Verify correctness after optimization
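
The four steps above can be sketched on a toy problem. This is a minimal illustration, not part of the lab's utilities: `row_sums_slow`, `row_sums_fast`, and `time_call` are hypothetical names introduced here only for the example.

```python
import time

import numpy as np


def row_sums_slow(matrix):
    """Step 1: a correct but loop-heavy reference implementation."""
    return np.array([sum(float(x) for x in row) for row in matrix])


def row_sums_fast(matrix):
    """Step 3: a vectorized replacement for the slow part."""
    return matrix.sum(axis=1)


def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


rng = np.random.default_rng(0)
data = rng.standard_normal((200, 200))

# Step 2: measure both versions instead of guessing
slow_result, slow_time = time_call(row_sums_slow, data)
fast_result, fast_time = time_call(row_sums_fast, data)

# Step 4: verify that the optimized version still gives the same answer
assert np.allclose(slow_result, fast_result)

print(f"slow: {slow_time * 1e3:.2f} ms, fast: {fast_time * 1e3:.2f} ms")
```

A single timed call is noisy; the rest of this lab uses repeated runs for that reason.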

---

## 🧒 ELI5: What is Profiling?

> **Imagine you're a detective solving "The Case of the Slow Code"...** 🔍
>
> Your program takes 10 minutes to run. Where's the time going?
>
> **Without profiling:** You guess and optimize random things. Maybe it helps, maybe not.
>
> **With profiling:** You get a detailed report:
> - Function A: 0.1 seconds (1%)
> - Function B: 9.5 seconds (95%)  ← HERE'S YOUR CULPRIT!
> - Function C: 0.4 seconds (4%)
>
> Now you know exactly where to focus!

---

In [None]:
# ============================================================
# Environment Setup and Dependency Checks
# ============================================================
import sys
from pathlib import Path

# Determine the notebook's directory for reliable path resolution
try:
    notebook_dir = Path(__vsc_ipynb_file__).parent  # VS Code
except NameError:
    notebook_dir = Path.cwd()  # Fallback

# Add scripts directory to path (robust method)
scripts_dir = (notebook_dir / '../scripts').resolve()
if scripts_dir.exists() and str(scripts_dir) not in sys.path:
    sys.path.insert(0, str(scripts_dir))
elif not scripts_dir.exists():
    scripts_dir = Path('../scripts').resolve()
    if scripts_dir.exists() and str(scripts_dir) not in sys.path:
        sys.path.insert(0, str(scripts_dir))

import numpy as np
import time
import cProfile
import pstats
import io
from functools import wraps

print(f"Python version: {sys.version.split()[0]}")
print(f"NumPy version: {np.__version__}")

# Import our profiling utilities
# Note: memory_tracker requires either tracemalloc (stdlib, Python 3.4+) or psutil (pip install psutil)
# If neither is available, memory tracking will show a warning but won't fail
from profiling_utils import Timer, timeit, compare_implementations, memory_tracker

# Check memory tracking capability
try:
    import tracemalloc
    print("Memory tracking: tracemalloc available (stdlib)")
except ImportError:
    try:
        import psutil
        print("Memory tracking: psutil available")
    except ImportError:
        print("⚠️ Memory tracking limited - install psutil for full support: pip install psutil")

print(f"\n{'='*50}")
print("Welcome to the Profiling Exercise! 🔬")
print(f"{'='*50}")

---

## Part 1: Basic Timing

Let's start with simple timing before moving to profiling.
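
The `Timer` helper used in the next cells comes from `profiling_utils`, whose source is not shown in this notebook. As a rough idea of how such a timing context manager might work, here is a minimal sketch built on `time.perf_counter`. The class name `SimpleTimer` and its `elapsed` attribute are assumptions made for this example and may not match the real helper.

```python
import time


class SimpleTimer:
    """Hypothetical stand-in for a timing context manager."""

    def __init__(self, label="block"):
        self.label = label
        self.elapsed = None  # seconds, filled in when the block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.label}: {self.elapsed * 1000:.2f} ms")
        return False  # never swallow exceptions raised inside the block


with SimpleTimer("sum of squares") as t:
    total = sum(i * i for i in range(100_000))
```

`time.perf_counter` is preferred over `time.time` for short intervals because it is monotonic and uses the highest-resolution clock available.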

In [None]:
# The slow function we'll optimize
def compute_pairwise_distances_slow(points):
    """
    Compute Euclidean distance between all pairs of points.
    
    This is intentionally slow to demonstrate profiling!
    
    Args:
        points: Array of shape (n_points, n_dims)
    
    Returns:
        Distance matrix of shape (n_points, n_points)
    """
    n = len(points)
    distances = np.zeros((n, n))
    
    for i in range(n):
        for j in range(n):
            # Compute Euclidean distance
            diff = points[i] - points[j]
            squared_diff = diff ** 2
            sum_squared = np.sum(squared_diff)
            distances[i, j] = np.sqrt(sum_squared)
    
    return distances

# Test with small data
np.random.seed(42)
test_points = np.random.randn(100, 64).astype(np.float32)

print(f"Test data: {test_points.shape[0]} points, {test_points.shape[1]} dimensions")

In [None]:
# Time the slow version
with Timer("Slow pairwise distances (100 points)"):
    distances = compute_pairwise_distances_slow(test_points)

print(f"Result shape: {distances.shape}")
print(f"Sample distances: {distances[0, :5].round(2)}")

---

## Part 2: Profiling with cProfile

Let's see WHERE the time is being spent.

In [None]:
# Profile the slow function
def profile_function(func, *args, **kwargs):
    """Profile a function and return statistics."""
    profiler = cProfile.Profile()
    profiler.enable()
    
    result = func(*args, **kwargs)
    
    profiler.disable()
    
    # Format output
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats('cumulative')
    stats.print_stats(15)  # Top 15 functions
    
    print(stream.getvalue())
    return result

print("Profiling the slow function...")
print("="*70)
_ = profile_function(compute_pairwise_distances_slow, test_points)

### 🔍 Reading the Profile Output

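| Column | Meaning |
|--------|---------|
| ncalls | Number of times the function was called |
| tottime | Total time spent in this function itself (excluding sub-calls) |
| percall (first) | tottime / ncalls |
| cumtime | Total time spent in this function including sub-calls |
| percall (second) | cumtime / number of primitive (non-recursive) calls |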

**What to look for:**
- Functions with high `tottime` are computation bottlenecks
- Functions with high `cumtime` but low `tottime` call slow functions
- Functions called many times (high `ncalls`) usually sit inside loops and are candidates for vectorization
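
The same columns can also be read programmatically. The sketch below profiles two toy functions (`square` and `sum_of_squares`, both made up for this example) and pulls their numbers out of the `stats` dictionary on `pstats.Stats`, which maps `(filename, line, function_name)` to `(primitive_calls, total_calls, tottime, cumtime, callers)`.

```python
import cProfile
import pstats


def square(x):
    return x * x


def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += square(i)  # called n times: shows up as a high ncalls
    return total


profiler = cProfile.Profile()
profiler.enable()
sum_of_squares(1000)
profiler.disable()

stats = pstats.Stats(profiler)

# Index the raw entries by function name for easy lookup
by_name = {key[2]: value for key, value in stats.stats.items()}

_, ncalls_square, tottime_square, cumtime_square, _ = by_name["square"]
_, ncalls_outer, tottime_outer, cumtime_outer, _ = by_name["sum_of_squares"]

print(f"square:         ncalls={ncalls_square}, tottime={tottime_square:.6f}s")
print(f"sum_of_squares: ncalls={ncalls_outer}, "
      f"tottime={tottime_outer:.6f}s, cumtime={cumtime_outer:.6f}s")
```

Here `sum_of_squares` has a `cumtime` that includes every call to `square`, while its `tottime` counts only the loop itself. Sorting with `stats.sort_stats('tottime')` instead of `'cumulative'` puts the functions that do the actual work at the top.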

---

## Part 3: Optimizing Step by Step

In [None]:
# Optimization 1: Reduce function call overhead
# Problem: np.sum() called 10,000 times!

def compute_pairwise_distances_v2(points):
    """Version 2: Inline the sum operation."""
    n = len(points)
    distances = np.zeros((n, n))
    
    for i in range(n):
        for j in range(n):
            diff = points[i] - points[j]
            # Use @ for dot product instead of sum of squares
            distances[i, j] = np.sqrt(diff @ diff)
    
    return distances

# Compare
results = compare_implementations(
    [compute_pairwise_distances_slow, compute_pairwise_distances_v2],
    ['V1: Original', 'V2: Inline sum'],
    args=(test_points,),
    n_runs=3
)

In [None]:
# Optimization 2: Vectorize inner loop
# Problem: Still iterating 10,000 times

def compute_pairwise_distances_v3(points):
    """Version 3: Vectorize inner loop."""
    n = len(points)
    distances = np.zeros((n, n))
    
    for i in range(n):
        # Compute all distances from point i at once
        diff = points - points[i]  # Broadcasting: (n, d) - (d,) = (n, d)
        distances[i] = np.sqrt(np.sum(diff ** 2, axis=1))
    
    return distances

# Compare
results = compare_implementations(
    [compute_pairwise_distances_slow, compute_pairwise_distances_v2, compute_pairwise_distances_v3],
    ['V1: Original', 'V2: Inline sum', 'V3: Vectorize inner'],
    args=(test_points,),
    n_runs=3
)

In [None]:
# Optimization 3: Full vectorization (no loops!)

def compute_pairwise_distances_v4(points):
    """Version 4: Fully vectorized using broadcasting."""
    # points: (n, d)
    # points[:, np.newaxis, :]: (n, 1, d)
    # points[np.newaxis, :, :]: (1, n, d)
    # Difference: (n, n, d)
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

# Compare all versions
results = compare_implementations(
    [
        compute_pairwise_distances_slow,
        compute_pairwise_distances_v2,
        compute_pairwise_distances_v3,
        compute_pairwise_distances_v4
    ],
    ['V1: Original', 'V2: Inline', 'V3: Inner vec', 'V4: Full vec'],
    args=(test_points,),
    n_runs=3
)

In [None]:
# Optimization 4: Even smarter math!
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a·b

def compute_pairwise_distances_v5(points):
    """Version 5: Use the expansion trick for even better performance."""
    # ||a||^2 for all points
    sq_norms = np.sum(points ** 2, axis=1)
    
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a·b
    # sq_norms[:, np.newaxis] broadcasts to (n, 1)
    # sq_norms[np.newaxis, :] broadcasts to (1, n)
    # points @ points.T gives all dot products (n, n)
    sq_distances = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * (points @ points.T)
    
    # Handle numerical errors (small negative values)
    sq_distances = np.maximum(sq_distances, 0)
    
    return np.sqrt(sq_distances)

# Final comparison!
print("\n" + "="*70)
print("FINAL COMPARISON - All optimization levels")
print("="*70 + "\n")

results = compare_implementations(
    [
        compute_pairwise_distances_slow,
        compute_pairwise_distances_v3,
        compute_pairwise_distances_v4,
        compute_pairwise_distances_v5
    ],
    ['V1: Nested loops', 'V3: Inner vectorized', 'V4: Full broadcasting', 'V5: Math trick'],
    args=(test_points,),
    n_runs=5
)

### 🎉 Amazing!

We achieved massive speedups through systematic optimization:
1. Profiled to find bottlenecks
2. Reduced function call overhead
3. Vectorized inner loop
4. Fully vectorized with broadcasting
5. Used smarter mathematics
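
Step 4 of the profiling workflow (verify correctness) applies here too, because the expansion trick trades exactness for speed: subtracting two nearly equal numbers loses precision, most visibly on the diagonal where the true distance is zero. A minimal self-contained check, with `pairwise_direct` and `pairwise_expanded` as illustrative stand-ins for the broadcasting and math-trick versions, might look like this:

```python
import numpy as np


def pairwise_direct(points):
    """Broadcasting version: builds the full (n, n, d) difference array."""
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))


def pairwise_expanded(points):
    """Expansion version: ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a·b."""
    sq_norms = np.sum(points ** 2, axis=1)
    sq = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * (points @ points.T)
    # Round-off can push tiny values below zero; clamp before the sqrt
    return np.sqrt(np.maximum(sq, 0))


rng = np.random.default_rng(42)
points = rng.standard_normal((50, 8))  # float64 for this check

direct = pairwise_direct(points)
expanded = pairwise_expanded(points)

max_abs_error = np.max(np.abs(direct - expanded))
print(f"max absolute difference: {max_abs_error:.2e}")

# Compare with an absolute tolerance: near-zero distances cannot
# satisfy a purely relative tolerance
assert np.allclose(direct, expanded, atol=1e-6)
```

With `float32` inputs, as used in this lab, the differences are larger, so the tolerance would need to be loosened accordingly.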

---

## Part 4: Scaling Test

Let's see how the optimized version scales to larger data.

In [None]:
# Scale test with larger data
print("Scaling test: How do versions perform with more data?\n")

sizes = [100, 500, 1000, 2000]

for n in sizes:
    points = np.random.randn(n, 64).astype(np.float32)
    
    # Only test fast versions for large n
    if n <= 500:
        v1_result = timeit(compute_pairwise_distances_slow, args=(points,), n_runs=1, warmup=0)
        v1_str = f"{v1_result.mean_time*1000:.1f} ms"
    else:
        v1_str = "(too slow)"
    
    v4_result = timeit(compute_pairwise_distances_v4, args=(points,), n_runs=3)
    v5_result = timeit(compute_pairwise_distances_v5, args=(points,), n_runs=3)
    
    print(f"n={n:4d}: V1={v1_str:>12s}  V4={v4_result.mean_time*1000:6.1f} ms  V5={v5_result.mean_time*1000:6.1f} ms")

---

## Part 5: Memory Profiling

Even on DGX Spark with 128GB of unified memory, memory matters: intermediate arrays can grow much faster than the inputs.
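
The `memory_tracker` helper used below comes from `profiling_utils` and its source is not shown here. As an illustration of the underlying idea, the following sketch measures peak traced allocations with the standard-library `tracemalloc` module, which also sees NumPy array buffers. The function name `peak_memory_bytes` is an assumption made for this example, not the lab's API.

```python
import tracemalloc

import numpy as np


def peak_memory_bytes(func, *args):
    """Run func(*args) and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def make_matrix(n):
    """Allocate an (n, n) float32 matrix, like a distance-matrix output."""
    return np.zeros((n, n), dtype=np.float32)


result, peak = peak_memory_bytes(make_matrix, 1000)
print(f"array size: {result.nbytes / 1e6:.1f} MB, traced peak: {peak / 1e6:.1f} MB")
```

Note that `tracemalloc` reports requested allocation sizes, not the resident memory of the process, so its numbers can differ from what `psutil` or system monitors show.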

In [None]:
# Memory comparison
n_large = 5000
points_large = np.random.randn(n_large, 64).astype(np.float32)

print(f"Input: {n_large} points × 64 dims = {points_large.nbytes / 1e6:.1f} MB")
print(f"Output: {n_large}×{n_large} matrix = {(n_large * n_large * 4) / 1e6:.1f} MB")
print()

# V4 needs intermediate (n, n, d) array
intermediate_v4 = n_large * n_large * 64 * 4  # float32
print(f"V4 intermediate array: {intermediate_v4 / 1e9:.2f} GB")

# V5 only needs (n, n) arrays
intermediate_v5 = n_large * n_large * 4 * 2  # Two (n,n) arrays
print(f"V5 intermediate arrays: {intermediate_v5 / 1e6:.1f} MB")

print(f"\n💡 V5 uses {intermediate_v4 / intermediate_v5:.0f}x less memory!")

In [None]:
# Test with memory tracking
print("\nMemory usage during computation:")

# V5 (memory efficient)
with memory_tracker("V5 (math trick)"):
    result_v5 = compute_pairwise_distances_v5(points_large)
del result_v5

# V4 - might need to skip if not enough memory
if n_large <= 3000:  # Safe size for V4
    with memory_tracker("V4 (broadcasting)"):
        result_v4 = compute_pairwise_distances_v4(points_large)
    del result_v4
else:
    print("⚠️ Skipping V4 - would use too much memory")

---

## Part 6: Your Turn - Optimize This Function!

### ✋ Exercise: Optimize K-Nearest Neighbors

This slow function finds the k nearest neighbors for each point.

### 📚 NumPy Functions for Efficient Top-K Selection

Before the exercise, let's learn two powerful functions for finding the k smallest/largest elements:

**`np.argpartition(arr, k)`** - Partially sorts array to find top-k indices
- Much faster than full sort: O(n) instead of O(n log n)
- Guarantees the k smallest elements are in positions `[:k]` (but not sorted among themselves)

**`np.take_along_axis(arr, indices, axis)`** - Select elements using indices from another array
- Perfect for gathering values based on argpartition/argsort results

In [None]:
# Demonstration: argpartition vs argsort for top-k selection

# Example array
arr = np.array([50, 10, 30, 80, 20, 60, 40])
k = 3  # Want the 3 smallest elements

print(f"Original array: {arr}")
print(f"Finding the {k} smallest elements...\n")

# Method 1: Full sort (O(n log n))
sorted_indices_full = np.argsort(arr)
top_k_full = arr[sorted_indices_full[:k]]
print(f"Using argsort (full sort):")
print(f"  All sorted indices: {sorted_indices_full}")
print(f"  Top-{k} values: {top_k_full}")

# Method 2: Partial sort (O(n)) - FASTER!
partitioned_indices = np.argpartition(arr, k)
top_k_partial = arr[partitioned_indices[:k]]
print(f"\nUsing argpartition (partial sort):")
print(f"  Partitioned indices: {partitioned_indices}")
print(f"  Top-{k} values (unordered): {top_k_partial}")

# If you need sorted within the k, sort just the small subset
within_k_sorted = np.argsort(top_k_partial)
top_k_sorted = top_k_partial[within_k_sorted]
print(f"  Top-{k} values (sorted): {top_k_sorted}")

# Using take_along_axis for 2D arrays
print("\n--- take_along_axis for 2D arrays ---")
matrix = np.array([[5, 2, 8], [1, 9, 3], [7, 4, 6]])
print(f"Matrix:\n{matrix}")

# Get indices of sorted values per row
sort_idx = np.argsort(matrix, axis=1)
print(f"\nSort indices per row:\n{sort_idx}")

# Use take_along_axis to get sorted values
sorted_matrix = np.take_along_axis(matrix, sort_idx, axis=1)
print(f"\nSorted rows using take_along_axis:\n{sorted_matrix}")

In [None]:
def knn_slow(query_points, reference_points, k):
    """
    Find k nearest neighbors for each query point.
    
    INTENTIONALLY SLOW - optimize this!
    
    Args:
        query_points: (n_queries, dims)
        reference_points: (n_reference, dims)
        k: number of neighbors
    
    Returns:
        indices: (n_queries, k) - indices of k nearest neighbors
        distances: (n_queries, k) - distances to k nearest neighbors
    """
    n_queries = len(query_points)
    n_ref = len(reference_points)
    
    all_indices = []
    all_distances = []
    
    for i in range(n_queries):
        # Compute distance to all reference points
        distances = []
        for j in range(n_ref):
            diff = query_points[i] - reference_points[j]
            dist = np.sqrt(np.sum(diff ** 2))
            distances.append(dist)
        
        # Sort and take top k
        distances = np.array(distances)
        sorted_indices = np.argsort(distances)
        
        all_indices.append(sorted_indices[:k])
        all_distances.append(distances[sorted_indices[:k]])
    
    return np.array(all_indices), np.array(all_distances)

# Test data
np.random.seed(42)
queries = np.random.randn(50, 32).astype(np.float32)
references = np.random.randn(200, 32).astype(np.float32)
k = 5

# Time the slow version
with Timer("KNN slow (50 queries, 200 references)"):
    indices_slow, distances_slow = knn_slow(queries, references, k)

print(f"\nResult shapes: indices={indices_slow.shape}, distances={distances_slow.shape}")

In [None]:
# YOUR OPTIMIZED VERSION HERE
def knn_fast(query_points, reference_points, k):
    """
    Optimized k-nearest neighbors.
    
    TODO: Implement this using vectorization!
    
    Hints:
    1. Use the ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a·b trick
    2. Compute all pairwise distances at once
    3. Use np.argpartition instead of full sort (faster for top-k)
    """
    # TODO: Your implementation here
    pass

# Uncomment when ready to test:
# with Timer("KNN fast"):
#     indices_fast, distances_fast = knn_fast(queries, references, k)
# 
# # Verify correctness
# print(f"Indices match: {np.array_equal(indices_slow, indices_fast)}")
# print(f"Distances match: {np.allclose(distances_slow, distances_fast)}")

<details>
<summary>💡 Solution (click to reveal)</summary>

```python
def knn_fast(query_points, reference_points, k):
    # Compute squared norms
    query_sq = np.sum(query_points ** 2, axis=1, keepdims=True)  # (n_q, 1)
    ref_sq = np.sum(reference_points ** 2, axis=1)  # (n_r,)
    
    # All pairwise squared distances: ||q - r||^2 = ||q||^2 + ||r||^2 - 2*q·r
    # query_sq: (n_q, 1), ref_sq: (n_r,) -> broadcasts to (n_q, n_r)
    # query_points @ reference_points.T: (n_q, n_r)
    sq_distances = query_sq + ref_sq - 2 * (query_points @ reference_points.T)
    sq_distances = np.maximum(sq_distances, 0)  # Numerical stability
    
    # Use argpartition for efficiency (doesn't fully sort)
    # This is O(n) instead of O(n log n) for full sort
    indices = np.argpartition(sq_distances, k, axis=1)[:, :k]
    
    # Get the actual distances for these indices
    row_indices = np.arange(len(query_points))[:, np.newaxis]
    top_k_sq_distances = sq_distances[row_indices, indices]
    
    # Sort within the k (argpartition doesn't sort)
    sorted_within_k = np.argsort(top_k_sq_distances, axis=1)
    final_indices = np.take_along_axis(indices, sorted_within_k, axis=1)
    final_distances = np.sqrt(np.take_along_axis(top_k_sq_distances, sorted_within_k, axis=1))
    
    return final_indices, final_distances
```

</details>

---

## ⚠️ Common Mistakes

### Mistake 1: Optimizing without measuring

In [None]:
# ❌ Wrong: Guessing what's slow
# "I bet this sqrt is slow, let me optimize it"

# ✅ Right: Measure first
# Profile, find the ACTUAL bottleneck, then optimize

print("💡 Always profile before optimizing!")
print("   You might be surprised where time actually goes.")

### Mistake 2: Breaking correctness while optimizing

In [None]:
# ❌ Wrong: Optimize and forget to check results
# def fast_version(x):
#     # Some "clever" optimization that introduces bugs
#     pass

# ✅ Right: Always verify against original
# assert np.allclose(fast_version(x), slow_version(x))

print("💡 Always compare optimized results against the original!")
print("   Speed means nothing if the answer is wrong.")

### Mistake 3: Over-optimizing small functions

In [None]:
# If a function takes 0.01% of total time, optimizing it 10x
# saves only 0.009% - not worth the complexity!

# Amdahl's Law:
# Speedup = 1 / ((1 - P) + P/S)
# P = fraction of time in optimized part
# S = speedup factor

def amdahl_speedup(fraction, speedup_factor):
    """Calculate overall speedup using Amdahl's Law."""
    return 1 / ((1 - fraction) + fraction / speedup_factor)

print("Amdahl's Law examples:")
print(f"  Optimize 90% of code 10x → {amdahl_speedup(0.9, 10):.1f}x overall speedup")
print(f"  Optimize 10% of code 10x → {amdahl_speedup(0.1, 10):.2f}x overall speedup")
print(f"\n💡 Focus on the biggest bottlenecks first!")

---

## 🎉 Checkpoint

You've learned:
- ✅ Timing code with context managers
- ✅ Profiling with cProfile to find bottlenecks
- ✅ Systematic optimization: measure → optimize → verify
- ✅ Vectorization strategies for massive speedups
- ✅ Memory considerations for large computations

---

## 📖 Further Reading

- [Python Profilers Documentation](https://docs.python.org/3/library/profile.html)
- [High Performance Python (Book)](https://www.oreilly.com/library/view/high-performance-python/9781492055020/)
- [NumPy Optimization Tips](https://numpy.org/doc/stable/user/basics.performance.html)

---

## 🧹 Cleanup

In [None]:
import gc

# Clean up large arrays
del test_points, points_large
gc.collect()

print("✅ Memory cleaned up!")
print("\n" + "="*50)
print("🎉 Congratulations! You've completed Module 1.2!")
print("="*50)
print("\nYou now have strong Python skills for AI/ML:")
print("  ✅ NumPy broadcasting and vectorization")
print("  ✅ Data preprocessing with Pandas")
print("  ✅ Publication-quality visualizations")
print("  ✅ Einsum for tensor operations")
print("  ✅ Profiling and optimization")
print("\nNext up: Module 1.3 - CUDA Python & GPU Programming!")