# Lab 1.6.3: RAPIDS GPU Acceleration for Classical ML

**Module:** 1.6 - Classical ML Foundations  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand RAPIDS cuML and cuDF architecture
- [ ] Port scikit-learn pipelines to GPU with minimal code changes
- [ ] Benchmark CPU vs GPU performance on large datasets
- [ ] Know when GPU acceleration provides the biggest benefits
- [ ] Handle memory management for GPU DataFrames

---

## 📚 Prerequisites

- Completed: Lab 1.6.1 and 1.6.2
- Knowledge of: scikit-learn API, pandas basics
- **Required**: RAPIDS installed (use NGC container)

---

## 🌍 Real-World Context

**The Data Science Time Problem:**

Data scientists spend most of their time waiting:
- Loading and preprocessing data: **30-40%** of time
- Training models: **20-30%** of time
- Hyperparameter tuning: **20-30%** of time

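**RAPIDS can cut that waiting dramatically.** Representative figures (actual speedups depend on hardware, data shape, and library versions):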

| Operation | CPU (sklearn) | GPU (cuML) | Speedup |
|-----------|--------------|------------|----------|
| Random Forest (1M rows) | 120 sec | 3 sec | **40x** |
| K-Means Clustering | 45 sec | 0.5 sec | **90x** |
| PCA | 30 sec | 0.3 sec | **100x** |
| DataFrame operations | 10 sec | 0.1 sec | **100x** |

**On your DGX Spark:**
- 128GB unified memory = huge datasets in GPU memory
- 6,144 CUDA cores = massive parallelism
- 192 fifth-generation Tensor Cores for accelerated compute
- ARM64/aarch64 architecture (use NGC containers, not pip)
- No separate CPU↔GPU memory copies needed with unified memory (format conversions such as pandas to cuDF still cost time)

---

## 🧒 ELI5: GPU Acceleration for ML

> **Imagine you need to count all the red M&Ms in a giant bowl...**
>
> **CPU approach** (scikit-learn):
> - You have 20 helpers (CPU cores)
> - Each helper picks up M&Ms one at a time
> - They're very smart and can handle complex tasks
> - But counting millions takes forever!
>
> **GPU approach** (RAPIDS cuML):
> - You have 6,144 helpers (CUDA cores)!
> - Each helper is simpler but checks M&Ms simultaneously
> - For simple, repetitive tasks = massively faster
>
> **When does GPU help most?**
> - Large datasets (more M&Ms = more parallelism)
> - Simple operations (counting vs. solving puzzles)
> - Many repetitive calculations (same task, different data)
>
> **In ML terms:** Matrix operations, distance calculations, and aggregations are perfect for GPUs because they do the same math on millions of data points.

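The same idea can be seen on the CPU alone. The sketch below uses NumPy vectorization as a stand-in for GPU parallelism (an analogy, not a GPU benchmark): it counts "red" candies one at a time, then with a single whole-array operation. The array `candies` and the color code `RED` are made up for illustration. Both counts agree; only the second form maps naturally onto thousands of parallel workers.

```python
import numpy as np

RED = 0  # hypothetical color code, for illustration only

rng = np.random.default_rng(42)
candies = rng.integers(0, 6, size=200_000)  # six colors, encoded 0..5

# One-at-a-time approach: inspect each candy in a Python loop.
loop_count = 0
for color in candies:
    if color == RED:
        loop_count += 1

# Data-parallel approach: one comparison applied to the whole array at once.
vector_count = int(np.count_nonzero(candies == RED))

print(loop_count, vector_count)
```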
---

## Part 1: Environment Setup

First, let's check our RAPIDS installation and DGX Spark capabilities.

**Note:** If you don't have RAPIDS installed, use the NGC container:
```bash
docker run --gpus all -it --rm \
    -v $HOME/workspace:/workspace \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    nvcr.io/nvidia/rapidsai/base:25.11-py3 \
    jupyter lab --ip=0.0.0.0 --allow-root --no-browser
```

**Important:** On DGX Spark (ARM64), always use NGC containers. Never use `pip install torch` - PyTorch ARM64 wheels require the NGC container.

In [None]:
# Check DGX Spark GPU info
import subprocess

print("🖥️ DGX Spark GPU Information")
print("=" * 60)
result = subprocess.run(['nvidia-smi', '--query-gpu=name,memory.total,memory.free,compute_cap', 
                        '--format=csv,noheader'], capture_output=True, text=True)
print(result.stdout)

# Check unified memory
print("\n💾 Unified Memory Info:")
print("   DGX Spark uses unified memory - CPU and GPU share 128GB!")
print("   This means no explicit CPU↔GPU transfers needed.")

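The cell above assumes `nvidia-smi` is on the `PATH` and raises an error if it is not. A slightly more defensive variant might look like the following sketch; `query_gpu` is a hypothetical helper name, and the field names are ones accepted by `nvidia-smi --query-gpu`. It returns `None` instead of failing when no NVIDIA tooling is present, so the notebook can still be opened on a CPU-only machine.

```python
import shutil
import subprocess

def query_gpu(fields=("name", "memory.total")):
    """Return nvidia-smi CSV output for the given fields, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={','.join(fields)}", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # command failed, timed out, or could not start
    return result.stdout.strip()

info = query_gpu()
print(info if info is not None else "nvidia-smi not available on this machine")
```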
In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from time import time
import warnings
warnings.filterwarnings('ignore')

# scikit-learn (CPU)
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as SklearnRF
from sklearn.ensemble import RandomForestRegressor as SklearnRFReg
from sklearn.linear_model import LogisticRegression as SklearnLR
from sklearn.linear_model import Ridge as SklearnRidge
from sklearn.cluster import KMeans as SklearnKMeans
from sklearn.decomposition import PCA as SklearnPCA
from sklearn.neighbors import KNeighborsClassifier as SklearnKNN
from sklearn.preprocessing import StandardScaler as SklearnScaler
from sklearn.metrics import accuracy_score, mean_squared_error

print("✅ scikit-learn (CPU) libraries imported!")

In [None]:
# Import RAPIDS cuML (GPU)
try:
    import cudf
    import cupy as cp
    from cuml.ensemble import RandomForestClassifier as CumlRF
    from cuml.ensemble import RandomForestRegressor as CumlRFReg
    from cuml.linear_model import LogisticRegression as CumlLR
    from cuml.linear_model import Ridge as CumlRidge
    from cuml.cluster import KMeans as CumlKMeans
    from cuml.decomposition import PCA as CumlPCA
    from cuml.neighbors import KNeighborsClassifier as CumlKNN
    from cuml.preprocessing import StandardScaler as CumlScaler
    
    RAPIDS_AVAILABLE = True
    print("✅ RAPIDS cuML (GPU) libraries imported!")
    print(f"   cuDF version: {cudf.__version__}")
    
except ImportError as e:
    RAPIDS_AVAILABLE = False
    print("❌ RAPIDS not available. Please use the NGC container.")
    print(f"   Error: {e}")
    print("\n   To run RAPIDS on DGX Spark, use the NGC container:")
    print("   docker run --gpus all -it --rm \\")
    print("       -v $HOME/workspace:/workspace \\")
    print("       -v $HOME/.cache/huggingface:/root/.cache/huggingface \\")
    print("       --ipc=host \\")
    print("       nvcr.io/nvidia/rapidsai/base:25.11-py3 \\")
    print("       jupyter lab --ip=0.0.0.0 --allow-root --no-browser")

---

## Part 2: cuDF - GPU DataFrames

Before we benchmark ML algorithms, let's explore cuDF - GPU-accelerated DataFrames.

### 🧒 ELI5: cuDF vs pandas

> **pandas**: One person reading through a spreadsheet row by row
>
> **cuDF**: 6,144 people each reading one cell simultaneously!

In [None]:
if RAPIDS_AVAILABLE:
    # Create a large pandas DataFrame
    print("📊 Creating Large Dataset for DataFrame Benchmark...")
    n_rows = 5_000_000  # 5 million rows
    n_cols = 20
    
    # Generate data
    np.random.seed(42)
    data = np.random.randn(n_rows, n_cols).astype(np.float32)
    columns = [f'feature_{i}' for i in range(n_cols)]
    
    # Create pandas DataFrame
    pdf = pd.DataFrame(data, columns=columns)
    pdf['category'] = np.random.choice(['A', 'B', 'C', 'D'], size=n_rows)
    
    print(f"   Shape: {pdf.shape}")
    print(f"   Memory: {pdf.memory_usage(deep=True).sum() / 1e6:.1f} MB")
else:
    print("⚠️ RAPIDS not available - skipping GPU DataFrame demo")

In [None]:
if RAPIDS_AVAILABLE:
    # Benchmark DataFrame operations
    print("⚡ Benchmarking DataFrame Operations")
    print("=" * 60)
    
    results = []
    
    # 1. GroupBy aggregation
    print("\n1️⃣ GroupBy Aggregation")
    
    # pandas
    start = time()
    pdf_result = pdf.groupby('category').agg({'feature_0': ['mean', 'std', 'min', 'max']})
    pandas_groupby_time = time() - start
    print(f"   pandas:  {pandas_groupby_time:.3f} seconds")
    
    # cuDF
    gdf = cudf.DataFrame.from_pandas(pdf)
    start = time()
    gdf_result = gdf.groupby('category').agg({'feature_0': ['mean', 'std', 'min', 'max']})
    cudf_groupby_time = time() - start
    print(f"   cuDF:    {cudf_groupby_time:.3f} seconds")
    print(f"   Speedup: {pandas_groupby_time/cudf_groupby_time:.1f}x")
    
    results.append(('GroupBy', pandas_groupby_time, cudf_groupby_time))
    
    # 2. Sorting
    print("\n2️⃣ Sorting")
    
    start = time()
    pdf_sorted = pdf.sort_values('feature_0')
    pandas_sort_time = time() - start
    print(f"   pandas:  {pandas_sort_time:.3f} seconds")
    
    start = time()
    gdf_sorted = gdf.sort_values('feature_0')
    cudf_sort_time = time() - start
    print(f"   cuDF:    {cudf_sort_time:.3f} seconds")
    print(f"   Speedup: {pandas_sort_time/cudf_sort_time:.1f}x")
    
    results.append(('Sorting', pandas_sort_time, cudf_sort_time))
    
    # 3. Arithmetic operations
    print("\n3️⃣ Arithmetic Operations")
    
    start = time()
    pdf['new_feature'] = pdf['feature_0'] * pdf['feature_1'] + pdf['feature_2'] ** 2
    pandas_arith_time = time() - start
    print(f"   pandas:  {pandas_arith_time:.3f} seconds")
    
    start = time()
    gdf['new_feature'] = gdf['feature_0'] * gdf['feature_1'] + gdf['feature_2'] ** 2
    cudf_arith_time = time() - start
    print(f"   cuDF:    {cudf_arith_time:.3f} seconds")
    print(f"   Speedup: {pandas_arith_time/cudf_arith_time:.1f}x")
    
    results.append(('Arithmetic', pandas_arith_time, cudf_arith_time))

In [None]:
if RAPIDS_AVAILABLE:
    # Visualize DataFrame benchmarks
    fig, ax = plt.subplots(figsize=(10, 6))
    
    operations = [r[0] for r in results]
    pandas_times = [r[1] for r in results]
    cudf_times = [r[2] for r in results]
    
    x = np.arange(len(operations))
    width = 0.35
    
    bars1 = ax.bar(x - width/2, pandas_times, width, label='pandas (CPU)', color='steelblue')
    bars2 = ax.bar(x + width/2, cudf_times, width, label='cuDF (GPU)', color='coral')
    
    ax.set_ylabel('Time (seconds)')
    ax.set_title('DataFrame Operations: pandas vs cuDF')
    ax.set_xticks(x)
    ax.set_xticklabels(operations)
    ax.legend()
    
    # Add speedup annotations
    for i, (p, c) in enumerate(zip(pandas_times, cudf_times)):
        speedup = p / c
        ax.annotate(f'{speedup:.0f}x', xy=(i + width/2, c), ha='center', va='bottom', 
                   fontsize=12, fontweight='bold', color='green')
    
    plt.tight_layout()
    plt.show()
    
    # Cleanup
    del pdf, gdf
    import gc
    gc.collect()

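The timings in this part come from a single call each, and the pandas to cuDF conversion (`from_pandas`) sits outside the timed region. A first GPU call can also include one-time setup cost, which makes single measurements noisy. The helper below is a minimal sketch of a more careful stopwatch: it runs untimed warm-up calls, repeats the measurement, and reports the best time. `time_call` and its parameters are hypothetical names, not part of RAPIDS; the optional `sync` hook is where a GPU synchronization call would go.

```python
from time import perf_counter

def time_call(fn, *args, warmup=1, repeats=3, sync=None, **kwargs):
    """Return the best wall-clock time (seconds) of fn(*args, **kwargs).

    sync: optional zero-argument callable invoked before the clock is read,
          so queued asynchronous work is included in the measurement.
    """
    for _ in range(warmup):          # untimed runs absorb one-time setup costs
        fn(*args, **kwargs)
        if sync is not None:
            sync()
    timings = []
    for _ in range(repeats):
        start = perf_counter()
        fn(*args, **kwargs)
        if sync is not None:
            sync()
        timings.append(perf_counter() - start)
    return min(timings)

# CPU-only demonstration; on a GPU you might pass
# sync=cp.cuda.Stream.null.synchronize (assumes CuPy is installed).
elapsed = time_call(sorted, list(range(100_000, 0, -1)))
print(f"best of 3: {elapsed:.4f} s")
```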
---

## Part 3: cuML - GPU Machine Learning

Now let's benchmark ML algorithms. We'll compare scikit-learn (CPU) with cuML (GPU).

### Key Insight: API Compatibility

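cuML is designed as a near **drop-in replacement** for scikit-learn. Most estimators port by changing only the import, though a few parameters differ (the benchmarks below pass `n_jobs` only to scikit-learn):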

```python
# scikit-learn (CPU)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# cuML (GPU) - SAME API!
from cuml.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
```

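Because the class names and module layout match, the choice of backend can be reduced to a lookup. The following is a sketch under that assumption, not a RAPIDS feature: `load_class` is a hypothetical helper that tries each package in order and returns the first class it finds, so the same notebook can run with or without cuML installed.

```python
import importlib

def load_class(module_path, class_name, packages=("cuml", "sklearn")):
    """Return (class, package_name) from the first package that provides it."""
    for package in packages:
        try:
            module = importlib.import_module(f"{package}.{module_path}")
            return getattr(module, class_name), package
        except (ImportError, AttributeError):
            continue  # package missing or class not provided; try the next one
    raise ImportError(f"{class_name} not found in any of {packages}")

try:
    RandomForest, backend = load_class("ensemble", "RandomForestClassifier")
    print(f"Using {backend}.ensemble.RandomForestClassifier")
except ImportError as exc:
    print(f"No backend available: {exc}")
```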
In [None]:
# Generate large classification dataset
print("📊 Generating Large Classification Dataset...")

n_samples = 1_000_000  # 1 million samples
n_features = 50

X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=30,
    n_redundant=10,
    n_classes=2,
    random_state=42
)

# Convert to float32 (GPU-friendly)
X = X.astype(np.float32)
y = y.astype(np.int32)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"   Samples: {n_samples:,}")
print(f"   Features: {n_features}")
print(f"   Training: {len(X_train):,}")
print(f"   Testing: {len(X_test):,}")
print(f"   Memory: {X.nbytes / 1e6:.1f} MB")

In [None]:
# Benchmark function
def benchmark_classifier(name, sklearn_cls, cuml_cls, sklearn_params, cuml_params,
                         X_train, X_test, y_train, y_test):
    """
    Benchmark scikit-learn vs cuML classifier.
    
    Returns dict with timing and accuracy results.
    """
    results = {'name': name}
    
    # scikit-learn (CPU)
    print(f"\n🔵 {name} - scikit-learn (CPU)")
    sklearn_model = sklearn_cls(**sklearn_params)
    
    start = time()
    sklearn_model.fit(X_train, y_train)
    results['sklearn_train'] = time() - start
    print(f"   Training: {results['sklearn_train']:.2f} seconds")
    
    start = time()
    sklearn_pred = sklearn_model.predict(X_test)
    results['sklearn_infer'] = time() - start
    results['sklearn_acc'] = accuracy_score(y_test, sklearn_pred)
    print(f"   Inference: {results['sklearn_infer']:.3f} seconds")
    print(f"   Accuracy: {results['sklearn_acc']:.4f}")
    
    # cuML (GPU)
    if RAPIDS_AVAILABLE:
        print(f"\n🟠 {name} - cuML (GPU)")
        cuml_model = cuml_cls(**cuml_params)
        
        start = time()
        cuml_model.fit(X_train, y_train)
        results['cuml_train'] = time() - start
        print(f"   Training: {results['cuml_train']:.2f} seconds")
        
        start = time()
        cuml_pred = cuml_model.predict(X_test)
        if hasattr(cuml_pred, 'to_numpy'):
            cuml_pred = cuml_pred.to_numpy()
        results['cuml_infer'] = time() - start
        results['cuml_acc'] = accuracy_score(y_test, cuml_pred)
        print(f"   Inference: {results['cuml_infer']:.3f} seconds")
        print(f"   Accuracy: {results['cuml_acc']:.4f}")
        
        # Speedups
        results['train_speedup'] = results['sklearn_train'] / results['cuml_train']
        results['infer_speedup'] = results['sklearn_infer'] / results['cuml_infer']
        print(f"\n   ⚡ Training Speedup: {results['train_speedup']:.1f}x")
        print(f"   ⚡ Inference Speedup: {results['infer_speedup']:.1f}x")
    
    return results

In [None]:
# Benchmark 1: Random Forest
print("🌲 Benchmark 1: Random Forest Classifier")
print("=" * 60)

rf_results = benchmark_classifier(
    name='Random Forest',
    sklearn_cls=SklearnRF,
    cuml_cls=CumlRF if RAPIDS_AVAILABLE else None,
    sklearn_params={'n_estimators': 100, 'max_depth': 16, 'n_jobs': -1, 'random_state': 42},
    cuml_params={'n_estimators': 100, 'max_depth': 16},
    X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test
)

In [None]:
# Benchmark 2: Logistic Regression
print("\n📈 Benchmark 2: Logistic Regression")
print("=" * 60)

lr_results = benchmark_classifier(
    name='Logistic Regression',
    sklearn_cls=SklearnLR,
    cuml_cls=CumlLR if RAPIDS_AVAILABLE else None,
    sklearn_params={'max_iter': 1000, 'n_jobs': -1},
    cuml_params={'max_iter': 1000},
    X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test
)

In [None]:
# Benchmark 3: K-Nearest Neighbors
# Note: KNN is very expensive on CPU for large datasets
print("\n👥 Benchmark 3: K-Nearest Neighbors")
print("=" * 60)
print("   (Using a 100K-row subset for both libraries to keep the CPU run short)")

# Use the same subset for scikit-learn and cuML so the comparison stays fair
X_train_knn = X_train[:100_000]
y_train_knn = y_train[:100_000]
X_test_knn = X_test[:20_000]
y_test_knn = y_test[:20_000]

knn_results = benchmark_classifier(
    name='K-Nearest Neighbors',
    sklearn_cls=SklearnKNN,
    cuml_cls=CumlKNN if RAPIDS_AVAILABLE else None,
    sklearn_params={'n_neighbors': 5, 'n_jobs': -1},
    cuml_params={'n_neighbors': 5},
    X_train=X_train_knn, X_test=X_test_knn, y_train=y_train_knn, y_test=y_test_knn
)

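KNN is dominated by distance calculations, and those reduce to one large matrix multiplication, which is the kind of work GPUs handle well. The NumPy sketch below shows that reduction on the CPU. It is an illustration of the arithmetic, not cuML's implementation, and `knn_indices` is a hypothetical helper name.

```python
import numpy as np

def knn_indices(X_train, X_query, k):
    """Indices of the k nearest training rows for each query row (brute force)."""
    # ||q - t||^2 = ||q||^2 - 2 q.t + ||t||^2, so the bulk of the work is one matmul.
    q_sq = np.sum(X_query ** 2, axis=1, keepdims=True)      # shape (n_query, 1)
    t_sq = np.sum(X_train ** 2, axis=1)                     # shape (n_train,)
    sq_dists = q_sq - 2.0 * (X_query @ X_train.T) + t_sq    # shape (n_query, n_train)
    # argpartition finds the k smallest per row without a full sort.
    nearest = np.argpartition(sq_dists, k - 1, axis=1)[:, :k]
    # Order those k candidates by distance, closest first.
    order = np.argsort(np.take_along_axis(sq_dists, nearest, axis=1), axis=1)
    return np.take_along_axis(nearest, order, axis=1)

rng = np.random.default_rng(0)
train = rng.standard_normal((1_000, 50)).astype(np.float32)
query = train[:5] + 0.001   # each query sits almost on top of one training row
print(knn_indices(train, query, k=3)[:, 0])
```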
In [None]:
# Benchmark 4: K-Means Clustering
print("\n🎯 Benchmark 4: K-Means Clustering")
print("=" * 60)

def benchmark_clustering(name, sklearn_cls, cuml_cls, sklearn_params, cuml_params, X):
    results = {'name': name}
    
    # scikit-learn
    print(f"\n🔵 {name} - scikit-learn (CPU)")
    sklearn_model = sklearn_cls(**sklearn_params)
    start = time()
    sklearn_model.fit(X)
    results['sklearn_time'] = time() - start
    print(f"   Time: {results['sklearn_time']:.2f} seconds")
    print(f"   Inertia: {sklearn_model.inertia_:.2f}")
    
    # cuML
    if RAPIDS_AVAILABLE and cuml_cls:
        print(f"\n🟠 {name} - cuML (GPU)")
        cuml_model = cuml_cls(**cuml_params)
        start = time()
        cuml_model.fit(X)
        results['cuml_time'] = time() - start
        print(f"   Time: {results['cuml_time']:.2f} seconds")
        print(f"   Inertia: {cuml_model.inertia_:.2f}")
        
        results['speedup'] = results['sklearn_time'] / results['cuml_time']
        print(f"\n   ⚡ Speedup: {results['speedup']:.1f}x")
    
    return results

kmeans_results = benchmark_clustering(
    name='K-Means',
    sklearn_cls=SklearnKMeans,
    cuml_cls=CumlKMeans if RAPIDS_AVAILABLE else None,
    sklearn_params={'n_clusters': 10, 'n_init': 10, 'max_iter': 300, 'random_state': 42},
    cuml_params={'n_clusters': 10, 'n_init': 10, 'max_iter': 300},
    X=X_train
)

In [None]:
# Benchmark 5: PCA
print("\n📉 Benchmark 5: Principal Component Analysis (PCA)")
print("=" * 60)

def benchmark_pca(name, sklearn_cls, cuml_cls, sklearn_params, cuml_params, X):
    results = {'name': name}
    
    # scikit-learn
    print(f"\n🔵 {name} - scikit-learn (CPU)")
    sklearn_model = sklearn_cls(**sklearn_params)
    start = time()
    X_transformed_sklearn = sklearn_model.fit_transform(X)
    results['sklearn_time'] = time() - start
    print(f"   Time: {results['sklearn_time']:.2f} seconds")
    print(f"   Explained variance ratio sum: {sklearn_model.explained_variance_ratio_.sum():.4f}")
    
    # cuML
    if RAPIDS_AVAILABLE and cuml_cls:
        print(f"\n🟠 {name} - cuML (GPU)")
        cuml_model = cuml_cls(**cuml_params)
        start = time()
        X_transformed_cuml = cuml_model.fit_transform(X)
        results['cuml_time'] = time() - start
        print(f"   Time: {results['cuml_time']:.2f} seconds")
        print(f"   Explained variance ratio sum: {cuml_model.explained_variance_ratio_.sum():.4f}")
        
        results['speedup'] = results['sklearn_time'] / results['cuml_time']
        print(f"\n   ⚡ Speedup: {results['speedup']:.1f}x")
    
    return results

pca_results = benchmark_pca(
    name='PCA',
    sklearn_cls=SklearnPCA,
    cuml_cls=CumlPCA if RAPIDS_AVAILABLE else None,
    sklearn_params={'n_components': 10},
    cuml_params={'n_components': 10},
    X=X_train
)

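To see why PCA maps well onto a GPU, it helps to write it out as linear algebra. The sketch below is one textbook formulation (centering followed by a thin SVD) in NumPy; it is not a description of how scikit-learn or cuML implement PCA internally, and `pca_svd` and `X_demo` are hypothetical names. The explained variance ratio it returns corresponds to the quantity printed by the benchmark above.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components using a thin SVD."""
    X_centered = X - X.mean(axis=0)                  # PCA requires centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S ** 2 / (X.shape[0] - 1)   # variance along each component
    ratio = explained_variance / explained_variance.sum()
    X_reduced = X_centered @ Vt[:n_components].T     # scores in the reduced space
    return X_reduced, ratio[:n_components]

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((2_000, 50))
X_demo[:, 0] *= 10.0                                 # give one direction most of the variance
X_reduced, ratio = pca_svd(X_demo, n_components=10)
print(X_reduced.shape, f"{ratio.sum():.4f}")
```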
---

## Part 4: Summary Visualization

In [None]:
if RAPIDS_AVAILABLE:
    # Collect all results
    all_results = [
        ('Random Forest', rf_results.get('sklearn_train', 0), rf_results.get('cuml_train', 0.001)),
        ('Logistic Reg.', lr_results.get('sklearn_train', 0), lr_results.get('cuml_train', 0.001)),
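        # KNN fit() mostly stores the data; the expensive step is predict(), so also compare the inference timings printed above
        ('KNN', knn_results.get('sklearn_train', 0), knn_results.get('cuml_train', 0.001)),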
        ('K-Means', kmeans_results.get('sklearn_time', 0), kmeans_results.get('cuml_time', 0.001)),
        ('PCA', pca_results.get('sklearn_time', 0), pca_results.get('cuml_time', 0.001)),
    ]
    
    # Create visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # 1. Training time comparison
    ax1 = axes[0]
    names = [r[0] for r in all_results]
    sklearn_times = [r[1] for r in all_results]
    cuml_times = [r[2] for r in all_results]
    
    x = np.arange(len(names))
    width = 0.35
    
    bars1 = ax1.bar(x - width/2, sklearn_times, width, label='scikit-learn (CPU)', color='steelblue')
    bars2 = ax1.bar(x + width/2, cuml_times, width, label='cuML (GPU)', color='coral')
    
    ax1.set_ylabel('Time (seconds)')
    ax1.set_title('Training Time: scikit-learn vs cuML')
    ax1.set_xticks(x)
    ax1.set_xticklabels(names, rotation=15)
    ax1.legend()
    ax1.set_yscale('log')
    
    # 2. Speedup comparison
    ax2 = axes[1]
    speedups = [s/c if c > 0 else 0 for s, c in zip(sklearn_times, cuml_times)]
    colors = plt.cm.Greens(np.linspace(0.4, 0.8, len(speedups)))
    
    bars = ax2.bar(names, speedups, color=colors)
    ax2.axhline(y=1, color='red', linestyle='--', label='Break-even')
    ax2.set_ylabel('Speedup (x times faster)')
    ax2.set_title('GPU Speedup over CPU')
    ax2.tick_params(axis='x', labelrotation=15)  # rotate labels without resetting tick positions
    ax2.legend()  # display the 'Break-even' label
    
    # Add speedup annotations
    for bar, speedup in zip(bars, speedups):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
                f'{speedup:.0f}x', ha='center', va='bottom', fontsize=12, fontweight='bold')
    
    plt.tight_layout()
    plt.savefig('rapids_benchmark_summary.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("💾 Saved rapids_benchmark_summary.png")

In [None]:
# Summary table
if RAPIDS_AVAILABLE:
    print("📊 Benchmark Summary Table")
    print("=" * 70)
    
    summary_data = {
        'Algorithm': names,
        'sklearn (CPU)': [f'{t:.2f}s' for t in sklearn_times],
        'cuML (GPU)': [f'{t:.2f}s' for t in cuml_times],
        'Speedup': [f'{s:.1f}x' for s in speedups]
    }
    
    summary_df = pd.DataFrame(summary_data)
    print(summary_df.to_string(index=False))
    
    avg_speedup = np.mean(speedups)
    print(f"\n🚀 Average Speedup: {avg_speedup:.1f}x")
    print(f"   Dataset Size: {n_samples:,} samples × {n_features} features")

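The average above is an arithmetic mean, which a single very large speedup can dominate. For ratios, the geometric mean is a common alternative summary. The sketch below compares the two on made-up numbers (not measured results); `summarize_speedups` is a hypothetical helper name.

```python
from statistics import geometric_mean, mean

def summarize_speedups(speedups):
    """Return (arithmetic_mean, geometric_mean) of positive speedup ratios."""
    positive = [s for s in speedups if s > 0]   # geometric mean needs values > 0
    return mean(positive), geometric_mean(positive)

# Hypothetical speedups, not measured results.
example = [2.0, 8.0, 100.0]
arith, geo = summarize_speedups(example)
print(f"arithmetic: {arith:.1f}x, geometric: {geo:.1f}x")
```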
---

## Part 5: When to Use GPU Acceleration

GPU acceleration isn't always the right choice. Here's when it helps most:

In [None]:
# When to use GPU acceleration
print("💡 When to Use GPU Acceleration")
print("=" * 70)

guidance = """
┌──────────────────────────────────────────────────────────────────────┐
│                    GPU ACCELERATION DECISION GUIDE                   │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ✅ USE GPU (cuML) WHEN:                                             │
│  ───────────────────────                                             │
│  • Dataset > 100K rows                                               │
│  • Many features (> 50)                                              │
│  • Algorithms: KNN, K-Means, PCA, Random Forest                      │
│  • Iterative training (hyperparameter tuning)                        │
│  • Real-time inference requirements                                  │
│                                                                      │
│  ❌ STICK WITH CPU (sklearn) WHEN:                                   │
│  ─────────────────────────────────                                   │
│  • Dataset < 10K rows (GPU overhead dominates)                       │
│  • Simple models (linear regression on small data)                   │
│  • Memory-limited (GPU memory is smaller than system RAM)            │
│  • Debugging/prototyping (sklearn has better error messages)         │
│                                                                      │
│  💡 DGX SPARK ADVANTAGE:                                             │
│  ───────────────────────                                             │
│  • 128GB unified memory = no CPU↔GPU transfers!                      │
│  • Can fit huge datasets entirely in GPU memory                      │
│  • Sweet spot: 100K-10M rows                                         │
│                                                                      │
│  ⚡ BIGGEST SPEEDUPS:                                                │
│  ────────────────────                                                │
│  • K-Nearest Neighbors: 50-100x (distance calculations)              │
│  • K-Means: 50-100x (many iterations)                                │
│  • PCA/SVD: 50-100x (matrix operations)                              │
│  • Random Forest: 10-50x (tree building)                             │
│  • Logistic Regression: 5-20x (iterative optimization)               │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
"""
print(guidance)

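The row-count thresholds in the guide can be written down as a small rule of thumb. This is only a sketch: `suggest_backend` is a hypothetical helper, and the "benchmark" answer for the 10K to 100K range is an assumption added here, since the guide does not say which backend wins in between.

```python
def suggest_backend(n_rows, gpu_available=True):
    """Rule-of-thumb backend choice using the row-count thresholds from the guide."""
    if not gpu_available or n_rows < 10_000:
        return "cpu"        # launch and conversion overhead tends to dominate
    if n_rows > 100_000:
        return "gpu"        # enough work to keep the GPU busy
    return "benchmark"      # 10K-100K rows: measure both before deciding

for rows in (5_000, 50_000, 1_000_000):
    print(f"{rows:>9,} rows -> {suggest_backend(rows)}")
```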
---

## Part 6: Complete Pipeline Example

Let's create a complete ML pipeline using RAPIDS.

In [None]:
if RAPIDS_AVAILABLE:
    print("🔄 Complete GPU-Accelerated ML Pipeline")
    print("=" * 60)
    
    # Start pipeline timer
    pipeline_start = time()
    
    # Step 1: Load data into GPU DataFrame
    print("\n1️⃣ Loading data into GPU...")
    X_train_gdf = cudf.DataFrame(X_train)
    X_test_gdf = cudf.DataFrame(X_test)
    y_train_gdf = cudf.Series(y_train)
    y_test_gdf = cudf.Series(y_test)
    print(f"   ✅ Data loaded to GPU")
    
    # Step 2: Preprocessing - StandardScaler
    print("\n2️⃣ Scaling features (cuML StandardScaler)...")
    scaler = CumlScaler()
    X_train_scaled = scaler.fit_transform(X_train_gdf)
    X_test_scaled = scaler.transform(X_test_gdf)
    print(f"   ✅ Features scaled")
    
    # Step 3: Dimensionality reduction - PCA
    print("\n3️⃣ Reducing dimensions (cuML PCA)...")
    pca = CumlPCA(n_components=20)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)
    print(f"   ✅ Reduced to 20 components (variance retained: {pca.explained_variance_ratio_.sum():.2%})")
    
    # Step 4: Train Random Forest
    print("\n4️⃣ Training Random Forest (cuML)...")
    rf = CumlRF(n_estimators=100, max_depth=16)
    rf.fit(X_train_pca, y_train_gdf)
    print(f"   ✅ Model trained")
    
    # Step 5: Predictions
    print("\n5️⃣ Making predictions...")
    y_pred = rf.predict(X_test_pca)
    y_pred_np = y_pred.to_numpy() if hasattr(y_pred, 'to_numpy') else np.array(y_pred)
    
    # Step 6: Evaluate
    accuracy = accuracy_score(y_test, y_pred_np)
    
    pipeline_time = time() - pipeline_start
    
    print(f"\n✅ Pipeline Complete!")
    print(f"   Total Time: {pipeline_time:.2f} seconds")
    print(f"   Accuracy: {accuracy:.4f}")
    print(f"\n   This entire pipeline ran on GPU!")

---

## ✋ Try It Yourself

### Exercise 1: Benchmark on Different Dataset Sizes

How does the speedup change with dataset size?

<details>
<summary>💡 Hint</summary>
Try sizes: 10K, 100K, 500K, 1M, 5M. Plot speedup vs dataset size.
</details>

In [None]:
# Exercise 1: Your code here
# Benchmark different dataset sizes and plot speedup curve

# sizes = [10_000, 100_000, 500_000, 1_000_000]
# speedups = []
# 
# for size in sizes:
#     X, y = make_classification(n_samples=size, ...)
#     # ... benchmark sklearn vs cuml ...
#     speedups.append(sklearn_time / cuml_time)
# 
# plt.plot(sizes, speedups, 'o-')
# plt.xlabel('Dataset Size')
# plt.ylabel('Speedup (x)')

### Exercise 2: Port a Full sklearn Pipeline

Convert this sklearn pipeline to cuML:

```python
from sklearn.pipeline import Pipeline

sklearn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=20)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
```

<details>
<summary>💡 Hint</summary>
cuML has a Pipeline class too, or you can chain the operations manually.
</details>

### 💡 cuML Pipeline API

cuML provides a Pipeline class similar to scikit-learn:

```python
# Import cuML Pipeline
from cuml.pipeline import Pipeline as CumlPipeline
from cuml.preprocessing import StandardScaler as CumlScaler
from cuml.decomposition import PCA as CumlPCA
from cuml.ensemble import RandomForestClassifier as CumlRF

# Create a GPU-accelerated pipeline
cuml_pipe = CumlPipeline([
    ('scaler', CumlScaler()),
    ('pca', CumlPCA(n_components=20)),
    ('classifier', CumlRF(n_estimators=100))
])

# Use like sklearn Pipeline
cuml_pipe.fit(X_train_cudf, y_train_cudf)
predictions = cuml_pipe.predict(X_test_cudf)
```

**Key notes:**
- Input should be cuDF DataFrames or cupy arrays
- All transformers in the pipeline run on GPU
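- Output type follows the input type by default: cuDF input yields cuDF output (convert with `.to_numpy()`), while CuPy input yields a CuPy array (convert with `cp.asnumpy()`)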

In [None]:
# Exercise 2: Your code here
# Port the sklearn pipeline to cuML

# from cuml.pipeline import Pipeline as CumlPipeline
# 
# cuml_pipe = CumlPipeline([
#     ('scaler', CumlScaler()),
#     ('pca', CumlPCA(n_components=20)),
#     ('classifier', CumlRF(n_estimators=100))
# ])

### Exercise 3: Memory Profiling

Monitor GPU memory usage during training.

<details>
<summary>💡 Hint</summary>
Use `nvidia-smi` or `cupy.get_default_memory_pool().used_bytes()` to monitor memory.
</details>

### 💡 GPU Memory Profiling with CuPy

CuPy provides memory pool utilities to monitor GPU memory usage:

```python
import cupy as cp

# Get current GPU memory usage
def get_gpu_memory_gb():
    """Returns GPU memory used by CuPy in GB."""
    return cp.get_default_memory_pool().used_bytes() / 1e9

# Check memory before/after operations
print(f"Before training: {get_gpu_memory_gb():.2f} GB")
model.fit(X_train, y_train)
print(f"After training: {get_gpu_memory_gb():.2f} GB")

# Free unused GPU memory
cp.get_default_memory_pool().free_all_blocks()
print(f"After cleanup: {get_gpu_memory_gb():.2f} GB")
```

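**Memory pool methods:**
- `used_bytes()` - Memory currently in use by live CuPy allocations
- `total_bytes()` - Total memory held by the pool (in use plus cached)
- `free_all_blocks()` - Release cached, unused blocks back to the GPU

These counters cover only memory allocated through CuPy's default pool. cuDF and cuML typically allocate through RMM (the RAPIDS Memory Manager), so cross-check against `nvidia-smi` for the full picture.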

In [None]:
# Exercise 3: Your code here
# Monitor GPU memory usage

# import cupy as cp
# 
# def get_gpu_memory():
#     return cp.get_default_memory_pool().used_bytes() / 1e9
# 
# print(f"Before: {get_gpu_memory():.2f} GB")
# # ... train model ...
# print(f"After: {get_gpu_memory():.2f} GB")

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting to Convert Data Types

In [None]:
# ❌ Wrong: Using float64 (wastes GPU memory and is slower)
# X = X.astype(np.float64)

# ✅ Right: Use float32 for GPU
# X = X.astype(np.float32)

print("💡 Always use float32 for GPU operations!")
print("   float64 → float32 can halve memory usage and improve performance.")
print("   cuML often requires float32 anyway.")

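The memory claim is easy to check with plain NumPy. The sketch below casts a float64 array to float32 and prints both footprints, along with the largest rounding error the cast introduces; the array shape is arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X64 = rng.standard_normal((100_000, 50))   # NumPy defaults to float64
X32 = X64.astype(np.float32)               # cast once, before moving data to the GPU

print(f"float64: {X64.nbytes / 1e6:.1f} MB")
print(f"float32: {X32.nbytes / 1e6:.1f} MB")
# Largest absolute rounding error introduced by the cast
print(f"max abs difference: {np.abs(X64 - X32).max():.2e}")
```

For standardized features like these, the rounding error is far below the noise level of most classical ML problems, which is why float32 is usually a safe default.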
### Mistake 2: Not Cleaning Up GPU Memory

In [None]:
# ❌ Wrong: Letting GPU memory accumulate
# for i in range(100):
#     model = CumlRF()
#     model.fit(X, y)  # Memory keeps growing!

# ✅ Right: Clean up after each iteration
# import gc
# import cupy as cp
# 
# for i in range(100):
#     model = CumlRF()
#     model.fit(X, y)
#     del model
#     gc.collect()
#     cp.get_default_memory_pool().free_all_blocks()

print("💡 Clean up GPU memory in loops!")
print("   Use: del model; gc.collect(); cp.get_default_memory_pool().free_all_blocks()")

### Mistake 3: Unnecessary CPU↔GPU Transfers

In [None]:
# ❌ Wrong: Converting back and forth
# gdf = cudf.from_pandas(pdf)
# result = gdf.groupby('col').sum()
# pdf_result = result.to_pandas()  # Unnecessary transfer!
# another_result = cudf.from_pandas(pdf_result)  # Back to GPU??

# ✅ Right: Stay on GPU as long as possible
# gdf = cudf.from_pandas(pdf)  # Transfer once
# result = gdf.groupby('col').sum()  # Stay on GPU
# final = result.merge(other_gdf)  # Still on GPU
# pdf_final = final.to_pandas()  # Transfer at end only

print("💡 Minimize CPU↔GPU transfers!")
print("   Transfer to GPU once at the start.")
print("   Do all processing on GPU.")
print("   Transfer back to CPU only at the end.")
print("\n   DGX Spark's unified memory helps, but avoiding transfers is still faster!")

---

## 🎉 Checkpoint

Congratulations! You've mastered GPU acceleration for classical ML. You've learned:

- ✅ **cuDF basics**: GPU-accelerated DataFrames with pandas-like API
- ✅ **cuML algorithms**: Drop-in sklearn replacements running on GPU
- ✅ **Benchmarking**: 10-100x speedups on large datasets
- ✅ **When to use GPU**: Large datasets, many iterations, distance calculations
- ✅ **Best practices**: float32, memory cleanup, minimize transfers

---

## 🚀 Challenge (Optional)

**The Big Data Challenge:**

1. Download the Higgs Boson dataset (11M samples): https://archive.ics.uci.edu/dataset/280/higgs
2. Try loading it with pandas (will be slow!) vs cuDF
3. Train a Random Forest classifier on the full dataset
4. Report your speedups!

This is a realistic large-scale workload: training that can take a very long time on CPU often finishes in minutes on GPU.

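A possible starting point for step 2 is sketched below. It assumes the decompressed file has no header row, with the class label in the first column followed by 28 numeric features; check this against the dataset page before relying on it. `load_higgs`, `HIGGS_COLUMNS`, and the local path in the usage comment are hypothetical names. The function falls back to pandas when cuDF is not installed, and `nrows` lets you try a small slice first.

```python
import pandas as pd

# Assumed layout of the UCI HIGGS CSV: no header row, class label first,
# then 28 numeric features. Verify against the dataset page before relying on it.
HIGGS_COLUMNS = ["label"] + [f"feature_{i}" for i in range(28)]

def load_higgs(path, nrows=None):
    """Read the CSV with cuDF if available, otherwise with pandas."""
    try:
        import cudf
        reader, backend = cudf.read_csv, "cudf"
    except ImportError:
        reader, backend = pd.read_csv, "pandas"
    df = reader(path, header=None, names=HIGGS_COLUMNS, nrows=nrows)
    return df, backend

# Usage (hypothetical local path to the decompressed file):
# df, backend = load_higgs("HIGGS.csv", nrows=1_000_000)
```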
---

## 📖 Further Reading

- [RAPIDS cuML Documentation](https://docs.rapids.ai/api/cuml/stable/)
- [RAPIDS cuDF Documentation](https://docs.rapids.ai/api/cudf/stable/)
- [RAPIDS AI Getting Started](https://rapids.ai/start.html)
- [cuML vs scikit-learn Benchmarks](https://medium.com/rapids-ai/)

---

## 🧹 Cleanup

In [None]:
# Clean up GPU memory
import gc

# Delete large arrays
del X, y, X_train, X_test, y_train, y_test

if RAPIDS_AVAILABLE:
    import cupy as cp
    # Free GPU memory
    gc.collect()
    cp.get_default_memory_pool().free_all_blocks()

gc.collect()

print("✅ Memory cleaned up!")

---

## ➡️ Next Steps

Continue to **Lab 1.6.4: Baseline Comparison Framework** to create a reusable framework for comparing models!