# FlashAttention-2 Multi-GPU Test

This notebook builds the FlashAttention-2 CUDA kernels and tests tensor-parallel attention across multiple GPUs.

**Note:** Colab free tier has 1 GPU. For true multi-GPU testing, use:
- Colab Pro (A100 x2)
- Kaggle (T4 x2)
- Cloud VMs with multiple GPUs
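As a rough illustration of what tensor-parallel attention means, the sketch below assumes the common scheme in which attention heads are split across devices and the per-device outputs are gathered afterwards. The repository's kernels may partition the work differently; the "ranks" here are simulated in plain Python, and all names (`attention`, `world_size`, `shards`) are illustrative.

```python
import math
import random

def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of rows."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_heads, seq_len, head_dim, world_size = 4, 3, 2, 2

# One (q, k, v) triple per head.
heads = [
    tuple(rand_matrix(rng, seq_len, head_dim) for _ in range(3))
    for _ in range(num_heads)
]

# Reference: every head computed on a single device.
reference = [attention(q, k, v) for q, k, v in heads]

# Head-sharded: each simulated rank handles a contiguous slice of heads.
per_rank = num_heads // world_size
shards = [
    [attention(q, k, v) for q, k, v in heads[r * per_rank:(r + 1) * per_rank]]
    for r in range(world_size)
]

# Concatenating in rank order plays the role of an all-gather.
gathered = [head_out for shard in shards for head_out in shard]

print("sharded result matches reference:", gathered == reference)
```

Because heads do not interact inside attention, each rank can work on its slice independently; communication is only needed to combine the results.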

## 1. Environment Setup

In [None]:
import subprocess

# Check available GPUs
result = subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True)
print("Available GPUs:")
print(result.stdout)

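# Count GPUs: each device line in `nvidia-smi -L` output starts with "GPU <index>:"
gpu_count = sum(line.startswith('GPU ') for line in result.stdout.splitlines())
print(f"Total GPUs: {gpu_count}")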

In [None]:
# Install NCCL (required for multi-GPU)
!apt-get update -qq
!apt-get install -y -qq libnccl2 libnccl-dev
print("NCCL installed!")
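The install cell prints its success message unconditionally, so it can help to confirm that the NCCL shared library actually loads. The following is a minimal sketch that calls NCCL's `ncclGetVersion` through `ctypes`; the helper names `decode_nccl_version` and `nccl_version` are made up for this example, and the lookup assumes the library is discoverable under the name `nccl`.

```python
import ctypes
import ctypes.util

def decode_nccl_version(code):
    """Split NCCL's integer version code into (major, minor, patch)."""
    # NCCL up to 2.8 encodes major*1000 + minor*100 + patch;
    # 2.9 and later encode major*10000 + minor*100 + patch.
    base = 1000 if code < 10000 else 10000
    return code // base, (code % base) // 100, code % 100

def nccl_version():
    """Return the NCCL version tuple, or None if the library cannot be used."""
    name = ctypes.util.find_library("nccl")
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
        version = ctypes.c_int(0)
        status = lib.ncclGetVersion(ctypes.byref(version))  # 0 is ncclSuccess
    except (OSError, AttributeError):
        return None
    return decode_nccl_version(version.value) if status == 0 else None

print("NCCL version:", nccl_version())
```

A result of `None` suggests the library was not installed where the dynamic loader can find it.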

In [None]:
# Clone repository
!rm -rf ollama-api-gateway
!git clone --depth 1 https://github.com/umerkhan95/ollama-api-gateway.git
%cd ollama-api-gateway/mojo-gateway

## 2. Build Multi-GPU Kernels

In [None]:
%cd src/kernels/cuda
!make clean

# Build FlashAttention-2 (single GPU)
# sm_75 targets Turing GPUs such as the T4; use sm_80 for A100 or sm_70 for V100.
!make CUDA_ARCH="-gencode arch=compute_75,code=sm_75" \
      NVCC_FLAGS_COMMON="-O3 -Xcompiler -fPIC -Xcompiler -Wall" \
      fa2

print("\nSingle-GPU FA2 built!")

In [None]:
# Build Multi-GPU FlashAttention-2
!make CUDA_ARCH="-gencode arch=compute_75,code=sm_75" \
      NVCC_FLAGS_COMMON="-O3 -Xcompiler -fPIC -Xcompiler -Wall" \
      fa2-multi-gpu

print("\nMulti-GPU FA2 built!")
!ls -la ../../../lib/*.so
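The build cells above hardcode `sm_75`, which matches the T4 but not the A100 or V100 machines mentioned elsewhere in this notebook. One possible way to derive the flag from the GPUs actually present is sketched below. It assumes a driver recent enough for `nvidia-smi` to support the `compute_cap` query field; the helper names are hypothetical, and the resulting string would still need to be passed to `make` as `CUDA_ARCH`.

```python
import re
import subprocess

def gencode_flag(compute_cap):
    """Turn a compute capability such as '7.5' into an nvcc -gencode flag."""
    major, minor = compute_cap.strip().split(".")
    sm = f"{int(major)}{int(minor)}"
    return f"-gencode arch=compute_{sm},code=sm_{sm}"

def detect_gencode_flags():
    """Query every visible GPU and return one flag per distinct architecture."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    # Keep only well-formed "major.minor" lines; older drivers may print "[N/A]".
    caps = sorted({
        line.strip() for line in out.splitlines()
        if re.fullmatch(r"\d+\.\d+", line.strip())
    })
    return [gencode_flag(cap) for cap in caps]

print(" ".join(detect_gencode_flags()) or "no GPU detected")
```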

## 3. Compile and Run Multi-GPU Test

In [None]:
# Compile test binary
!nvcc -O3 -gencode arch=compute_75,code=sm_75 \
    -o test_multi_gpu test_flash_attention_v2_multi_gpu.cu \
    flash_attention_v2.o flash_attention_v2_multi_gpu.o \
    -lnccl -lpthread -lcudart

print("Test binary compiled!")

In [None]:
# Run multi-GPU benchmark
!./test_multi_gpu
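If the benchmark hangs or fails during communicator setup, NCCL's `NCCL_DEBUG` environment variable can make the cause visible. The sketch below wraps the run in `subprocess` with that variable set. It assumes the test binary signals failure through a nonzero exit code, which the source does not confirm; `run_with_nccl_debug` is a name invented for this example.

```python
import os
import subprocess

def run_with_nccl_debug(cmd, level="INFO"):
    """Run cmd with NCCL_DEBUG set; return (exit code, combined output)."""
    env = dict(os.environ, NCCL_DEBUG=level)
    proc = subprocess.run(
        cmd, env=env, text=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    return proc.returncode, proc.stdout

# Only run if the test binary from the compile step is present.
if os.path.exists("./test_multi_gpu"):
    code, output = run_with_nccl_debug(["./test_multi_gpu"])
    print(output)
    print("PASS" if code == 0 else f"FAIL (exit code {code})")
```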

## 4. Single-GPU vs. Multi-GPU Environment Check

In [None]:
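# Report whether this environment can exercise multi-GPU tensor parallelism
import subprocess
result = subprocess.run(['nvidia-smi', '-L'], capture_output=True, text=True)
# Each device line in `nvidia-smi -L` output starts with "GPU <index>:"
gpu_count = sum(line.startswith('GPU ') for line in result.stdout.splitlines())

if gpu_count <= 1: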
    print("="*60)
    print("  SINGLE GPU ENVIRONMENT DETECTED")
    print("="*60)
    print()
    print("Multi-GPU tensor parallelism cannot be fully tested.")
    print("The test above validates:")
    print("  - Code compiles with NCCL")
    print("  - Single-GPU initialization works")
    print("  - API functions are callable")
    print()
    print("For true multi-GPU testing, use:")
    print("  - Colab Pro (A100 x2)")
    print("  - Kaggle (T4 x2)")
    print("  - AWS p3.8xlarge (V100 x4)")
    print("  - GCP a2-highgpu-4g (A100 x4)")
else:
    print(f"Multi-GPU environment: {gpu_count} GPUs")
    print("Full tensor parallel testing available!")

## 5. Expected Performance

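The figures below are estimates, not measurements. They scale the 708 tok/s single-GPU figure by the speedup expected from tensor parallelism once NCCL communication overhead is taken into account: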

| GPUs | Speedup  | Throughput (est.) |
|------|----------|-------------------|
| 1    | 1.0x     | 708 tok/s         |
| 2    | 1.7-2.0x | 1200-1400 tok/s   |
| 4    | 2.5-3.1x | 1770-2200 tok/s   |
| 8    | 3.5-4.0x | 2500-2800 tok/s   |
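The arithmetic behind the table can be reproduced with a short sketch. It multiplies the baseline by each speedup bound and also reports parallel efficiency (speedup divided by GPU count), a quantity the table does not list. The constants are copied from the table; `scaling_row` is a helper introduced only for this example.

```python
BASELINE_TOK_S = 708  # single-GPU figure from the table

# (GPU count, low speedup, high speedup), copied from the table
ESTIMATES = [(1, 1.0, 1.0), (2, 1.7, 2.0), (4, 2.5, 3.1), (8, 3.5, 4.0)]

def scaling_row(gpus, low, high, baseline=BASELINE_TOK_S):
    """Return throughput and parallel-efficiency ranges for one table row."""
    throughput = (baseline * low, baseline * high)
    efficiency = (low / gpus, high / gpus)
    return throughput, efficiency

for gpus, low, high in ESTIMATES:
    (t_lo, t_hi), (e_lo, e_hi) = scaling_row(gpus, low, high)
    print(f"{gpus} GPU(s): {t_lo:.0f}-{t_hi:.0f} tok/s, "
          f"efficiency {e_lo:.0%}-{e_hi:.0%}")
```

The table's throughput values are these products rounded to convenient figures. Under these estimates, parallel efficiency falls from 85-100% at 2 GPUs to roughly 44-50% at 8, which is the expected effect of communication overhead growing with GPU count.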

In [None]:
# Cleanup
!rm -f test_multi_gpu
print("Cleanup complete!")