# Bud Flow Lang - Introduction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/anthropics/bud_flow_lang/blob/main/docs/notebooks/01_introduction.ipynb)

Welcome to Bud Flow Lang! This notebook introduces high-performance SIMD array computing in Python.

## What You'll Learn
1. How to install Bud Flow Lang
2. Creating and manipulating arrays
3. Basic arithmetic operations
4. Reductions and dot products
5. Working with NumPy

## Step 1: Installation

Run this cell to install Bud Flow Lang from source. This takes about 2-3 minutes.

In [None]:
# Install build dependencies
!apt-get update -qq && apt-get install -qq -y cmake g++ > /dev/null

# Clone and build
!if [ ! -d "bud_flow_lang" ]; then \
    git clone --depth 1 https://github.com/anthropics/bud_flow_lang.git && \
    cd bud_flow_lang && mkdir -p build && cd build && \
    cmake .. -DCMAKE_BUILD_TYPE=Release -DBUD_BUILD_PYTHON=ON > /dev/null && \
    make -j4 2>&1 | tail -5; \
fi

# This line runs even if the build failed; the import in the next cell is the real check.
print("Build step finished.")
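
A failed build does not stop the notebook, so it can help to confirm that a compiled extension module exists before importing it. The sketch below is one hedged way to do that with the standard library. The helper name `find_extension` is hypothetical, and it assumes the compiled module lands in the build directory under a name starting with `bud_flow_lang_py`, which the `sys.path` setup in the next cell suggests but this notebook does not state.

```python
from pathlib import Path

def find_extension(build_dir, module_name="bud_flow_lang_py"):
    """Return compiled extension files for module_name found in build_dir."""
    suffixes = (".so", ".pyd")  # typical extension-module suffixes (Unix-like, Windows)
    root = Path(build_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.glob(f"{module_name}*")
        if p.is_file() and p.suffix in suffixes
    )

# Path taken from the import cell in this notebook.
matches = find_extension("/content/bud_flow_lang/build")
if matches:
    print("Found extension module:", matches[0].name)
else:
    print("No extension module found; re-run the build cell and read its output.")
```

If nothing is found, removing the `> /dev/null` redirects and the `| tail -5` filter from the build cell shows the full CMake and compiler output.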

In [None]:
# Add to Python path and import
import sys
sys.path.insert(0, '/content/bud_flow_lang/build')

import bud_flow_lang_py as flow
import numpy as np

# Initialize the runtime
flow.initialize()
print("Bud Flow Lang is ready!")

## Step 2: Check Your Hardware

Let's see what SIMD capabilities your CPU has:

In [None]:
# Get hardware information
info = flow.get_hardware_info()

print("=" * 50)
print("HARDWARE INFORMATION")
print("=" * 50)
print(f"Architecture: {info['arch_family']} ({'64-bit' if info['is_64bit'] else '32-bit'})")
print(f"SIMD Width: {info['simd_width']} bytes ({info['simd_width']*8} bits)")
print(f"CPU Cores: {info['physical_cores']} physical, {info['logical_cores']} logical")
print()
print("SIMD Support:")
print(f"  SSE2:    {info['has_sse2']}")
print(f"  AVX:     {info['has_avx']}")
print(f"  AVX2:    {info['has_avx2']}")
print(f"  AVX-512: {info['has_avx512']}")
print(f"  NEON:    {info['has_neon']}")
print(f"  SVE:     {info['has_sve']}")
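
The SIMD width reported above is in bytes, so the number of elements processed per instruction depends on the element size. As a small sketch (the helper `lanes_per_register` is hypothetical and not part of Bud Flow Lang), the lane count is just the register width divided by the item size:

```python
def lanes_per_register(simd_width_bytes, itemsize_bytes=4):
    """Number of elements of the given size that fit in one SIMD register."""
    if simd_width_bytes <= 0 or itemsize_bytes <= 0:
        raise ValueError("sizes must be positive")
    return simd_width_bytes // itemsize_bytes

# Common register widths in bytes: 16 (SSE2, NEON), 32 (AVX/AVX2), 64 (AVX-512).
for width in (16, 32, 64):
    print(f"{width * 8:>3}-bit register: "
          f"{lanes_per_register(width, 4)} x float32, "
          f"{lanes_per_register(width, 8)} x float64")
```

Passing `info['simd_width']` from the cell above gives the lane count for the machine running the notebook.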

In [None]:
# Check cache configuration
cache = flow.detect_cache_config()

print("\nCACHE CONFIGURATION")
print("=" * 50)
print(f"L1 Cache: {cache['l1_size_kb']} KB")
print(f"L2 Cache: {cache['l2_size_kb']} KB")
print(f"L3 Cache: {cache['l3_size_kb']} KB")
print(f"Cache Line: {cache['line_size']} bytes")
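
Cache sizes matter because an array that fits in cache is much cheaper to traverse repeatedly than one that spills to main memory. The following is a rough sketch of that reasoning: the helper name is hypothetical, the cache sizes are made-up illustration values, and 4-byte `float32` elements are assumed. It reuses the dictionary keys printed above.

```python
def smallest_fitting_level(nbytes, cache):
    """Return 'L1', 'L2' or 'L3' for the smallest cache that holds nbytes, else 'RAM'."""
    for label, key in (("L1", "l1_size_kb"), ("L2", "l2_size_kb"), ("L3", "l3_size_kb")):
        if nbytes <= cache[key] * 1024:
            return label
    return "RAM"

# Hypothetical cache sizes for illustration; substitute the dict returned above.
example_cache = {"l1_size_kb": 32, "l2_size_kb": 1024, "l3_size_kb": 32768}

for n in (1_000, 100_000, 1_000_000, 100_000_000):
    nbytes = n * 4  # assuming 4-byte float32 elements
    level = smallest_fitting_level(nbytes, example_cache)
    print(f"{n:>11,} elements = {nbytes / 1024:>10.1f} KB -> {level}")
```

An operation that reads several arrays at once needs room for all of them, so the effective working set is a multiple of a single array's size.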

## Step 3: Creating Arrays

Bud Flow Lang provides several ways to create arrays:

In [None]:
# Create arrays filled with values
zeros = flow.zeros(10)
ones = flow.ones(10)
filled = flow.full(10, 3.14)

print("zeros(10):  ", zeros.to_numpy())
print("ones(10):   ", ones.to_numpy())
print("full(10, π):", filled.to_numpy())

In [None]:
# Create sequences
range_arr = flow.arange(10)              # [0, 1, 2, ..., 9]
range_step = flow.arange(5, 0.0, 2.0)    # 5 elements: [0, 2, 4, 6, 8]
linear = flow.linspace(0.0, 1.0, 5)

print("arange(10):          ", range_arr.to_numpy())
print("arange(5, 0.0, 2.0): ", range_step.to_numpy())
print("linspace(0, 1, 5):   ", linear.to_numpy())
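
Note that the three-argument `flow.arange` call above takes the element count first, then the start and step, according to its comment. That differs from NumPy, where `np.arange(5, 0.0, 2.0)` means start 5, stop 0, step 2 and returns an empty array. The pure-Python sketch below spells out the semantics the comments describe. The function names are hypothetical, and it assumes `linspace` includes both endpoints as NumPy's default does.

```python
def arange_count(count, start=0.0, step=1.0):
    """count values: start, start + step, start + 2*step, ..."""
    return [start + i * step for i in range(count)]

def linspace(start, stop, num):
    """num evenly spaced values from start to stop, both endpoints included."""
    if num == 1:
        return [float(start)]
    spacing = (stop - start) / (num - 1)
    return [start + i * spacing for i in range(num)]

print(arange_count(5, 0.0, 2.0))  # [0.0, 2.0, 4.0, 6.0, 8.0]
print(linspace(0.0, 1.0, 5))      # [0.0, 0.25, 0.5, 0.75, 1.0]
```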

In [None]:
# Create from NumPy arrays
np_arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
flow_arr = flow.flow(np_arr)

print("From NumPy:", flow_arr.to_numpy())

# Create from Python list
flow_list = flow.flow([10.0, 20.0, 30.0])
print("From list: ", flow_list.to_numpy())

In [None]:
# Array properties
a = flow.ones(1000)

print(f"Size:     {a.size} elements")
print(f"Shape:    {a.shape}")
print(f"Dtype:    {a.dtype}")
print(f"Bytes:    {a.nbytes}")
print(f"Itemsize: {a.itemsize} bytes per element")

## Step 4: Arithmetic Operations

All operations are SIMD-accelerated:

In [None]:
a = flow.arange(5)     # [0, 1, 2, 3, 4]
b = flow.full(5, 2.0)  # [2, 2, 2, 2, 2]

print("a =", a.to_numpy())
print("b =", b.to_numpy())
print()

# Element-wise operations
print("a + b =", (a + b).to_numpy())
print("a - b =", (a - b).to_numpy())
print("a * b =", (a * b).to_numpy())
print("a / b =", (a / b).to_numpy())

In [None]:
# Unary operations
x = flow.flow([1.0, 4.0, 9.0, 16.0])
print("x =", x.to_numpy())
print("sqrt(x) =", flow.sqrt(x).to_numpy())
print("abs(-x) =", flow.abs(-x).to_numpy())

## Step 5: Reductions

Compute aggregate values over arrays:

In [None]:
a = flow.arange(10)  # [0, 1, 2, ..., 9]

print(f"Array: {a.to_numpy()}")
print(f"Sum:   {a.sum()}")    # 45
print(f"Min:   {a.min()}")    # 0
print(f"Max:   {a.max()}")    # 9
print(f"Mean:  {a.mean()}")   # 4.5
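
To give a feel for how a SIMD reduction can work, here is a conceptual pure-Python model: keep one partial sum per lane, then add the lanes together at the end. This is only a sketch of the general technique, not Bud Flow Lang's actual implementation, and the function name `lane_sum` is hypothetical.

```python
def lane_sum(values, lanes=8):
    """Sum values the way a SIMD loop might: per-lane partial sums, then a horizontal add."""
    partial = [0.0] * lanes
    full = len(values) - len(values) % lanes
    # Main loop: each step adds one register-width chunk to the lane accumulators.
    for i in range(0, full, lanes):
        for lane in range(lanes):
            partial[lane] += values[i + lane]
    total = sum(partial)          # horizontal add across lanes
    for v in values[full:]:       # scalar tail for leftover elements
        total += v
    return total

data = [float(i) for i in range(10)]
print(lane_sum(data, lanes=4))    # 45.0
print(sum(data) / len(data))      # 4.5
```

Because floating-point addition is not associative, a lane-wise sum can differ from a strictly sequential sum in the last few bits on large inputs.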

## Step 6: Dot Product

Compute the inner product of two vectors:

In [None]:
a = flow.ones(1000)
b = flow.full(1000, 2.0)

# Two equivalent ways
dot1 = flow.dot(a, b)
dot2 = a.dot(b)

print(f"flow.dot(a, b) = {dot1}")
print(f"a.dot(b)       = {dot2}")
print(f"Expected: 1000 * 1 * 2 = {1000 * 1 * 2}")
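
For reference, the inner product is the sum of element-wise products. A plain-Python version (the name `dot_reference` is hypothetical) is handy for checking small cases by hand:

```python
def dot_reference(x, y):
    """Inner product: sum of x[i] * y[i]."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))

print(dot_reference([1.0] * 1000, [2.0] * 1000))        # 2000.0
print(dot_reference([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```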

## Step 7: Fused Multiply-Add (FMA)

FMA computes `a * b + c` in a single operation. This is:
1. **Faster** - one pass over memory instead of two
2. **More accurate** - single rounding instead of two
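
The accuracy point can be seen without any SIMD hardware. The sketch below uses Python's 64-bit floats rather than `float32`, and it emulates the fused result by computing `a * b + c` exactly with `fractions.Fraction` and rounding once. It illustrates single versus double rounding and makes no claim about how `flow.fma` is implemented.

```python
from fractions import Fraction

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Two roundings: a * b is rounded to a double, then the sum is rounded again.
separate = a * b + c

# One rounding: evaluate a * b + c exactly, then round to a double once.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print(separate)  # 0.0
print(fused)     # -2**-60, about -8.67e-19
```

The exact product is `1 - 2**-60`, which rounds to `1.0` as a double, so the separate version loses the small term entirely while the single-rounding version keeps it.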

In [None]:
a = flow.ones(1000000)       # 1 million elements
b = flow.full(1000000, 2.0)
c = flow.full(1000000, 3.0)

# FMA: a * b + c = 1 * 2 + 3 = 5
result = flow.fma(a, b, c)

print(f"FMA result (first 5): {result.to_numpy()[:5]}")
print(f"Sum: {result.sum()} (expected: 5,000,000)")
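
A back-of-the-envelope model shows where the "one pass instead of two" saving comes from. The sketch assumes the unfused expression materializes a temporary array for `a * b` (as NumPy does) and ignores caches and other hardware details, so treat the numbers as rough estimates. `traffic_bytes` is a hypothetical helper.

```python
def traffic_bytes(n, itemsize=4):
    """Estimated bytes read + written for a * b + c on n-element arrays."""
    # Separate ops: tmp = a * b (read a, b; write tmp), then tmp + c (read tmp, c; write result).
    separate = 6 * n * itemsize
    # Fused: read a, b, c and write result in a single pass.
    fused = 4 * n * itemsize
    return separate, fused

separate, fused = traffic_bytes(1_000_000)
print(f"separate: {separate / 1e6:.0f} MB, fused: {fused / 1e6:.0f} MB, "
      f"ratio: {separate / fused:.2f}x")
# separate: 24 MB, fused: 16 MB, ratio: 1.50x
```

Under this model, a memory-bound fused operation would be at most about 1.5x faster than the two-step version; measured speedups can be higher or lower.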

In [None]:
import time

# Compare FMA vs separate operations
n = 1000000
a = flow.ones(n)
b = flow.full(n, 2.0)
c = flow.full(n, 3.0)

# Warmup
_ = flow.fma(a, b, c)
_ = a * b + c

# Benchmark FMA
start = time.perf_counter()
for _ in range(100):
    result = flow.fma(a, b, c)
fma_time = (time.perf_counter() - start) / 100 * 1000

# Benchmark separate ops
start = time.perf_counter()
for _ in range(100):
    result = a * b + c
sep_time = (time.perf_counter() - start) / 100 * 1000

print(f"FMA:      {fma_time:.3f} ms")
print(f"a*b + c:  {sep_time:.3f} ms")
print(f"Speedup:  {sep_time/fma_time:.2f}x")

## Step 8: NumPy Interoperability

Flow arrays convert to and from NumPy arrays with `flow.flow()` and `.to_numpy()`:

In [None]:
# Create NumPy array
np_data = np.random.randn(1000).astype(np.float32)

# Convert to Flow
flow_data = flow.flow(np_data)

# Process with Flow
flow_result = flow_data * 2 + 1

# Convert back to NumPy
np_result = flow_result.to_numpy()

# Verify
np_expected = np_data * 2 + 1
print(f"Results match: {np.allclose(np_result, np_expected)}")
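
The check above uses `np.allclose` rather than exact equality because two correct `float32` computations can differ in the last bits, for example when one side fuses operations. As a sketch of the documented rule, `np.allclose` tests `|actual - expected| <= atol + rtol * |expected|` element-wise with defaults `rtol=1e-5` and `atol=1e-8`. The plain-Python version below (hypothetical name, no broadcasting) mirrors that rule:

```python
def allclose_reference(actual, expected, rtol=1e-5, atol=1e-8):
    """Element-wise |actual - expected| <= atol + rtol * |expected|."""
    if len(actual) != len(expected):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(actual, expected))

print(allclose_reference([3.0000001], [3.0]))  # True: difference is within rtol
print(allclose_reference([3.001], [3.0]))      # False: difference exceeds the tolerance
```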

## Step 9: Performance Comparison

Let's compare Flow vs NumPy performance:

In [None]:
import time

def benchmark(name, flow_func, np_func, iterations=50):
    """Benchmark Flow vs NumPy."""
    # Warmup
    for _ in range(10):
        flow_func()
        np_func()
    
    # Flow timing
    start = time.perf_counter()
    for _ in range(iterations):
        flow_func()
    flow_time = (time.perf_counter() - start) / iterations * 1000
    
    # NumPy timing
    start = time.perf_counter()
    for _ in range(iterations):
        np_func()
    np_time = (time.perf_counter() - start) / iterations * 1000
    
    speedup = np_time / flow_time
    print(f"{name:20} Flow={flow_time:.3f}ms  NumPy={np_time:.3f}ms  Ratio={speedup:.2f}x")
    return flow_time, np_time

# Setup arrays
n = 1_000_000
np_a = np.ones(n, dtype=np.float32)
np_b = np.ones(n, dtype=np.float32)
np_c = np.ones(n, dtype=np.float32)
flow_a = flow.ones(n)
flow_b = flow.ones(n)
flow_c = flow.ones(n)

print(f"Benchmarking with {n:,} elements")
print("=" * 65)

benchmark("Sum", 
          lambda: flow_a.sum(), 
          lambda: np_a.sum())

benchmark("Dot Product", 
          lambda: flow.dot(flow_a, flow_b), 
          lambda: np.dot(np_a, np_b))

benchmark("Element-wise Add", 
          lambda: flow_a + flow_b, 
          lambda: np_a + np_b)

benchmark("FMA (a*b+c)", 
          lambda: flow.fma(flow_a, flow_b, flow_c), 
          lambda: np_a * np_b + np_c)
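
Averaging a single timed loop, as `benchmark` does, is sensitive to noise from other processes, which is common on shared Colab machines. One alternative is to repeat the measurement and keep the fastest run using the standard-library `timeit` module. The sketch below shows that approach with a stand-in workload; `best_ms` is a hypothetical helper and is not part of Bud Flow Lang.

```python
import timeit

def best_ms(func, repeat=5, number=20):
    """Best per-call time in milliseconds over `repeat` runs of `number` calls each."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1000

# Stand-in workload so the sketch runs without Flow or NumPy installed.
data = list(range(100_000))
print(f"sum over 100,000 ints: {best_ms(lambda: sum(data)):.3f} ms")
```

Any of the lambdas passed to `benchmark` above could be passed to `best_ms` instead, for example `best_ms(lambda: flow_a.sum())`.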

## Key Takeaways

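1. **FMA is Flow's strength** - `flow.fma` fuses the multiply and the add, while NumPy's `a * b + c` makes two passes and allocates a temporary array
2. **Element-wise operations are competitive** - Similar performance to NumPy
3. **Reductions favor NumPy** - `np.dot` calls an optimized BLAS library that may be multi-threaded (`np.sum` does not go through BLAS)
4. **Seamless NumPy interop** - `flow.flow()` and `.to_numpy()` make it easy to mix and match

Check these against the numbers printed above; results vary by CPU and NumPy build.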

## Next Steps

- **[NumPy Comparison](02_numpy_comparison.ipynb)** - Detailed performance analysis
- **[Advanced Features](03_advanced.ipynb)** - Memory optimization, JIT compilation
- **[API Reference](../api/core.md)** - Complete function documentation

In [None]:
# Clean up
flow.shutdown()
print("Done!")