
NUMA (Non-Uniform Memory Access) means that memory access time depends on which CPU socket accesses which memory node:

*   Local memory access → faster
*   Remote memory access → slower
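
Whether the local/remote distinction applies at all depends on how many memory nodes the machine exposes. A minimal sketch, assuming a Linux system that publishes one `nodeN` directory per NUMA node under `/sys/devices/system/node`; the function name `list_numa_nodes` and its `sysfs_root` parameter are hypothetical helpers, not part of any library:

```python
import re
from pathlib import Path

def list_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted NUMA node ids found under sysfs_root (empty if unavailable)."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not Linux, or sysfs is not mounted
    nodes = []
    for entry in root.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match and entry.is_dir():
            nodes.append(int(match.group(1)))
    return sorted(nodes)

nodes = list_numa_nodes()
print(f"NUMA nodes visible: {nodes}")
if len(nodes) <= 1:
    print("At most one node: every memory access is local on this machine.")
```

A machine with a single node cannot show a local versus remote difference, so this check is worth running before interpreting any bandwidth numbers.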





In [1]:
!sudo apt install -y numactl python3-numpy

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
  python-numpy-doc python3-dev python3-pytest
The following NEW packages will be installed:
  numactl python3-numpy
0 upgraded, 2 newly installed, 0 to remove and 1 not upgraded.
Need to get 3,504 kB of archives.
After this operation, 19.1 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 numactl amd64 2.0.14-3ubuntu2 [36.8 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 python3-numpy amd64 1:1.21.5-1ubuntu22.04.1 [3,467 kB]
Fetched 3,504 kB in 1s (2,659 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 2.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controll

In [2]:
!lscpu

Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             46 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      2
  On-line CPU(s) list:       0,1
Vendor ID:                   GenuineIntel
  Model name:                Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:              6
    Model:                   79
    Thread(s) per core:      2
    Core(s) per socket:      1
    Socket(s):               1
    Stepping:                0
    BogoMIPS:                4399.99
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pg
                             e mca cmov pat pse36 clflush mmx fxsr sse sse2 ss h
                             t syscall nx pdpe1gb rdtscp lm constant_tsc rep_goo
                             d nopl xtopology nonstop_tsc cpuid tsc_known_freq p
                             ni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2ap
                   

In [3]:
!numactl --hardware

available: 1 nodes (0)
node 0 cpus: 0 1
node 0 size: 12975 MB
node 0 free: 8621 MB
node distances:
node   0 
  0:  10 
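
The distance table is easier to interpret once parsed: a value of 10 is the baseline for a node reaching its own memory, and larger values indicate relatively slower remote access. The sketch below parses the `node distances:` section of `numactl --hardware` text. `COLAB_OUTPUT` is copied from the output above, while `TWO_NODE_EXAMPLE` is a made-up two-socket sample included only for contrast; `parse_node_distances` is a hypothetical helper name.

```python
import re

COLAB_OUTPUT = """available: 1 nodes (0)
node 0 cpus: 0 1
node 0 size: 12975 MB
node 0 free: 8621 MB
node distances:
node   0 
  0:  10 
"""

# Hypothetical two-node machine, for illustration only (not measured).
TWO_NODE_EXAMPLE = """node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
"""

def parse_node_distances(text):
    """Return {node_id: [distance to node 0, distance to node 1, ...]}."""
    lines = text.splitlines()
    header = [i for i, line in enumerate(lines) if line.strip() == "node distances:"]
    if not header:
        return {}
    distances = {}
    # Skip the "node distances:" line and the column-header line after it.
    for line in lines[header[0] + 2:]:
        match = re.match(r"\s*(\d+):\s+(.+)$", line)
        if not match:
            break
        distances[int(match.group(1))] = [int(x) for x in match.group(2).split()]
    return distances

print(parse_node_distances(COLAB_OUTPUT))      # {0: [10]}
print(parse_node_distances(TWO_NODE_EXAMPLE))  # {0: [10, 21], 1: [21, 10]}
```

For the Colab VM the table has a single entry, which confirms that there is no remote node to compare against.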


# Memory Bandwidth Measurement

- Allocates a large NumPy array

- Repeatedly reads & writes it

- Measures effective memory bandwidth

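- Shows the performance difference between local and remote memory on multi-node machines (the Colab VM above reports a single node, so only local access is measured here)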


# NUMA Bandwidth Benchmark

In [4]:
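import numpy as np
import time
import os

def memory_bandwidth_test(array_size_gb=1, iterations=5):
    """
    Measure effective memory bandwidth by repeatedly reading one large
    array and writing the scaled result into a second one.
    """
    print(f"PID: {os.getpid()}")
    print(f"Allocating {array_size_gb} GB array")

    num_elements = int((array_size_gb * 1024**3) / 8)  # float64 = 8 bytes
    a = np.ones(num_elements, dtype=np.float64)
    b = np.zeros(num_elements, dtype=np.float64)

    # Warm-up: keeps first-touch page faults on b out of the timed loop
    b[:] = a[:]

    start = time.perf_counter()  # monotonic clock intended for interval timing
    for _ in range(iterations):
        # a[:] * 2.0 is written to a temporary array, then copied into b
        b[:] = a[:] * 2.0
    end = time.perf_counter()

    elapsed = end - start
    # Counts one read of a and one write of b per iteration; traffic to the
    # temporary is not counted, so the result is a lower bound
    total_bytes = array_size_gb * 1024**3 * iterations * 2  # read + write

    # Dividing by 1024**3 gives GiB/s, which the output labels as GB/s
    bandwidth_gbps = total_bytes / elapsed / (1024**3)

    print(f"Time taken: {elapsed:.2f} seconds")
    print(f"Effective Bandwidth: {bandwidth_gbps:.2f} GB/s")

if __name__ == "__main__":
    memory_bandwidth_test(array_size_gb=1, iterations=5)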


PID: 292
Allocating 1 GB array
Time taken: 3.00 seconds
Effective Bandwidth: 3.33 GB/s
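
The reported figure can be reproduced by hand from the printed time. The sketch below repeats the benchmark's arithmetic: one read and one write of the array per iteration, divided by elapsed time and by 1024**3. The helper name `effective_bandwidth_gib_s` and its parameters are hypothetical, and the inputs are the rounded times printed in this notebook.

```python
GIB = 1024**3
MIB = 1024**2

def effective_bandwidth_gib_s(array_bytes, iterations, elapsed_s, passes_per_iteration=2):
    """Bytes counted (read + write per iteration) divided by time, in GiB/s."""
    return array_bytes * iterations * passes_per_iteration / elapsed_s / GIB

# 1 GiB array, 5 iterations, 3.00 s
print(f"{effective_bandwidth_gib_s(1 * GIB, 5, 3.00):.2f}")     # prints 3.33

# 128 MiB array, 10 iterations, 0.78 s
print(f"{effective_bandwidth_gib_s(128 * MIB, 10, 0.78):.2f}")  # prints 3.21
```

Because the divisor is 1024**3, the unit is strictly GiB/s even though the notebook prints it as GB/s.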


# NUMA Effects with Threads

In [5]:
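import numpy as np
import time

def memory_bandwidth_colab(size_mb=512, iterations=10):
    """Run the read + write bandwidth test for one array size (in MB)."""
    size_bytes = size_mb * 1024 * 1024
    num_elements = size_bytes // 8  # float64

    print(f"Array size: {size_mb} MB")

    a = np.ones(num_elements, dtype=np.float64)
    b = np.zeros(num_elements, dtype=np.float64)

    # Warm-up: keeps first-touch page faults on b out of the timed loop
    b[:] = a[:]

    start = time.perf_counter()  # monotonic clock intended for interval timing
    for _ in range(iterations):
        # a[:] * 2.0 is written to a temporary array, then copied into b
        b[:] = a[:] * 2.0
    end = time.perf_counter()

    elapsed = end - start
    # Traffic to the temporary is not counted, so this is a lower bound
    total_bytes = size_bytes * iterations * 2  # read + write
    bandwidth = total_bytes / elapsed / (1024**3)

    print(f"Time taken: {elapsed:.2f} seconds")
    print(f"Effective Memory Bandwidth: {bandwidth:.2f} GB/s\n")

# Try different sizes to check that the bandwidth is stable across them
for size in [128, 256, 512, 1024, 2048]:
    memory_bandwidth_colab(size_mb=size)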


Array size: 128 MB
Time taken: 0.78 seconds
Effective Memory Bandwidth: 3.21 GB/s

Array size: 256 MB
Time taken: 1.58 seconds
Effective Memory Bandwidth: 3.16 GB/s

Array size: 512 MB
Time taken: 2.86 seconds
Effective Memory Bandwidth: 3.49 GB/s

Array size: 1024 MB
Time taken: 6.02 seconds
Effective Memory Bandwidth: 3.32 GB/s

Array size: 2048 MB
Time taken: 11.87 seconds
Effective Memory Bandwidth: 3.37 GB/s
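
Both benchmarks above evaluate `a[:] * 2.0` into a temporary array before copying it into `b`, so each iteration moves more data than the read + write that the formula counts. One way to make the measured traffic match the formula is to let NumPy write the product straight into `b` through the ufunc `out=` argument. This is a sketch of that variant, not the method used for the numbers above; the function name `scale_bandwidth_no_temp` is hypothetical and no results are claimed for it.

```python
import time
import numpy as np

def scale_bandwidth_no_temp(size_mb=512, iterations=10):
    """Return effective bandwidth in GiB/s for b = 2 * a without a temporary."""
    size_bytes = size_mb * 1024 * 1024
    num_elements = size_bytes // 8  # float64 = 8 bytes

    a = np.ones(num_elements, dtype=np.float64)
    b = np.zeros(num_elements, dtype=np.float64)

    # Warm-up pass so page faults on b happen before timing starts
    np.multiply(a, 2.0, out=b)

    start = time.perf_counter()
    for _ in range(iterations):
        np.multiply(a, 2.0, out=b)  # reads a once, writes b once
    elapsed = time.perf_counter() - start

    total_bytes = size_bytes * iterations * 2  # one read + one write per pass
    return total_bytes / elapsed / 1024**3

if __name__ == "__main__":
    for size in [128, 256, 512]:
        print(f"{size} MB: {scale_bandwidth_no_temp(size_mb=size):.2f} GiB/s")
```

Comparing this variant with the original on the same machine indicates how much of the measured time went to the temporary array.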



In [6]:
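import threading
import numpy as np
import time

def worker(arr):
    # Each thread increments the whole shared array 10 times
    for _ in range(10):
        arr[:] += 1

size = 200_000_000  # 200 million float64 values, ~1.6 GB
array = np.zeros(size, dtype=np.float64)

threads = []
start = time.perf_counter()

# All four threads receive the same array, so their updates overlap
# instead of splitting the work between them
for _ in range(4):
    t = threading.Thread(target=worker, args=(array,))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

end = time.perf_counter()
print("Elapsed time:", end - start)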


Elapsed time: 8.79379653930664


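NUMA-aware programs should allocate memory on the node closest to the CPU that uses it.
On multi-node systems, ignoring NUMA can cause significant performance loss even when the code is functionally correct. The Colab VM used here reports a single NUMA node, so every run above measures local access only.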
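
One common way to keep memory close to its user on Linux is to fix where the code runs before it allocates. Under the default placement policy, a page generally ends up on the node of the CPU that first writes to it. The sketch below pins the current process with `os.sched_setaffinity` (a Linux-only call) and then allocates and fills an array. The function name `allocate_after_pinning` is hypothetical, and the CPU set `{0, 1}` is taken from the node 0 line of the `numactl --hardware` output in this notebook. On the single-node Colab VM this changes nothing measurable; it only illustrates the pattern.

```python
import os
import numpy as np

def allocate_after_pinning(cpus, num_elements):
    """Pin the current process to `cpus` when possible, then allocate and touch an array."""
    if hasattr(os, "sched_setaffinity"):  # only available on some platforms (Linux)
        usable = set(cpus) & os.sched_getaffinity(0)
        if usable:
            os.sched_setaffinity(0, usable)
    arr = np.empty(num_elements, dtype=np.float64)
    arr.fill(0.0)  # writing every element touches every page of the array
    return arr

# CPUs 0 and 1 are the ones listed for node 0 above
local = allocate_after_pinning({0, 1}, 1_000_000)
print(local.shape)
```

The same effect can be requested from outside the program with `numactl`, which was installed in the first cell.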