
**Vector Addition (Scalar vs SIMD-like)**

In [1]:
import numpy as np
import time

N = 10_000_000

A = np.arange(N, dtype=np.float64)
B = np.arange(N, dtype=np.float64)

# Scalar loop
C = np.zeros(N)
start = time.time()
for i in range(N):
    C[i] = A[i] + B[i]
end = time.time()
print("Normal loop time:", end - start)

# Vectorized (SIMD-like)
start = time.time()
C = A + B
end = time.time()
print("Vectorized time:", end - start)


Normal loop time: 7.10044527053833
Vectorized time: 0.04449152946472168


EXPLANATION

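The first part adds the arrays element by element in a plain Python for loop. It is slow mainly because of interpreter overhead: every iteration dispatches bytecode, indexes both arrays, and creates temporary scalar objects for a single addition.

The second part uses C = A + B, a NumPy vectorized operation. The whole addition runs inside a compiled loop, which can use SIMD instructions to process several elements per instruction.

In the run above the vectorized version is roughly 160 times faster (7.10 s versus 0.044 s). Most of that gap comes from removing the per-element interpreter overhead; data-level parallelism (SIMD) contributes on top of it.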
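A single time.time() measurement can be noisy, so the comparison is easier to trust when each variant is timed several times. The following is a minimal sketch of that idea, not part of the original assignment: best_of is a hypothetical helper, and N is reduced (an assumption made here) so the Python loop finishes quickly.

```python
import time
import numpy as np

def best_of(fn, repeats=3):
    """Return the smallest wall-clock time (seconds) over `repeats` calls of fn."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

N = 200_000  # assumption: smaller than the 10 million used above
A = np.arange(N, dtype=np.float64)
B = np.arange(N, dtype=np.float64)

def scalar_add():
    # Element-by-element addition driven by the Python interpreter
    C = np.empty(N)
    for i in range(N):
        C[i] = A[i] + B[i]
    return C

def vector_add():
    # Same addition executed inside NumPy's compiled loop
    return A + B

t_scalar = best_of(scalar_add)
t_vector = best_of(vector_add)
print("Scalar loop (best of 3):", t_scalar)
print("Vectorized  (best of 3):", t_vector)
print("Same result:", np.array_equal(scalar_add(), vector_add()))
```

Taking the minimum over a few runs filters out one-off delays such as memory allocation or other processes; the absolute numbers will still vary by machine.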

**Reduction (Sum)**


In [2]:
import numpy as np
import time

N = 10_000_000
A = np.ones(N, dtype=np.float64)

# Normal loop
start = time.time()
s = 0.0
for i in range(N):
    s += A[i]
end = time.time()
print("Normal sum:", s, "Time:", end - start)

# Vectorized reduction
start = time.time()
s = np.sum(A)
end = time.time()
print("Vectorized sum:", s, "Time:", end - start)


Normal sum: 10000000.0 Time: 2.6535258293151855
Vectorized sum: 10000000.0 Time: 0.007744550704956055


EXPLANATION

1. This program creates a NumPy array of 10 million elements, all set to 1.0.
2. The first method computes the sum in a normal Python loop, which is slow because every addition goes through the interpreter, one element at a time.
3. The second method uses np.sum(A), a vectorized reduction that runs in compiled code and can use SIMD instructions. In the run above it is roughly 340 times faster (2.65 s versus 0.0077 s), and both methods return 10000000.0.
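A SIMD-style reduction can be pictured as keeping several independent partial sums ("lanes") and combining them at the end. The sketch below illustrates that idea in NumPy; lane_sum and the choice of 4 lanes are assumptions for illustration and do not describe how np.sum is actually implemented.

```python
import numpy as np

def lane_sum(x, lanes=4):
    """Sum x by keeping `lanes` independent partial sums, then combining them."""
    n_full = (x.size // lanes) * lanes
    # Each column acts as one lane; summing over rows advances all lanes together.
    partial = x[:n_full].reshape(-1, lanes).sum(axis=0)
    # Final "horizontal" add of the lanes, plus any leftover tail elements.
    return partial.sum() + x[n_full:].sum()

A = np.ones(1_000_003, dtype=np.float64)  # length deliberately not a multiple of 4
print("Lane-based sum:", lane_sum(A))
print("np.sum        :", np.sum(A))
```

Because floating-point addition is not associative, a lane-based sum can differ from a strict left-to-right sum in the last bits for general data; with an array of ones both are exact.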

**Memory Alignment Effect**

In [3]:
import numpy as np
import time

N = 10_000_000

unaligned = np.arange(N + 1, dtype=np.float64)[1:]
aligned = np.arange(N, dtype=np.float64)

start = time.time()
np.sum(unaligned)
print("Unaligned time:", time.time() - start)

start = time.time()
np.sum(aligned)
print("Aligned time:", time.time() - start)


Unaligned time: 0.008980274200439453
Aligned time: 0.007965087890625


Explanation

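The program creates two arrays: one allocated directly by np.arange, and one obtained by slicing off the first element of a slightly larger array. The slice starts 8 bytes (one float64) after the start of its parent buffer, so it is still aligned for float64 but may no longer be aligned to a wider SIMD or cache-line boundary such as 32 or 64 bytes.

It measures the time taken to compute the sum of both arrays using np.sum().

In the run above the aligned array is slightly faster (0.0080 s versus 0.0090 s). SIMD loads can be more efficient on aligned memory, but a difference of about 1 ms from a single timing is within normal run-to-run noise, and the first call also pays one-time warm-up costs, so this result alone does not establish an alignment effect.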
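How far the sliced array is shifted can be checked directly from the data pointers instead of being inferred from timings. This is a minimal sketch; address_mod is a hypothetical helper, and the 64-byte boundary is an assumed cache-line size.

```python
import numpy as np

def address_mod(arr, boundary=64):
    """Return the start address of arr's data modulo `boundary` bytes."""
    return arr.ctypes.data % boundary

N = 1_000_000
base = np.arange(N + 1, dtype=np.float64)
shifted = base[1:]                      # view starting one float64 later
fresh = np.arange(N, dtype=np.float64)  # separately allocated array

print("base    address mod 64:", address_mod(base))
print("shifted address mod 64:", address_mod(shifted))
print("fresh   address mod 64:", address_mod(fresh))
print("shift in bytes        :", shifted.ctypes.data - base.ctypes.data)
print("shifted.flags.aligned :", shifted.flags.aligned)
```

The printed remainders depend on the allocator, so they can change between runs and platforms. The shift of the slice relative to its parent is always one element, and flags.aligned only reports alignment to the element type, not to a SIMD register width.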

**Parallel + SIMD (Implicit)**

In [4]:
import numpy as np
import time

N = 10_000_000
A = np.arange(N, dtype=np.float64)

start = time.time()
B = A * 2.0
print("Vectorized (SIMD + multithreaded) time:", time.time() - start)


Vectorized (SIMD + multithreaded) time: 0.026386737823486328


Explanation

The program creates a large NumPy array of 10 million floating-point numbers.

The operation B = A * 2.0 performs element-wise multiplication using NumPy’s vectorized implementation.

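This computation runs fast because the multiplication executes in a compiled loop that can use SIMD instructions. NumPy's element-wise operations such as this one generally run on a single thread; multithreading in NumPy comes mainly from the BLAS/LAPACK libraries behind linear-algebra routines such as np.dot. The "multithreaded" label in the printed message therefore overstates what happens here.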
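To make the multithreading explicit rather than relying on NumPy's internals, the array can be split into chunks that are processed by a thread pool. The sketch below assumes that NumPy releases the GIL inside its element-wise loops for numeric dtypes, which is what allows the threads to overlap; threaded_scale and the default of 4 threads are hypothetical choices for illustration.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_scale(a, factor, n_threads=4):
    """Multiply a by factor, splitting the work into n_threads chunks."""
    out = np.empty_like(a)
    # Chunk boundaries: n_threads + 1 integer cut points from 0 to a.size
    bounds = np.linspace(0, a.size, n_threads + 1, dtype=int)

    def work(k):
        lo, hi = bounds[k], bounds[k + 1]
        # Each task writes to its own slice of `out`, so no locking is needed.
        np.multiply(a[lo:hi], factor, out=out[lo:hi])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, range(n_threads)))  # list() surfaces any exception
    return out

A = np.arange(1_000_000, dtype=np.float64)
B = threaded_scale(A, 2.0)
print("Matches A * 2.0:", np.array_equal(B, A * 2.0))
```

Whether this is faster than a plain A * 2.0 depends on the machine: a single multiplication per element is often limited by memory bandwidth rather than arithmetic, so extra threads may bring little or no gain.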

**Branch Divergence**

In [5]:
import numpy as np
import time

N = 10_000_000
A = np.random.rand(N) * 100
B = np.zeros(N)

start = time.time()
for i in range(N):
    if A[i] > 50:
        B[i] = A[i] * 2
    else:
        B[i] = A[i] / 2
print("Branch loop time:", time.time() - start)


Branch loop time: 6.728229284286499


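The program creates a large array with random values and processes each element using a normal Python loop with an if-else condition.

Written this way, the loop gets no SIMD benefit at all: each element is handled by the interpreter. The data-dependent branch, taken about half the time for uniformly random values, would also make such a loop harder to vectorize in compiled code.

As a result, this loop takes several seconds, far slower than a vectorized implementation that expresses the condition as an array operation.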

In [6]:
start = time.time()
B = np.where(A > 50, A * 2, A / 2)
print("Vectorized conditional time:", time.time() - start)


Vectorized conditional time: 0.17036724090576172


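This code replaces the slow loop with a fully vectorized operation using np.where.

The condition and both computations are applied to whole arrays instead of element by element: A > 50, A * 2, and A / 2 are each evaluated for every element, and np.where then selects between the two results.

In the run above it is roughly 40 times faster than the loop (0.17 s versus 6.73 s), even though it does extra arithmetic and allocates temporary arrays, because all of the work runs in compiled loops without a per-element Python branch.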
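The same conditional can be written in other vectorized forms that avoid evaluating both branches for every element. The following sketch compares two such variants against np.where on a smaller array; the seed, the array size, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
A = rng.random(1_000_000) * 100

# Approach used above: both branches are computed for every element.
B_where = np.where(A > 50, A * 2, A / 2)

# Variant 1: compute one branch everywhere, then overwrite the masked part.
B_mask = A / 2
mask = A > 50
B_mask[mask] = A[mask] * 2

# Variant 2: a single multiply by a per-element factor (2.0 or 0.5).
B_factor = A * np.where(A > 50, 2.0, 0.5)

print("mask variant matches  :", np.array_equal(B_where, B_mask))
print("factor variant matches:", np.array_equal(B_where, B_factor))
```

Which variant is fastest depends on the array size and on how many elements satisfy the condition, so it is worth timing them on the actual data before choosing one.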