<a href="https://colab.research.google.com/github/2303A52060/High-performace-computing/blob/main/H_P_C_ASS_3ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name: M.Bharath

Roll no: 2303A52060

Batch no: 39

A climate modeling application requires large matrix multiplications, making Python loops a bottleneck.

Objective
Parallelize nested loops using Numba prange.

Tasks
1. Implement serial matrix multiplication.
2. Parallelize the outer loop.
3. Parallelize using collapsed loops logic.
4. Analyze cache behavior and performance.

Learning Outcomes
- Nested loop parallelism
- Memory access patterns
- Parallel overhead in Python
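
Task 3 asks for collapsed-loop logic. One way to sketch it, assuming prange parallelizes only the loop it is applied to, is to flatten the (i, j) index space into a single loop of n * p iterations and recover i and j by integer division and remainder. The function name matmul_parallel_collapsed is hypothetical and is not part of the submitted code. The fallback branch is an added assumption so that the sketch still runs, serially, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption for portability: without Numba, run the same loops in plain Python.
    prange = range

    def njit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap


@njit(parallel=True)
def matmul_parallel_collapsed(A, B):
    n, m = A.shape
    m2, p = B.shape
    assert m == m2

    C = np.zeros((n, p), dtype=np.float64)

    # One flat loop over all n * p output elements instead of nested i/j loops.
    for idx in prange(n * p):
        i = idx // p   # row index recovered from the flat index
        j = idx % p    # column index recovered from the flat index
        tmp = 0.0
        for k in range(m):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp  # each iteration writes a distinct element, so no race
    return C


rng = np.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((3, 5))
print(np.allclose(matmul_parallel_collapsed(A, B), A @ B))
```

Flattening gives the scheduler n * p work items rather than n, which may balance load better when n is small relative to the number of threads. Whether it helps in practice depends on the matrix shape and should be measured.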

import numpy as np
from numba import njit, prange
import time

# ---------------- Serial Implementation ----------------
@njit
def matmul_serial(A, B):
    n, m = A.shape
    m2, p = B.shape
    assert m == m2

    C = np.zeros((n, p), dtype=np.float64)

    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C


# ---------------- Parallel Outer Loop ----------------
@njit(parallel=True)
def matmul_parallel_outer(A, B):
    n, m = A.shape
    m2, p = B.shape
    assert m == m2

    C = np.zeros((n, p), dtype=np.float64)

    for i in prange(n):
        for j in range(p):
            for k in range(m):
                C[i, j] += A[i, k] * B[k, j]
    return C


# ---------------- Cache-Optimized Parallel ----------------
@njit(parallel=True, fastmath=True)
def matmul_parallel_cache_opt(A, B):
    n, m = A.shape
    m2, p = B.shape
    assert m == m2

    # B.T alone is only a strided view of the same memory; copy it so each row of B_T is contiguous
    B_T = np.ascontiguousarray(B.T)
    C = np.zeros((n, p), dtype=np.float64)

    for i in prange(n):
        for j in range(p):
            tmp = 0.0
            for k in range(m):
                tmp += A[i, k] * B_T[j, k]
            C[i, j] = tmp

    return C


# ---------------- Main Program ----------------
if __name__ == "__main__":

    # -------- User Input --------
    n = int(input("Enter number of rows for Matrix A: "))
    m = int(input("Enter number of columns for Matrix A: "))
    p = int(input("Enter number of columns for Matrix B: "))

    print("\nEnter elements of Matrix A row-wise:")
    A = np.zeros((n, m), dtype=np.float64)
    for i in range(n):
        A[i, :] = list(map(float, input(f"Row {i+1}: ").split()))

    print("\nEnter elements of Matrix B row-wise:")
    B = np.zeros((m, p), dtype=np.float64)
    for i in range(m):
        B[i, :] = list(map(float, input(f"Row {i+1}: ").split()))

    # -------- Warm-up (Numba compilation) --------
    matmul_serial(A, B)
    matmul_parallel_outer(A, B)
    matmul_parallel_cache_opt(A, B)

    # -------- Timing Serial --------
    start = time.perf_counter()
    C_serial = matmul_serial(A, B)
    t_serial = time.perf_counter() - start

    # -------- Timing Parallel Outer --------
    start = time.perf_counter()
    C_outer = matmul_parallel_outer(A, B)
    t_outer = time.perf_counter() - start

    # -------- Timing Cache-Optimized Parallel --------
    start = time.perf_counter()
    C_opt = matmul_parallel_cache_opt(A, B)
    t_opt = time.perf_counter() - start

    # -------- Results --------
    print("\nResult (Serial):")
    print(C_serial)

    print("\nResult (Parallel Outer Loop):")
    print(C_outer)

    print("\nResult (Cache-Optimized Parallel):")
    print(C_opt)

    # -------- Performance --------
    print("\nExecution Time:")
    print(f"Serial: {t_serial:.6f} seconds")
    print(f"Parallel Outer: {t_outer:.6f} seconds")
    print(f"Cache-Optimized Parallel: {t_opt:.6f} seconds")

    # -------- Correctness Check --------
    print("\nCorrectness Check:")
    print("Serial vs Outer Parallel:", np.allclose(C_serial, C_outer))
    print("Serial vs Optimized:", np.allclose(C_serial, C_opt))


Enter number of rows for Matrix A: 3
Enter number of columns for Matrix A: 3
Enter number of columns for Matrix B: 3

Enter elements of Matrix A row-wise:
Row 1: 1 2 3
Row 2: 1 3 4
Row 3: 1 6 9

Enter elements of Matrix B row-wise:
Row 1: 2 3 4 
Row 2: 2 10 14
Row 3: 23 26 27

Result (Serial):
[[ 75. 101. 113.]
 [100. 137. 154.]
 [221. 297. 331.]]

Result (Parallel Outer Loop):
[[ 75. 101. 113.]
 [100. 137. 154.]
 [221. 297. 331.]]

Result (Cache-Optimized Parallel):
[[ 75. 101. 113.]
 [100. 137. 154.]
 [221. 297. 331.]]

Execution Time:
Serial: 0.000011 seconds
Parallel Outer: 0.000024 seconds
Cache-Optimized Parallel: 0.000013 seconds

Correctness Check:
Serial vs Outer Parallel: True
Serial vs Optimized: True
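
For the 3x3 input above, every version finishes in a few microseconds, and the parallel outer-loop version is slower than the serial one (0.000024 s versus 0.000011 s). This is consistent with thread start-up and scheduling overhead dominating such a small problem. A single timed run is also noisy. The sketch below shows one way to time larger random matrices more robustly by taking the best of several runs. The helper names best_time and speedup_and_efficiency are hypothetical, and np.matmul stands in for the functions under test so that the block runs on its own.

```python
import time
import numpy as np


def best_time(fn, *args, repeats=5):
    """Return the smallest wall-clock time of fn(*args) over several runs."""
    fn(*args)  # warm-up call; for a Numba function this triggers compilation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def speedup_and_efficiency(t_serial, t_parallel, n_threads):
    """Speedup = t_serial / t_parallel; efficiency = speedup / n_threads."""
    speedup = t_serial / t_parallel
    return speedup, speedup / n_threads


rng = np.random.default_rng(0)
for size in (64, 128, 256):
    A = rng.random((size, size))
    B = rng.random((size, size))
    # np.matmul is a stand-in; pass matmul_serial, matmul_parallel_outer, etc. instead.
    t = best_time(np.matmul, A, B)
    print(f"n = {size}: {t:.6f} seconds")
```

Passing the serial and parallel functions to best_time for the same size, then calling speedup_and_efficiency with the thread count in use, gives the figures needed for the performance analysis in Task 4.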

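
The cache behavior of the transposed variant depends on how the transposed matrix is actually laid out in memory. As a small check, assuming NumPy's default row-major (C-order) arrays, the sketch below compares a plain transpose, which is a strided view over the original buffer, with a contiguous copy of the transpose. The variable names B_view and B_copy are illustrative only.

```python
import numpy as np

B = np.arange(12, dtype=np.float64).reshape(3, 4)

B_view = B.T                        # transposed view, no data is copied
B_copy = np.ascontiguousarray(B.T)  # transposed copy stored in row-major order

# The view shares B's buffer, so walking along one of its rows still
# jumps through memory with the original row stride.
print("view shares memory with B:", np.shares_memory(B, B_view))
print("view is C-contiguous:     ", B_view.flags["C_CONTIGUOUS"])
print("copy is C-contiguous:     ", B_copy.flags["C_CONTIGUOUS"])

# Strides are in bytes per step along each axis.
print("strides of B, view, copy: ", B.strides, B_view.strides, B_copy.strides)
```

Only the contiguous copy makes the inner k loop of a dot product read consecutive memory addresses from both operands. The copy itself costs one extra pass over B, which is small compared with the multiplication for large matrices.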