# DAY 18: Matrix Multiplication using cuBLAS

This notebook demonstrates matrix multiplication using the cuBLAS library's SGEMM (Single-precision GEneral Matrix Multiply) function.

## Key Concepts:
- cuBLAS SGEMM operation
- Matrix multiplication: C = α*A*B + β*C
- Column-major storage in cuBLAS
- GPU memory management for matrices
- Handle-based cuBLAS API
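
To make the column-major point concrete, the sketch below contrasts the two index formulas. The helper names `row_major_index` and `col_major_index` are illustrative and not part of cuBLAS; `ld` stands for the leading dimension, meaning the stride between consecutive rows (row-major) or consecutive columns (column-major).

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; not part of cuBLAS. */

/* Row-major: the elements of one row are contiguous.
   ld must be at least the number of columns. */
size_t row_major_index(size_t row, size_t col, size_t ld) {
    return row * ld + col;
}

/* Column-major (the cuBLAS convention): the elements of one column
   are contiguous. ld must be at least the number of rows. */
size_t col_major_index(size_t row, size_t col, size_t ld) {
    return row + col * ld;
}
```

For the 2×3 matrix C used in this notebook, element (0, 1) sits at offset `col_major_index(0, 1, 2)`, which is 2, in column-major storage, whereas a row-major layout would place it at `row_major_index(0, 1, 3)`, which is 1.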

## Matrix Dimensions:
- Matrix A: 2×4
- Matrix B: 4×3
- Matrix C: 2×3 (result)

In [None]:
%%writefile MatrixMultiplicationCublas.cu
// nvcc MatrixMultiplicationCublas.cu -o MatrixMultiplicationCublas -lstdc++ -lcublas

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>  // malloc, free

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Dimensions: A is M x K, B is K x N, C is M x N.
    int M = 2, N = 3, K = 4;
    float *h_A, *h_B, *h_C;
    h_A = (float *)malloc(M * K * sizeof(float));
    h_B = (float *)malloc(K * N * sizeof(float));
    h_C = (float *)malloc(M * N * sizeof(float));

    // Fill A and B in row-major order: element (i, j) is at index i * cols + j.
    for (int i = 0; i < M; i++)
        for (int j = 0; j < K; j++)
            h_A[i * K + j] = i + j;

    for (int i = 0; i < K; i++)
        for (int j = 0; j < N; j++)
            h_B[i * N + j] = i + j;

    // Allocate device buffers and copy the inputs to the GPU.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, M * K * sizeof(float));
    cudaMalloc(&d_B, K * N * sizeof(float));
    cudaMalloc(&d_C, M * N * sizeof(float));

    cudaMemcpy(d_A, h_A, M * K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, K * N * sizeof(float), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
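    // h_A and h_B were filled row-major, but cuBLAS reads column-major.
    // A row-major M x K buffer read column-major is the K x M transpose,
    // so pass CUBLAS_OP_T with the row lengths as leading dimensions
    // (lda = K, ldb = N). C is written column-major with ldc = M.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_T,
                M, N, K, &alpha,
                d_A, K, d_B, N,
                &beta, d_C, M);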

    cudaMemcpy(h_C, d_C, M * N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("Matrix A:\n");
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < K; j++) {
            printf("%f ", h_A[i * K + j]);
        }
        printf("\n");
    }

    printf("Matrix B:\n");
    for (int i = 0; i < K; i++) {
        for (int j = 0; j < N; j++) {
            printf("%f ", h_B[i * N + j]);
        }
        printf("\n");
    }

    // C is column-major with leading dimension M: element (i, j) is at i + j * M.
    printf("Matrix C = A * B:\n");
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            printf("%f ", h_C[i + j * M]);
        }
        printf("\n");
    }

    free(h_A); free(h_B); free(h_C);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cublasDestroy(handle);
    return 0;
}
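
The program fills `h_A` and `h_B` row by row, whereas cuBLAS interprets device buffers column by column. One way to reconcile the two, sketched here under the assumption that an extra host-side copy is acceptable, is to repack each input before the `cudaMemcpy`; the repacked buffers then match a `CUBLAS_OP_N` call with `lda = M` and `ldb = K`. The helper name `row_to_col_major` is hypothetical. Passing `CUBLAS_OP_T` to `cublasSgemm` is the copy-free alternative.

```c
#include <stddef.h>

/* Hypothetical helper: copy a rows x cols matrix from row-major `src`
   into column-major `dst`. Both buffers hold rows * cols floats and
   must not overlap. */
void row_to_col_major(const float *src, float *dst,
                      size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[i + j * rows] = src[i * cols + j];
}
```

For the 2×4 matrix A above, the row-major buffer `{0, 1, 2, 3, 1, 2, 3, 4}` becomes `{0, 1, 1, 2, 2, 3, 3, 4}` after repacking.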

In [None]:
# Compile and run the cuBLAS matrix multiplication program
!nvcc MatrixMultiplicationCublas.cu -o MatrixMultiplicationCublas -lstdc++ -lcublas
!./MatrixMultiplicationCublas

## Output:
```
Matrix A:
0.000000 1.000000 2.000000 3.000000 
1.000000 2.000000 3.000000 4.000000 

Matrix B:
0.000000 1.000000 2.000000 
1.000000 2.000000 3.000000 
2.000000 3.000000 4.000000 
3.000000 4.000000 5.000000 

Matrix C = A * B:
14.000000 20.000000 26.000000 
20.000000 30.000000 40.000000 
```
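
A result like this is easy to check against a plain CPU loop. The sketch below is one possible reference, not part of the original program: `reference_gemm` and `max_abs_diff` are hypothetical names, the inputs are assumed to be row-major as in the fill loops, and the output is written column-major so it can be compared directly with `h_C`.

```c
#include <stddef.h>

/* Hypothetical CPU reference: C = A * B with A (m x k) and B (k x n)
   in row-major order and C (m x n) written in column-major order. */
void reference_gemm(const float *a, const float *b, float *c,
                    size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++)
                sum += a[i * k + p] * b[p * n + j];
            c[i + j * m] = sum;
        }
    }
}

/* Largest absolute element-wise difference between two buffers. */
float max_abs_diff(const float *x, const float *y, size_t count) {
    float worst = 0.0f;
    for (size_t i = 0; i < count; i++) {
        float d = x[i] - y[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

In the program, one might call `reference_gemm(h_A, h_B, h_ref, M, N, K)` after the device-to-host copy, where `h_ref` is an extra host buffer of `M * N` floats, and then require `max_abs_diff(h_C, h_ref, M * N)` to stay below a small tolerance such as `1e-5f`.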

## Algorithm Explanation:
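The cuBLAS SGEMM function performs: **C = α*A*B + β*C**
- α (alpha) = 1.0 and β (beta) = 0.0, so the call reduces to C = A*B and C does not need to be initialized beforehand
- Matrix A (2×4) multiplied by Matrix B (4×3) = Matrix C (2×3)
- cuBLAS uses column-major storage, so element (i, j) of the 2×3 result is at `h_C[i + j * M]`; this is why the print loop indexes C differently from A and B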
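
Written element by element, with zero-based indices and $p$ as the summation index (a symbol introduced here for illustration), the operation is:

$$
c_{ij} \leftarrow \alpha \sum_{p=0}^{K-1} a_{ip}\, b_{pj} + \beta\, c_{ij}
$$

With $\alpha = 1$, $\beta = 0$, and the fill pattern $a_{ip} = i + p$, $b_{pj} = p + j$ from the program, two entries of the result work out as:

$$
c_{00} = 0\cdot 0 + 1\cdot 1 + 2\cdot 2 + 3\cdot 3 = 14, \qquad
c_{12} = 1\cdot 2 + 2\cdot 3 + 3\cdot 4 + 4\cdot 5 = 40
$$

These match the top-left and bottom-right values in the output above.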

## Key Learning Points:
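1. **SGEMM Parameters**: M and N are the row and column counts of C; K is the inner dimension shared by A (M×K) and B (K×N)
2. **Memory Layout**: cuBLAS expects column-major format, with a leading dimension (`lda`, `ldb`, `ldc`) giving the stride between consecutive columns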
3. **Performance**: cuBLAS provides highly optimized GPU matrix operations
4. **Scalability**: Suitable for large-scale linear algebra computations
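
To get a rough feel for how the work and memory footprint grow with M, N, and K, the following sketch may help. Both helper names are hypothetical, the operation count is the usual approximation of one multiply and one add per inner-loop step, and the byte count assumes densely packed matrices with no padding.

```c
#include <stddef.h>

/* Hypothetical helpers for back-of-the-envelope sizing. */

/* Approximate floating-point operations for C = alpha*A*B + beta*C:
   one multiply and one add for each (i, j, p) triple. */
double sgemm_flops(size_t m, size_t n, size_t k) {
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Bytes needed for densely packed A (m x k), B (k x n), and C (m x n). */
size_t sgemm_bytes(size_t m, size_t n, size_t k) {
    return (m * k + k * n + m * n) * sizeof(float);
}
```

For the 2×3×4 example in this notebook, `sgemm_flops(2, 3, 4)` is 48 and `sgemm_bytes(2, 3, 4)` covers 26 floats, so launch and transfer overhead will likely dominate; the benefit of cuBLAS shows up at much larger sizes.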