# Week 6 ME on BLAS and Optimized MMM
In this Machine Exercise, you will measure the performance of matrix-matrix multiplication (MMM) with and without blocking to see how blocking can improve it. You will then compare the performance of the blocked matrix-multiply code with NumPy's built-in matrix multiply, which uses BLAS.

The objectives of this machine exercise are:
* for you to observe the effect of blocking on the performance of MMM
* for you to observe how the performance of MMM changes when BLAS is used

# Part I. Benchmarking MMM with and without Blocking
The code below implements matrix-matrix multiplication between two square matrices. The `matmult` function implements a basic "ijk" matrix multiplication. The `matmultblk` function implements matrix multiplication with blocking.

The benefit of a blocking algorithm is more noticeable with larger matrices. Furthermore, blocking only improves performance if a suitable block size is selected.

Observe the performance improvement provided by blocking for varying matrix sizes `n` (we are using n-by-n matrices) and values of `N`, the number of blocks per dimension (each block is (n/N)x(n/N)). Warning: larger matrices will take longer to run. ***NOTE: The code provided only works when `n` is exactly divisible by `N`. Please select your `n` and `N` values accordingly, or you may edit the code to remove this limitation.***
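As a minimal sketch of one way to remove the divisibility limitation mentioned in the note, the inner loop bounds can be clipped with `min()` so that the last block in each dimension may be smaller. The function name `matmultblk_any` and its `blk` parameter (the block edge length, passed directly instead of `N`) are assumptions made for illustration and are not part of the provided code.

```python
def matmultblk_any(a, b, c, n, blk):
    """Blocked MMM for any block edge length blk >= 1 (hypothetical variant).

    The last block in the j and k dimensions is clipped to the matrix edge,
    so n does not have to be divisible by blk.
    """
    if blk < 1:
        raise ValueError("blk must be at least 1")
    for kk in range(0, n, blk):
        k_end = min(kk + blk, n)      # clip the last k-block
        for jj in range(0, n, blk):
            j_end = min(jj + blk, n)  # clip the last j-block
            for i in range(n):
                for j in range(jj, j_end):
                    for k in range(kk, k_end):
                        c[i][j] += a[i][k] * b[k][j]
    return c


# Small check: multiplying by the identity matrix must return the input.
n = 5
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
c = [[0.0] * n for _ in range(n)]
result = matmultblk_any(a, b, c, n, blk=2)  # 5 is not divisible by 2
print(result == a)
```

With `n = 5` and `blk = 2`, the blocks along each dimension have edge lengths 2, 2, and 1, which the original `matmultblk` cannot express.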

Run experiments for the scenarios below. Record the values of `n` and `N` and the runtimes that you observe:


*   Small matrices (small `n`; ensure `N < n`)
*   Large matrices (large `n`; try increasing values of `N`)


After experimenting with the code below, answer the following questions:

1. As you increase the value of `N`, how do your block sizes change?
2. For small matrices, does the algorithm with blocking perform better?
3. Why does blocking affect, or fail to affect, the performance of the multiplication?
4. For large matrices, how does the value of `N` affect the performance of the algorithm with blocking?
5. Why does `N` influence the amount of optimization achieved by blocking?
6. **BONUS** [optional] Choose another way of optimizing MMM and write your own implementation of it. Analyze the performance of your code against the simple ijk algorithm and the algorithm with blocking.

In [22]:
#!/usr/bin/env python

import random
from time import time
import math

random.seed(0)

def matmult(a,b,c,n):
  for i in range(n):
    for j in range(n):
      for k in range(n):
        c[i][j] = c[i][j] + a[i][k]*b[k][j]
  return c

def matmultblk(a,b,c,n,N):
# divide the nxn matrix into an NxN grid of subblocks
# each block is (n/N)x(n/N); n must be divisible by N
  blk = n // N
  for kk in range(0,n,blk): 
    #print("kk: ",kk)
    for jj in range(0,n,blk):
      #print("jj: ",jj)
      for i in range(n):
        for j in range(jj,jj+blk):
          for k in range(kk,kk+blk):
            #print("A[%d][%d], B[%d][%d]" %(i,k,k,j))
            c[i][j] = c[i][j] + a[i][k]*b[k][j]
        #print(" ")
  return c

def init_matrix(mat,n,value):
  for x in range(n):
    new = []
    for y in range(n):
      new.append(value)
    mat.append(new)

def init_rand(mat,n):
  for x in range(n):
    new = []
    for y in range(n):
      new.append(random.random())
    mat.append(new)

def fill_zero(mat,n):
  for x in range(n):
    for y in range(n):
      mat[x][y] = 0

#@title Select the matrix and block sizes such that n is divisible by N
n = 512 #@param{type: "number"}
N = 1 #@param{type: "number"}
#Small matrix- n=32, N=4; Large matrix- n=512, N=32

A = []
B = []
C = []

init_rand(A,n)
init_rand(B,n)
init_matrix(C,n,0)

t = time()
matmult(A,B,C,n)
runtime = time() - t
c = [row[:] for row in C] #copy the result; matmult returns C itself, which is reset and reused below
print("MMM w/o blocking completed in %f seconds" % (runtime))
#print(C)

fill_zero(C,n) #reset C to zeros

t = time()
c_block = matmultblk(A,B,C,n,N)
runtime = time() - t
print("MMM with blocking completed in %f seconds" % (runtime))
#print(C)

print("Verifying if blocking and nonblocking algorithms have equal results...")
if c == c_block:
  print("OK!")
else:
  print("Not ok.")

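To record `n`, `N`, and the runtimes for the scenarios listed earlier, a small harness along the following lines may help. It is only a sketch: `time_once`, `zeros`, and `run_experiments` are hypothetical helper names, the two multiplication functions are restated so the block runs on its own, and the `(n, N)` pairs are small placeholders that you would replace with your own.

```python
import random
from time import perf_counter


def matmult(a, b, c, n):
    # basic ijk multiplication, restated from the cell above
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmultblk(a, b, c, n, N):
    # blocked multiplication with an N x N grid of (n/N) x (n/N) blocks
    blk = n // N
    for kk in range(0, n, blk):
        for jj in range(0, n, blk):
            for i in range(n):
                for j in range(jj, jj + blk):
                    for k in range(kk, kk + blk):
                        c[i][j] += a[i][k] * b[k][j]
    return c


def zeros(n):
    return [[0.0] * n for _ in range(n)]


def time_once(func, *args):
    # perf_counter is a high-resolution clock intended for interval timing
    start = perf_counter()
    func(*args)
    return perf_counter() - start


def run_experiments(cases):
    """cases: iterable of (n, N) pairs where n is divisible by N."""
    rows = []
    for n, N in cases:
        a = [[random.random() for _ in range(n)] for _ in range(n)]
        b = [[random.random() for _ in range(n)] for _ in range(n)]
        t_plain = time_once(matmult, a, b, zeros(n), n)
        t_blk = time_once(matmultblk, a, b, zeros(n), n, N)
        rows.append((n, N, n // N, t_plain, t_blk))
    return rows


random.seed(0)
rows = run_experiments([(32, 4), (64, 4), (64, 16)])  # placeholder sizes
print(f"{'n':>5} {'N':>4} {'block':>6} {'ijk (s)':>10} {'blocked (s)':>12}")
for n, N, blk, t_plain, t_blk in rows:
    print(f"{n:>5} {N:>4} {blk:>6} {t_plain:>10.4f} {t_blk:>12.4f}")
```

Each algorithm gets a fresh zero matrix, so neither run can be affected by leftovers from the other.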

# Part II. Is NumPy Optimized for Speed?
## A. Determine what BLAS library is being used by NumPy
To determine the BLAS library used by NumPy, you can run the `show_config()` function as shown in the code below.

In [34]:
#!/usr/bin/env python

import numpy as np
np.show_config()

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['openblas\\lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['openblas\\lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['openblas\\lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['openblas\\lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['openblas\\lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['openblas\\lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['openblas\\lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BL

1. Which library is NumPy configured to use?
2. Try the same command on your own machine. Which library is NumPy configured to use on your machine?

## B. Compare the performance of our MMM with blocking and NumPy's MMM
The code below performs matrix multiplication in two ways:
* using the previous code with blocking
* using NumPy's built-in `matmul()` function

Run the code for various values of `n` and `N` and answer the questions below.

1. What is the difference in performance of `matmul()` and the provided matrix multiply code with blocking? Is it a significant difference?
2. Can you observe this difference in performance for all sizes of matrices?
3. What are the reasons behind the difference in performance?
4. How attainable would it be to write your own code that has comparable performance with Numpy's `matmul()`?
5. If you were writing an application that performs a lot of linear algebra computations, how should you reorder your code to optimize its performance?

In [46]:
#!/usr/bin/env python

import numpy as np
import random
from time import time
import math
import copy

random.seed(0)

def init_matrix(mat,n,value):
  for x in range(n):
    new = []
    for y in range(n):
      new.append(value)
    mat.append(new)

def init_rand(mat,n):
  for x in range(n):
    new = []
    for y in range(n):
      new.append(random.random())
    mat.append(new)

def matmultblk(a,b,c,n,N):
# divide the nxn matrix into an NxN grid of subblocks
# each block is (n/N)x(n/N); n must be divisible by N
  blk = n // N
  for kk in range(0,n,blk): 
    #print("kk: ",kk)
    for jj in range(0,n,blk):
      #print("jj: ",jj)
      for i in range(n): 
        for j in range(jj,jj+blk):
          for k in range(kk,kk+blk):
            #print("A[%d][%d], B[%d][%d]" %(i,k,k,j))
            c[i][j] = c[i][j] + a[i][k]*b[k][j]
        #print(" ")
  return c

#@title Select the matrix size and block size such that n is divisible by N
n = 32 #@param{type: "number"}
N =  4 #@param{type: "number"}
#Small matrix- n=32, N=4; Large matrix- n=512, N=32

A = []
B = []
C = []

init_rand(A,n)
init_rand(B,n)
init_matrix(C,n,0)

t = time()
c_block = matmultblk(A,B,C,n,N)
runtime = time() - t
print("MMM with blocking completed in %f seconds" % (runtime))

t = time()
c_np = np.matmul(A,B)
runtime = time() - t
print("Numpy matmul() completed in %f seconds" % (runtime))


MMM with blocking completed in 0.002997 seconds
Numpy matmul() completed in 0.000000 seconds
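A reported time of 0.000000 seconds most likely means the call finished faster than a single `time()` measurement could resolve on that machine. One possible way to get a usable number, sketched below under that assumption, is to repeat the call many times with `timeit` and report the average. The sketch also creates the inputs as NumPy arrays up front, so the timed region does not include converting Python lists to arrays. The names `A_np`, `B_np`, and `repeats` are illustrative and not from the code above.

```python
import timeit

import numpy as np

n = 32
rng = np.random.default_rng(0)

# Create the inputs as arrays before timing, so list-to-array
# conversion is not counted as part of the multiplication.
A_np = rng.random((n, n))
B_np = rng.random((n, n))

repeats = 1000  # arbitrary; raise it if the result is still noisy
total = timeit.timeit(lambda: np.matmul(A_np, B_np), number=repeats)
per_call = total / repeats

print("Numpy matmul() averaged %.3e seconds per call" % per_call)
```

Scientific notation (`%.3e`) is used because `%f` rounds to six decimal places and would hide times in the microsecond range.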


## C. Faster Matrix Multiplications
Check out some of the tips for speeding up matrix multiplications in NumPy on [this reference](https://www.benjaminjohnston.com.au/matmul). Research other ways of speeding up matrix-matrix multiplications in Python using BLAS and other libraries. **Write your own code to experiment with an alternative way of performing matrix-matrix multiplication using available libraries. Try to find ways to further speed up the performance of matrix-matrix multiplication versus numpy.matmul().**

Benchmark/compare the speed/performance and, if possible, accuracy of your method with the methods in the previous section (code with blocking, numpy matmul). Answer the questions below. (Please include your code for this section in your SE documentation)


1.   What alternative method for performing matrix-matrix multiplication did you try? Explain how you tried to speed up MMM with this method.
2.   In your experiments, were you able to speed up matrix-matrix multiplication compared to numpy.matmul()?
     *   If yes, did this speed-up apply to all sizes of matrices?
     *   If the approach you tested performed worse, what do you think may be the reason for poorer performance?
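For the accuracy part of the comparison, one option is to treat a double-precision `np.matmul` result as the reference and report the largest absolute and relative differences of the alternative method from it. The sketch below uses the same product computed in single precision as a stand-in for "your method"; that stand-in and the variable names are assumptions for illustration only.

```python
import numpy as np

n = 64
rng = np.random.default_rng(0)
A64 = rng.random((n, n))  # float64 inputs
B64 = rng.random((n, n))

# Reference result in double precision.
ref = np.matmul(A64, B64)

# Stand-in for an alternative method: the same product in float32.
alt = np.matmul(A64.astype(np.float32), B64.astype(np.float32))

abs_err = np.abs(alt - ref)
max_abs_err = abs_err.max()
max_rel_err = (abs_err / np.abs(ref)).max()  # entries of ref are positive here

print("dtype of alternative result:", alt.dtype)
print("max abs error: %.3e" % max_abs_err)
print("max rel error: %.3e" % max_rel_err)
print("allclose at default tolerances:", np.allclose(alt, ref))
```

Reporting both error measures alongside the runtime makes it easier to tell whether a faster method is trading precision for speed.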




In [21]:
import numpy as np
import scipy.linalg.blas
import array
import random
from time import time
import math
import copy

random.seed(0)

def init_matrix(mat,n,value):
  for x in range(n):
    new = []
    for y in range(n):
      new.append(value)
    mat.append(new)

def init_rand(mat,n):
  for x in range(n):
    new = []
    for y in range(n):
      new.append(random.random())
    mat.append(new)

def matmultblk(a,b,c,n,N):
# divide the nxn matrix into an NxN grid of subblocks
# each block is (n/N)x(n/N); n must be divisible by N
  blk = n // N
  for kk in range(0,n,blk): 
    #print("kk: ",kk)
    for jj in range(0,n,blk):
      #print("jj: ",jj)
      for i in range(n): 
        for j in range(jj,jj+blk):
          for k in range(kk,kk+blk):
            #print("A[%d][%d], B[%d][%d]" %(i,k,k,j))
            c[i][j] = c[i][j] + a[i][k]*b[k][j]
        #print(" ")
  return c

#@title Select the matrix size and block size such that n is divisible by N
n = 32 #@param{type: "number"}
N =  4 #@param{type: "number"}
#Small matrix- n=32, N=4; Large matrix- n=512, N=32

A = []
B = []
C = []

init_rand(A,n)
init_rand(B,n)
init_matrix(C,n,0)

t = time()
c_block = matmultblk(A,B,C,n,N)
runtime = time() - t
print("MMM with blocking completed in %f seconds" % (runtime))

t = time()
c_np = np.matmul(A,B)
runtime = time() - t
print("Numpy matmul() completed in %f seconds" % (runtime))

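#Own implementation: call the BLAS GEMM routine directly through SciPy
#sgemm is the single-precision routine (inputs are cast to float32); dgemm is the double-precision equivalent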
t = time()
c_own = scipy.linalg.blas.sgemm(1,A,B)
runtime = time() - t
print("Own implementation completed in %f seconds" % (runtime))

print("c_block \n" + str(c_block) + "\n")
print("c_np \n" + str(c_np) + "\n")
print("c_own \n" + str(c_own) + "\n")

MMM with blocking completed in 0.003000 seconds
Numpy matmul() completed in 0.000000 seconds
Own implementation completed in 0.000000 seconds
c_block 
[[9.36049405419418, 9.420124125386426, 10.219767698498261, 9.917608712228937, 9.932886305828166, 11.244888598072093, 11.937447892822819, 9.119707139843259, 9.374824287188682, 9.511275918164085, 9.922035727221846, 12.321457274992289, 8.998997903540511, 10.751414842598919, 10.808147908035538, 10.305955521091226, 10.197368381609946, 8.090015098631003, 9.86391190045289, 9.606368784130456, 9.412002453611557, 9.290280630654635, 10.385824299241413, 10.479646786434614, 9.127492230072086, 11.794115105108345, 9.130347476524216, 10.910691858219142, 11.243752486669262, 11.461525990035636, 11.12864591985633, 10.324764613711086], [8.028640833467177, 7.639044289617662, 9.146965188560678, 7.9593975355263495, 9.202140698861207, 9.257557794457458, 10.26740173222553, 7.080663016602286, 7.399858139302395, 8.193314061563294, 8.268303353376009, 10.81627492418