# **Matrix and Vector operations/calculations**

Matrix and vector operations are an extremely important topic in high-performance computing, because many large calculations reduce to them. They are equally central in fields such as ML/data science, image processing, and computer graphics.

In this notebook we first define a sample problem to serve as a running example, and then use it to explore techniques related to matrix/vector operations.


Running problem: diffusion of fluids (for example, dye spreading in water).

The one-dimensional partial differential equation for diffusion is as follows.

<center><image src="./img/5.jpg" width="250"/></center>

Here, `u` is the vector representing the quantity being diffused. For example, it could hold the amount of dye at each location in the fluid. In general, this will be a 2D or 3D matrix representing an actual area or volume of fluid. `D` is a value that captures the properties of the fluid under consideration (the diffusion coefficient). `x` and `t` represent position and time respectively.

Since we cannot compute with continuous values directly, we approximate the equation above using Euler's method. (Essentially, we replace each derivative with its finite-difference form, which is the limit definition of the derivative evaluated at a small but finite step.)

<center><image src="./img/6.jpg" width="500"/></center>

Above, `dt` is the time step (the smallest time interval we simulate, analogous to a frame rate) and `dx` is the spatial resolution: the smaller `dx` is, the smaller the region each cell of `u` represents. Generally we take `dx` to be 1.

The basic pseudocode implementation of the above for 1D (one dimension) is as follows.
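For reference, the approximation can also be written out explicitly. This is a sketch of the standard forward-Euler, central-difference form, on the assumption that the images above show the same equations; the symbols `u`, `D`, `x`, `t`, `dx` and `dt` are the ones defined in the text.

$$
\frac{\partial u}{\partial t} = D \, \frac{\partial^2 u}{\partial x^2}
$$

$$
\frac{\partial u}{\partial t} \approx \frac{u(x, t + dt) - u(x, t)}{dt},
\qquad
\frac{\partial^2 u}{\partial x^2} \approx \frac{u(x + dx, t) + u(x - dx, t) - 2\,u(x, t)}{dx^2}
$$

$$
u(x, t + dt) \approx u(x, t) + D \, dt \, \frac{u(x + dx, t) + u(x - dx, t) - 2\,u(x, t)}{dx^2}
$$

With `dx = 1` the division by `dx^2` disappears, which is why it is often left out of the update step.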


<pre style="color:yellow">

# Initializing the data and conditions
u = space vector of size N
for i in range(N):
    u[i] = 0 if water else 1  # 1 where there is dye

D = 1
dx = 1
t = 0
dt = 0.0001

while True:

    print(f'time: {t}')

    u_next = vector of size N
    for i in range(N):
        u_next[i] = u[i] + D * dt * (u[(i+1)%N] + u[(i-1)%N] - 2 * u[i]) / dx**2

    u = u_next
    t = t + dt
    visualize(u)

</pre>

In the code above, every index that uses `%N` handles the out-of-bounds problem at the edges of the vector (that is, `i - 1` at `i = 0` and `i + 1` at `i = N - 1` fall outside it). The modulo wraps such an index around to the opposite end of the vector, which gives periodic boundaries. Alternatively, we can treat out-of-bounds values as zero, or reuse the closest in-bounds value (the same idea as padding in convolutional neural networks).

Depending on the number of dimensions, the code needs to be updated. For example, for 2D diffusion there are two directions to consider for the diffusion gradient.
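The three boundary treatments mentioned above can be compared in a small runnable sketch. The helper names `neighbor` and `diffuse_1d` and the mode names `"wrap"`, `"zero"` and `"nearest"` are made up for this illustration and are not part of the notebook's scripts.

```python
def neighbor(u, i, mode="wrap"):
    """Return u[i], resolving an out-of-bounds index according to `mode`."""
    n = len(u)
    if 0 <= i < n:
        return u[i]
    if mode == "wrap":       # periodic: same effect as indexing with % N
        return u[i % n]
    if mode == "zero":       # everything outside the vector counts as 0
        return 0.0
    if mode == "nearest":    # reuse the closest in-bounds value
        return u[0] if i < 0 else u[n - 1]
    raise ValueError(f"unknown mode: {mode}")


def diffuse_1d(u, dt, D=1.0, dx=1.0, mode="wrap"):
    """One explicit Euler step of 1D diffusion; returns a new list."""
    return [
        u[i] + D * dt * (neighbor(u, i + 1, mode)
                         + neighbor(u, i - 1, mode)
                         - 2 * u[i]) / dx ** 2
        for i in range(len(u))
    ]


# Dye in the first cell, so the left boundary matters immediately.
u = [0.0] * 10
u[0] = 1.0

wrapped = diffuse_1d(u, dt=0.1, mode="wrap")
zeroed = diffuse_1d(u, dt=0.1, mode="zero")
nearest = diffuse_1d(u, dt=0.1, mode="nearest")

for name, result in (("wrap", wrapped), ("zero", zeroed), ("nearest", nearest)):
    print(name, [round(v, 3) for v in result], "total:", round(sum(result), 3))
```

With `"wrap"` the dye leaving the left edge reappears in the last cell, and with `"nearest"` nothing crosses the edge, so both keep the total amount of dye constant. With `"zero"` some dye leaks out of the simulated region at every step.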

Below is an actual implementation of the diffusion algorithm for 2D space.
<center><image src="./img/7.jpg" width="400"/></center>

The code below initializes the data and acts as the executor for the diffusion equation.

In [9]:
%%writefile scripts/run_experiment.py

import time

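# `profile` is supplied at run time by the line profiler (e.g. kernprof);
# running this file with plain `python` raises NameError on the decorator.
@profile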
def step(grid, dt, D = 1.0):

    max_y, max_x = len(grid), len(grid[0])

    # Build each row separately; [[0.0]*max_x]*max_y would repeat ONE row object
    grid_next = [[0.0] * max_x for _ in range(max_y)]

    for i in range(max_y):
        for j in range(max_x):
            
            y_dir_gradient2 = (grid[(i + 1) % max_y][j]
                               + grid[(i - 1) % max_y][j]
                               - 2 * grid[i][j])
            x_dir_gradient2 = (grid[i][(j + 1) % max_x]
                               + grid[i][(j - 1) % max_x]
                               - 2 * grid[i][j])

            grid_next[i][j] = grid[i][j] + D * dt * (x_dir_gradient2 + y_dir_gradient2)

    return grid_next

def run_experiment(num_iterations, grid_shape):

    # Initial grid
    max_y, max_x = grid_shape

    # Build each row separately; [[0.0]*max_x]*max_y would repeat ONE row object
    grid = [[0.0] * max_x for _ in range(max_y)]

    # Simulating a drop of dye in the middle of our simulated region
    block_low = int(grid_shape[0] * 0.4)
    block_high = int(grid_shape[0] * 0.5)

    for i in range(block_low, block_high):
        for j in range(block_low, block_high):
            grid[i][j] = 0.005

    # Running the experiment
    st = time.time()
    for _ in range(num_iterations):
        grid = step(grid, 0.1)

    return time.time() - st

if __name__== '__main__':
    run_experiment(150, (640, 640))



The line-profiling output for the code above is as follows.

<center><image src="./img/8.jpg" width="800"/></center>

As we can see, creating `grid_next` on every call to `step` takes a large amount of time per hit. This is unnecessary because the dimensions of the grid do not change over time, so we can allocate the list once and reuse it across all steps. A more optimized version looks like this.

In [8]:
%%writefile scripts/run_experiment_segmented.py

import time

@profile
def step(grid, dt, grid_next, D = 1.0):

    max_y, max_x = len(grid), len(grid[0])

    for i in range(max_y):
        for j in range(max_x):
            
            y_dir_gradient2 = (grid[(i + 1) % max_y][j]
                               + grid[(i - 1) % max_y][j]
                               - 2 * grid[i][j])
            x_dir_gradient2 = (grid[i][(j + 1) % max_x]
                               + grid[i][(j - 1) % max_x]
                               - 2 * grid[i][j])

            grid_next[i][j] = grid[i][j] + D * dt * (x_dir_gradient2 + y_dir_gradient2)


def run_experiment(num_iterations, grid_shape):

    # Initial grid
    max_y, max_x = grid_shape

    # Build each row separately; [[0.0]*max_x]*max_y would repeat ONE row object
    grid = [[0.0] * max_x for _ in range(max_y)]

    # Simulating a drop of dye in the middle of our simulated region
    block_low = int(grid_shape[0] * 0.4)
    block_high = int(grid_shape[0] * 0.5)

    for i in range(block_low, block_high):
        for j in range(block_low, block_high):
            grid[i][j] = 0.005


    # Predefining the next grid
    grid_next = [[0.0] * max_x for _ in range(max_y)]  # independent rows

    # Running the experiment
    st = time.time()
    for _ in range(num_iterations):
        step(grid, 0.1, grid_next)

        # Swap the references: the old grid's values are no longer needed,
        # so its storage is reused as the output buffer of the next step
        grid, grid_next = grid_next, grid

    return time.time() - st

if __name__== '__main__':
    run_experiment(150, (640, 640))



<center><image src="./img/9.jpg" width="800"/></center>

We can see that the execution time has been reduced by about 20% just by moving the expensive one-time allocation out of the loop. Still, most of the operations performed here are very similar to one another, so doing them in a vectorized fashion makes more sense. However, Python does not support vectorization natively, and the interpreter cannot recognize that a code segment could be optimized in this way.


In Python, a list stores pointers to its items: the actual data lives somewhere else in memory, and the list holds only its address. This is fine for many use cases, but not for matrix/vector workloads. Consider the list item `data[x][y]`. Because access goes through pointers, Python needs two memory lookups just to locate one item: one for the row `data[x]` and one for the element inside it. This is fine in isolation, but across many such operations it becomes costly.

Also, in most cases nearby items are needed for the calculation as well. As an optimization, a whole block of memory can therefore be fetched in one operation rather than locating each item individually (caching). If the data is fragmented in memory this cannot be done, and the CPU has to wait for the data to arrive (cache misses).

To reduce the cost of cache misses, modern CPUs have a few mechanisms, namely `branch prediction` and `pipelining`, which try to predict the next instruction and load the relevant data beforehand to improve CPU performance. Still, the most effective way of addressing the problem is to lay the data out in memory properly.
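One visible consequence of lists holding references rather than values shows up when a 2D grid is built from nested lists. The sketch below uses made-up variable names and a tiny 2x3 grid purely for illustration.

```python
# Multiplying the outer list copies the reference to one row, not the row itself.
aliased = [[0.0] * 3] * 2
aliased[0][0] = 1.0          # writes through the single shared row

# A comprehension builds a fresh row object for every index.
independent = [[0.0] * 3 for _ in range(2)]
independent[0][0] = 1.0      # only the first row changes

rows_shared = aliased[0] is aliased[1]
rows_distinct = independent[0] is not independent[1]

print(aliased)       # [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(independent)   # [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

This matters for any grid stored as a list of lists: if the rows are shared, a write to one cell silently changes the same column in every row.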

In performance analysis, we can observe how data is being moved to the CPU using various tools.
- Linux --> `perf` tool
- OSX   --> Google's `gperftools` or the provided `Instruments` tool
- Windows --> `Visual Studio Profiler`


**Important concepts to note:**

* The ratio between the number of CPU instructions and the number of CPU cycles (instructions per cycle) indicates how well our code is vectorized and pipelined.
* CS (context switches) and migrations indicate how many times our program was halted to let other applications run (possibly due to I/O operations, waiting operations, etc.). We cannot fully control this, but we can reduce how often it occurs by reducing blocking operations. (A migration means the program is halted on one CPU core and resumed on another, which is done to keep all cores evenly used.)
* A page fault occurs when the requested memory is not yet available to the program. If the kernel can resolve it without touching the disk, by allocating the memory or mapping in a page that is already in RAM, it is a `minor page fault`; these are problematic in CPU-bound situations. A `major page fault` (also called a hard page fault) means the required data is not in memory at all and must be read from disk, which is costly for any program.
* It is important to lay out the data in memory so as to reduce `cache-misses`.
* A branch, in the CPU optimization context, is a point where the flow of execution changes. For example, if-else conditions and function calls can be considered branches. Modern CPUs try to predict the flow beforehand and load the instructions early. This improves performance when the prediction is correct; when it is wrong, execution takes longer. This is called a `branch miss`. For example, some loops run faster over sorted data than over unsorted data.


With those in mind, we can see that executing more instructions per CPU cycle improves the performance of our program. To achieve that, we can make sure the data needed for upcoming processing is already loaded into the cache. But since Python lists store pointers, the actual data may not sit in a single block of memory (it is fragmented) that can be copied to the cache together. To avoid this issue, we can use Python's `array` module instead of lists.

Python arrays store the data itself sequentially in memory rather than pointers to it. Contiguous chunks of memory can now be loaded into the cache, which in turn reduces cache misses. This does not vectorize the computation, however. To perform vectorized CPU operations we need special modules that make Python use those instructions.

> Because of how Python's `array` is implemented, it is slower to create than a normal list. It also has an overhead when reading values, since each read has to build a Python object from the raw stored value. It is therefore generally not a good fit for math operations, but it is well suited to storing fixed-type data.
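As a minimal sketch of what the `array` module provides, the snippet below stores one grid row as raw C doubles. The row length of 640 is borrowed from the experiment above, and the variable names are illustrative only.

```python
from array import array

max_x = 640

row_list = [0.0] * max_x                 # list: max_x pointers to float objects
row_array = array('d', [0.0]) * max_x    # array: max_x raw C doubles, back to back

# buffer_info() reports the address of the contiguous buffer and its item count.
address, length = row_array.buffer_info()
total_bytes = length * row_array.itemsize

row_array[3] = 0.005
value = row_array[3]    # the raw double is converted back into a Python float

print(f"{length} items x {row_array.itemsize} bytes = {total_bytes} contiguous bytes")
print(type(value).__name__, value)
```

The type code `'d'` fixes every element to a C double, so the values sit next to each other in one buffer. Each read still has to wrap the raw value in a Python `float`, which is the read overhead described in the note above.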