### Parallel Matrix Multiplication (Java) [6 points]

For `N × N` matrices `a, b, c: 0 .. N − 1 × 0 .. N − 1 → float` where `N ≥ 1`, sequential matrix multiplication is expressed by:

```algorithm
for i = 0 to N − 1 do
    for j = 0 to N − 1 do
        c(i, j) := 0.0
        for k = 0 to N − 1 do
            c(i, j) := c(i, j) + a(i, k) × b(k, j)
```

This algorithm can be parallelized by turning the two outer `for` loops into `par` statements. That would create `N²` processes, which may be far more than there are processors, so the overhead of process creation may outweigh the benefit of parallelism. A solution is to divide the rows of the matrix into strips of consecutive rows and to use *one worker process per strip.*

Let `P` be the number of worker processes:

```algorithm
procedure worker(w: 0 .. P − 1)
    var first = w × N div P
    var last = (w + 1) × N div P − 1
    for i = first to last do
        for j = 0 to N − 1 do
            c(i, j) := 0.0
            for k = 0 to N − 1 do
                c(i, j) := c(i, j) + a(i, k) × b(k, j)

par w = 0 to P − 1 do worker(w)
```
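
The strip bounds in `worker` can be checked in isolation. The following is a minimal sketch, not part of the assignment template: the class name `StripBounds` and the values `N = 10`, `P = 4` are chosen only for illustration. It shows that the strips cover all rows without overlap even when `P` does not divide `N`.

```java
// Sketch: compute the strip of rows assigned to each worker.
// StripBounds and the sample values below are illustrative assumptions.
class StripBounds {
    // First row of worker w's strip: w * N div P (integer division).
    static int first(int w, int n, int p) {
        return (int) ((long) w * n / p);
    }

    // Last row (inclusive) of worker w's strip: (w + 1) * N div P - 1.
    static int last(int w, int n, int p) {
        return (int) ((long) (w + 1) * n / p) - 1;
    }

    public static void main(String[] args) {
        int n = 10, p = 4; // sample sizes, not taken from the assignment
        for (int w = 0; w < p; w++) {
            System.out.println("worker " + w + ": rows "
                    + first(w, n, p) + " .. " + last(w, n, p));
        }
        // prints:
        // worker 0: rows 0 .. 1
        // worker 1: rows 2 .. 4
        // worker 2: rows 5 .. 6
        // worker 3: rows 7 .. 9
    }
}
```

Each strip starts where the previous one ends, so the strip sizes differ by at most one row. If `P > N`, some workers get an empty strip (`first > last`) and their loop body never runs.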

For `P = 1`, the execution is sequential. Implement parallel matrix multiplication with workers in Java! Depending on your design, the template has parts that may or may not need to be filled in.

In [1]:
%%writefile Multiply.java
import java.util.Random;
import java.util.Arrays;

class Worker extends Thread {
    int[][] a, b, c;
    int N;
    int first;
    int last;

    Worker(int[][] a, int[][] b, int[][] c, int first, int last){
        this.a = a; this.b = b; this.c = c; this.first = first; this.last = last;
        this.N = a[0].length;
    }

    public void run() {
        for (int i = first; i <= last; i++){
            for (int j = 0; j < N; j++){
                c[i][j] = 0;
                for (int k = 0; k < N; k++){
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }

}

public class Multiply {

    static int N;        // number of rows in Matrix
    static int P;        // number of workers
    static int[][] a, b; // randomly generated input matrices

    static void sequentialmultiply(int c[][]) {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                c[i][j] = 0; 
                for (int k = 0; k < N; k++) {
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }

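    static void parallelmultiply(int c[][]){
        Worker[] w = new Worker[P];

        // assign worker i the strip of rows first .. last (inclusive)
        for(int i = 0; i < P; i ++){
            int first = i*N / P;
            int last = (i+1)*N / P - 1;
            w[i] = new Worker(a, b, c, first, last);
        }

        // start all workers, then wait until every one of them has finished
        for(Worker worker : w) worker.start();
        try{
            for(Worker worker : w) worker.join();
        } catch (InterruptedException e) {
            // restore the interrupt flag instead of silently swallowing it
            Thread.currentThread().interrupt();
        }

    }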

    public static void main(String args[]) {
        N = Integer.parseInt(args[0]);
        P = Integer.parseInt(args[1]);
        a = new int[N][N]; b = new int[N][N];
        int[][] cp = new int[N][N], cs = new int[N][N];
        Random random = new Random();
        for (int i = 0; i < N; i++) {
            for (int j=0; j < N; j++) {
                a[i][j] = random.nextInt(1000);
                b[i][j] = random.nextInt(1000);
            }
        }

        final long start = System.currentTimeMillis();
        parallelmultiply(cp);
        final long end = System.currentTimeMillis();
        
        sequentialmultiply(cs);  // check the correctness
        assert Arrays.deepEquals(cp, cs);
        
        System.out.println((end - start) + " ms"); 
    }
}

Overwriting Multiply.java


Use the cell below to test your implementation.

In [2]:
!javac Multiply.java
!java -enableassertions Multiply  100 2

15 ms


Now observe the elapsed time for multiplying `1000 × 1000` matrices with various values of `P`. You may add cells to record your observations. Run the program multiple times and keep the cell with the shortest elapsed time.
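
Repeating a measurement and keeping the shortest time can also be done inside a program. The following is a minimal sketch under stated assumptions: the class `BestOf`, the method `bestOfMillis`, and the placeholder workload are hypothetical and not part of the template; in `Multiply.java` the timed task would be the call to `parallelmultiply(cp)`.

```java
// Sketch: run a task several times and keep the shortest elapsed time.
// BestOf and bestOfMillis are illustrative names, not part of the template.
class BestOf {
    // Runs the task `runs` times (runs >= 1) and returns the shortest
    // elapsed time in milliseconds.
    static long bestOfMillis(Runnable task, int runs) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best / 1_000_000; // nanoseconds to milliseconds
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for the matrix multiplication.
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) sum += i;
            if (sum < 0) System.out.println(sum); // uses the result so the loop is not dead code
        };
        System.out.println(bestOfMillis(task, 5) + " ms");
    }
}
```

Taking the minimum rather than the average reduces the influence of one-off disturbances such as JVM warm-up or other processes competing for the cores.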

In [3]:
!java -enableassertions Multiply  1000 1

2308 ms


In [4]:
!java -enableassertions Multiply  1000 2

1199 ms


In [5]:
!java -enableassertions Multiply  1000 5

473 ms


In [6]:
!java -enableassertions Multiply 1000 10

349 ms


In [7]:
!java -enableassertions Multiply 1000 100

305 ms


In [8]:
!java -enableassertions Multiply 1000 200

305 ms


In [9]:
!java -enableassertions Multiply 1000 500

380 ms


In [10]:
!java -enableassertions Multiply 1000 1000

400 ms


Summarize your observations for what (approximate) value of `P` you get shorter and longer elapsed times and explain why!

As the number of workers increased, the elapsed time went down until about 100 workers, where I got the fastest time of 257 ms. Beyond that, the time went up again: 200 workers took 305 ms, 500 workers took 324 ms, and 1000 workers took 410 ms. One might expect more workers to always be faster, but my tests show otherwise. Once `P` exceeds the number of cores, the additional threads cannot run at the same time, so they add no parallelism, while every thread still costs time to create, schedule, and join. From a little research online, other contributing factors include the thread stack size, the maximum virtual memory, and the kind of kernel scheduler in use.
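
These timings can be put in context by comparing `P` with the number of processors the JVM reports. The sketch below uses `Runtime.getRuntime().availableProcessors()`; the class name `CoreCount` and the helper `simultaneous` are made up for illustration, and the list of `P` values simply repeats the ones tried above. It only illustrates that at most as many threads as there are processors can execute at the same instant; it is not a measurement.

```java
// Sketch: compare the worker counts tried above with the processor count.
// CoreCount and simultaneous are illustrative names, not part of the template.
class CoreCount {
    // Upper bound on how many of p threads can execute at the same instant.
    static int simultaneous(int p, int cores) {
        return Math.min(p, cores);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("available processors: " + cores);

        int[] workerCounts = {1, 2, 5, 10, 100, 200, 500, 1000};
        for (int p : workerCounts) {
            int running = simultaneous(p, cores);
            System.out.println("P = " + p + ": at most " + running
                    + " running, " + (p - running) + " waiting");
        }
    }
}
```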

*Note.* The time complexity of parallel matrix multiplication can be analyzed with the *work and depth model*. The *work* `𝒲` is the total time to execute the entire computation on one worker; the *depth* `𝒟` is the time to execute the computation on infinitely many workers, that is, the length of the longest chain of operations that must run one after another. In other words, the work is the sum of the times taken by all workers, and the depth is the longest time taken by any single worker. The parallel time complexity is `O(𝒲 / 𝒫 + 𝒟)`, where `𝒫` is the number of workers that can execute at the same time. Here, a worker corresponds to a physical core rather than a thread, since multiple threads can run concurrently on one physical core.

- The work is the same as that of the sequential version, `O(N³)`.
- The depth is `O(N²)`, since the implementation runs only the outer loop in parallel and each row still takes `O(N²)` sequential steps.
- The total worst-case running time is `O(N³ / P + N²)`, where `P` is the number of workers.
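
The last bound can be derived directly from the strip decomposition. As a sketch, assuming `P ≤ N` and counting one multiply-add as one time unit, each worker handles at most `⌈N / P⌉` rows, and each row costs `N²` multiply-adds:

$$
T(P) \;\le\; \left\lceil \frac{N}{P} \right\rceil N^2 \;\le\; \left( \frac{N}{P} + 1 \right) N^2 \;=\; \frac{N^3}{P} + N^2 .
$$

For `P = 1` the first term dominates and matches the sequential work `N³`. For `P = N` each worker computes a single row, which takes `N²` steps, the depth.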

We observe that, in theory, each element `c(i, j)` can be computed independently in parallel by `N²` workers, each taking `O(N)` time for the inner loop.

- With such an implementation, the depth is `O(N)`.
- In this case, the total parallel running time is `O(N³ / P + N)`, where `P` is the number of workers.
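
The element-wise decomposition can be sketched with a parallel stream that creates one independent task per element `c(i, j)`. This is an illustration under assumptions, not the assignment's method: the class `ElementwiseMultiply` is hypothetical, and a parallel stream schedules the `N²` tasks on a bounded pool of threads rather than creating `N²` threads.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Sketch: one independent task per element c(i, j).
// ElementwiseMultiply is an illustrative name, not part of the template.
class ElementwiseMultiply {
    static int[][] multiply(int[][] a, int[][] b) {
        int n = a.length;
        int[][] c = new int[n][n];
        // Index idx in 0 .. n*n - 1 encodes the pair (i, j).
        IntStream.range(0, n * n).parallel().forEach(idx -> {
            int i = idx / n;
            int j = idx % n;
            int sum = 0;
            for (int k = 0; k < n; k++) {
                sum += a[i][k] * b[k][j]; // the O(N) inner loop stays sequential
            }
            c[i][j] = sum; // each task writes a distinct element, so no locking is needed
        });
        return c;
    }

    public static void main(String[] args) {
        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(a, b)));
        // prints [[19, 22], [43, 50]]
    }
}
```

Because every task writes a different element of `c` and only reads `a` and `b`, the tasks need no synchronization among themselves.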