### Parallel Average (C) [6 points]

[Pthreads](https://hpc-tutorials.llnl.gov/posix/) is a library commonly available for Unix, as well as other operating systems. It is standardized by POSIX. In C, using Pthreads requires:
```C
#include <pthread.h>
```
For each thread, the following variables have to be declared and initialized:
```C
pthread_attr_t tattr;       /* thread attributes */
pthread_t tid;              /* thread descriptor */

pthread_attr_init(&tattr);  /* default values */
pthread_attr_setscope(&tattr, PTHREAD_SCOPE_SYSTEM);
```
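The default values include a default stack size. Threads can compete for scheduling either locally (process scope), i.e. with the other threads of the same process, or globally (system scope), i.e. with all other threads in the system. Which scope is the default is implementation-defined; Linux supports only system scope. The fork operation, `pthread_create`, takes a location for the thread descriptor (first argument), a previously initialized attribute descriptor, the function to be started, and its argument; it starts the thread and initializes the thread descriptor: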
```C
pthread_create(&tid, &tattr, start_func, arg);
```
A thread terminates when its start function returns or when it calls:
```C
pthread_exit(value);
```
A parent thread can wait for a child to terminate; it specifies the descriptor and a location for the child's return value (or `NULL` if the value is not needed):
```C
pthread_join(tid, value_ptr);
```

---

Consider the following C program, which uses two _worker threads_ to search for an element in an array. The program won't necessarily be faster than a sequential one, but it illustrates the concepts. The two workers do not communicate with each other; only the main program collects their results. This is therefore an example of "embarrassing parallelism": concurrency is used to potentially achieve a speedup.

In [37]:
%%writefile ParallelFind.c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define SHARED 1
#define N 100

struct Args {int x; int l; int u; bool found;};
int a[N];

void *worker(struct Args *arg) {
    // 0 <= arg->l <= arg->u <= N
    for (int i = arg->l; i < arg->u; i++)
        if (a[i] == arg->x) {arg->found = true; return NULL;}
    arg->found = false;
    return NULL;
}

int main(int argc, char *argv[]) {
    pthread_t w0, w1;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    // populate array a with N "random" values
    for (int i = 0; i < N; i++) a[i] = i;
    
    struct Args a0 = {42, 0, N / 2};
    struct Args a1 = {42, N / 2, N};
    pthread_create(&w0, &attr, worker, &a0);
    pthread_create(&w1, &attr, worker, &a1);
    pthread_join(w0, NULL);
    pthread_join(w1, NULL);
    printf("%d %d\n", a0.found, a1.found);
}

Overwriting ParallelFind.c


Creating a thread by `pthread_create(&w0, &attr, worker, &a0)` starts the function `worker` with the parameter `&a0` as a new thread and assigns the id of the thread to `w0`.

Run the next cells to test whether `42` appears in the lower half or upper half:

In [38]:
!gcc ParallelFind.c -lpthread -Wno-incompatible-pointer-types -o ParallelFind

In [39]:
!./ParallelFind

1 0


---

The task is to compute the average of `n` numbers `a(0)`, ..., `a(n – 1)`. For example, for `n = 5`, the average can be computed in different ways:

      (a(0) + a(1) + a(2) + a(3) + a(4)) / 5
    = a(0) / 5 + a(1) / 5 + a(2) / 5 + a(3) / 5 + a(4) / 5
    = (a(0) + a(1) + a(2)) / 5 + (a(3) + a(4)) / 5

The last variant suggests a computation in parallel: one thread computes `(a(0) + a(1) + a(2)) / 5`, and a second thread computes `(a(3) + a(4)) / 5`; the main program collects the results of the two threads and adds them.

The program below computes the average of `n` random integers sequentially; you are asked to complete the parallel computation with two workers, following `ParallelFind`. The average is computed both ways, and the time each computation takes is printed. The program reads `n` from the command line to make testing easier. [4 points]

In [2]:
%%writefile Average.c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define SHARED 1

struct Args {int *a; int l; int u; int n; double avg;};

void *worker(struct Args *arg) {
    // arg->a has arg->n elements && 0 <= arg->l <= arg->u <= arg->n
    double s = 0;
    for (int i = arg->l; i < arg->u; i++) s += arg->a[i];
    arg->avg = s / arg->n;
    return NULL;
}

double sequentialaverage(int a[], int n) {
    // a has n elements
    double s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s / n;
}

static double parallelaverage(int a[], int n) {
    // a has n elements
    pthread_t w0, w1;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    struct Args a0 = {a, 0, n / 2, n};
    struct Args a1 = {a, n / 2, n, n};
    pthread_create(&w0, &attr, worker, &a0);
    pthread_create(&w1, &attr, worker, &a1);
    pthread_join(w0, NULL);
    pthread_join(w1, NULL);

    return a0.avg + a1.avg;
}

int main(int argc, char *argv[]) {
    
    if (argc < 2) {fprintf(stderr, "usage: %s n\n", argv[0]); return 1;}
    int n = atoi(argv[1]);
    int a[n];
    srand(time(NULL));
    for (int i = 0; i < n; i++) a[i] = rand() % 10000;
    
    struct timeval start, end;
    gettimeofday(&start, 0);
    double avg = sequentialaverage(a, n);
    gettimeofday(&end, 0);
    long seconds = end.tv_sec - start.tv_sec;
    long microseconds = end.tv_usec - start.tv_usec;
    long elapsed = seconds * 1000000L + microseconds;
    printf("Sequential: %f Time: %ld microseconds\n", avg, elapsed);
    
    gettimeofday(&start, 0);
    avg = parallelaverage(a, n);
    gettimeofday(&end, 0);
    seconds = end.tv_sec - start.tv_sec;
    microseconds = end.tv_usec - start.tv_usec;
    elapsed = seconds * 1000000L + microseconds;
    printf("Parallel:   %f Time: %ld microseconds\n", avg, elapsed);
}

Overwriting Average.c


In [3]:
!gcc Average.c -lpthread -Wno-incompatible-pointer-types -o Average

Run your implementation with the following values of `n`; you may also include more values. As each run can produce different timing results, run your implementation with the same value of `n` several times. The above program measures the elapsed time, not the CPU time. If there are other processes (users) on the same CPU, the elapsed time will be larger than the CPU time. If you are using a server, choose a time of the day with few other users. In multiple runs with the same parameter, smaller times approximate the CPU time better.

In [4]:
!./Average 10

Sequential: 4893.700000 Time: 0 microseconds
Parallel:   4893.700000 Time: 696 microseconds


In [5]:
!./Average 100

Sequential: 4556.620000 Time: 1 microseconds
Parallel:   4556.620000 Time: 677 microseconds


In [6]:
!./Average 1000

Sequential: 5037.065000 Time: 3 microseconds
Parallel:   5037.065000 Time: 705 microseconds


In [7]:
!./Average 10000

Sequential: 4997.253700 Time: 28 microseconds
Parallel:   4997.253700 Time: 708 microseconds


In [13]:
!./Average 100000

Sequential: 5005.692310 Time: 282 microseconds
Parallel:   5005.692310 Time: 787 microseconds


In [11]:
!./Average 1000000

Sequential: 4997.534476 Time: 2877 microseconds
Parallel:   4997.534476 Time: 2138 microseconds


In [36]:
!./Average 500000

Sequential: 4994.938802 Time: 1488 microseconds
Parallel:   4994.938802 Time: 1931 microseconds


How large does `n` have to be for the parallel version to show a speedup? Add additional cells as you like. State your answer in the cell below! State the processor (model, frequency, number of cores) on which you ran the test; do some research on your own on how to find that out from the command line. [2 points]

In my tests, the sequential computation is always faster for list sizes below 400,000. Between 400,000 and 500,000, either version may be faster, with roughly even odds. For list sizes above 500,000, the parallel computation is always faster.
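This crossover is consistent with a rough cost model, offered here as an assumption rather than a measured law: the parallel version pays a fixed overhead $c$ for creating and joining the two threads and then does half the sequential work, while the sequential time grows linearly with a per-element cost $t$.

$$
T_{\text{par}}(n) \approx c + \tfrac{1}{2}\,T_{\text{seq}}(n), \qquad T_{\text{seq}}(n) \approx t \cdot n
$$

The two versions break even when $T_{\text{par}}(n^*) = T_{\text{seq}}(n^*)$, i.e. when $T_{\text{seq}}(n^*) = 2c$:

$$
n^* \approx \frac{2c}{t}
$$

Taking $c \approx 700\,\mu s$ from the parallel runs with small `n` and $t \approx 2877\,\mu s / 10^6 \approx 0.0029\,\mu s$ from the sequential run with `n = 1000000` gives

$$
n^* \approx \frac{1400}{0.002877} \approx 4.9 \times 10^5,
$$

which falls in the 400,000 to 500,000 range observed in the runs above.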

Processor: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 16 cores