# Homework 07 AI accelerators

## SambaNova

**Homework task: for BERT example, understand flags used in the script. Change values for flag `--ntasks` and measure its effect on performance.**

The original command used to train this model was:

```
/usr/local/bin/sbatch --output=${HOME}/slurm-%A.out --ntasks 16 --gres=rdu:8 --ntasks-per-node 16  --nodes 1 --nodelist $(hostname) --cpus-per-task=8  ${PROJ_DIR}/BertLarge_run.sh $1 >> ${OUTPUT_PATH} 2>&1
```

A task in this context is a process that runs as part of the job. This command launches 16 tasks in total (`--ntasks 16`). It requests a single compute node with `--nodes 1`, and `--ntasks-per-node 16` allows all 16 processes to be launched on that node. `--cpus-per-task=8` allocates 8 CPUs to each process, making it possible to parallelize across these CPUs within each task. `--nodelist $(hostname)` pins the job to the node the command is submitted from. Additionally, `--gres=rdu:8` requests 8 RDUs (Reconfigurable Dataflow Units, the SambaNova accelerators) as a generic resource.
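The flag arithmetic above can be checked with a small sketch. The helper `slurm_footprint` is hypothetical (it is not part of Slurm or the SambaNova tooling), and the tasks-per-RDU figure is only a count: how tasks are actually mapped onto RDUs is decided by the run script, which is not shown here.

```python
def slurm_footprint(ntasks, ntasks_per_node, nodes, cpus_per_task, rdus):
    """Return totals implied by the sbatch flags (hypothetical helper)."""
    # All tasks must fit within the per-node limit across the requested nodes.
    if ntasks > ntasks_per_node * nodes:
        raise ValueError("ntasks exceeds ntasks_per_node * nodes")
    return {
        "total_cpus": ntasks * cpus_per_task,  # CPUs reserved for the whole job
        "tasks_per_rdu": ntasks / rdus,        # a plain ratio, not a placement
    }


# Values taken from the original command: --ntasks 16, --ntasks-per-node 16,
# --nodes 1, --cpus-per-task=8, --gres=rdu:8
original = slurm_footprint(ntasks=16, ntasks_per_node=16, nodes=1,
                           cpus_per_task=8, rdus=8)
print(original)
```

Under these flags the job reserves 128 CPUs in total and has two tasks for every requested RDU.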

This yielded the following performance:

- e2e_train_time: 455.2987697124481 
- training_sequences_per_second: 581520.2183990465
- final_loss: 8.254524230957031
- training_samples_per_second: 4543.12670624255

I changed the `--ntasks` parameter to 8 and also set `--ntasks-per-node` to 8 to match the number of tasks.

```
/usr/local/bin/sbatch --output=${HOME}/slurm-%A.out --ntasks 8 --gres=rdu:8 --ntasks-per-node 8  --nodes 1 --nodelist $(hostname) --cpus-per-task=8  ${PROJ_DIR}/BertLarge_run.sh $1 >> ${OUTPUT_PATH} 2>&1
```

This yielded the following performance:

- e2e_train_time: 440.9703402519226
- training_sequences_per_second: 300207.7643688482
- final_loss: 8.30881118774414
- training_samples_per_second: 2345.3731591316264

With half as many tasks, the throughput dropped to roughly half: both sequences per second and samples per second fell to about 52% of the 16-task run. The end-to-end training time decreased slightly (by about 3%), and the final loss was slightly higher than in the first run (8.31 vs. 8.25).
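As a rough check of the comparison, the ratios between the two runs can be computed directly from the reported metrics. This is only a sketch: the dictionary layout and the `ratio` helper are illustrative, and the numbers are copied from the two result lists above.

```python
# Metrics copied from the two runs above, keyed by the --ntasks value.
runs = {
    16: {
        "e2e_train_time": 455.2987697124481,
        "training_sequences_per_second": 581520.2183990465,
        "training_samples_per_second": 4543.12670624255,
    },
    8: {
        "e2e_train_time": 440.9703402519226,
        "training_sequences_per_second": 300207.7643688482,
        "training_samples_per_second": 2345.3731591316264,
    },
}


def ratio(metric):
    """Value of `metric` in the 8-task run relative to the 16-task run."""
    return runs[8][metric] / runs[16][metric]


for metric in runs[16]:
    print(f"{metric}: {ratio(metric):.3f}x")
```

A single run per configuration does not show how much of the small difference in training time or loss is run-to-run noise.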

## Graphcore

**Homework task: Run MNIST example by changing values of the input parameters like batch-size, learning rate and number of epochs trained and observe and report the performance implications.**

The original parameters of the MNIST example were the following:

- learning_rate = 0.03
- epochs = 10
- batch_size = 8
- test_batch_size = 80

The accuracy on the test set was 98.07%, with 10.97 seconds per iteration.

I raised the learning rate by one order of magnitude, hoping for faster training, and increased the test batch size from 80 to 100. The training batch size stayed the same:

- learning_rate = 0.3
- epochs = 10
- batch_size = 8
- test_batch_size = 100

The accuracy on the test set was 98.64%, with 8.73 seconds per iteration.

The time per iteration did decrease (by about 20%), and the model reached a better test set accuracy in the same number of epochs.

## Cerebras

**Homework task: Run the BERT example with a different batch size, like 512 or 2048 and observe the performance difference**

The original batch size was 1024, and the relevant outputs are:

- The loss for the final step 1000 was 7.07812
- Processed 1024000 sample(s) in 210.545817686 seconds

I first changed the batch size to 512. The relevant outputs are:

- The loss for the final step 1000 was 7.12500
- Processed 512000 sample(s) in 175.796076613 seconds

Then I changed the batch size to 2048. The relevant outputs are:

- The loss for the final step 1000 was 7.14844
- Processed 512000 sample(s) in 334.704423242 seconds

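The shortest training time was for the batch size of 512, the smallest tested, yet the loss was lowest for the batch size of 1024. Note that the 512 run also processed half as many samples as the 1024 run, so its throughput was lower (about 2,900 vs. about 4,900 samples per second). Increasing the batch size from 1024 to 2048 didn't improve the loss, and the run took longer.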
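One way to compare the three runs on equal footing is the wall-clock time per training step. The sketch below assumes that each reported duration covers exactly the 1000 steps mentioned in the logs. The variable names are illustrative.

```python
STEPS = 1000  # each log above reports the loss "for the final step 1000"

# Elapsed seconds copied from the three runs above, keyed by batch size.
elapsed_seconds = {
    512: 175.796076613,
    1024: 210.545817686,
    2048: 334.704423242,
}

# Average wall-clock time of a single step for each batch size.
seconds_per_step = {bs: t / STEPS for bs, t in elapsed_seconds.items()}
fastest = min(seconds_per_step, key=seconds_per_step.get)

for bs in sorted(seconds_per_step):
    print(f"batch_size={bs}: {seconds_per_step[bs]:.4f} s/step")
print(f"shortest step time: batch_size={fastest}")
```

Time per step favors small batches by construction, so it is best read together with the number of samples each step processes.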

## Groq

**Homework task: Run the BERT example with custom input instead of dummy input**

The dummy input has `batch_size` = 1 and `max_seq_length` = 128.

The resulting metrics were:

```
+--------+----------+-------------------------+----------------+----------------------+-------------+
| Source | Accuracy | end-to-end latency (ms) | end-to-end IPS | on-chip latency (ms) | on-chip IPS |
+--------+----------+-------------------------+----------------+----------------------+-------------+
|  cpu   |  77.47%  |           2.25          |     445.29     |          --          |      --     |
|  groq  |  77.47%  |           0.06          |    16788.58    |         0.03         |   37576.72  |
+--------+----------+-------------------------+----------------+----------------------+-------------+
```

I changed the inputs to `batch_size` = 1 and `max_sequence_length` = 512.

The resulting metrics were:

```
+--------+----------+-------------------------+----------------+----------------------+-------------+
| Source | Accuracy | end-to-end latency (ms) | end-to-end IPS | on-chip latency (ms) | on-chip IPS |
+--------+----------+-------------------------+----------------+----------------------+-------------+
|  cpu   |  77.47%  |           9.26          |     108.03     |          --          |      --     |
|  groq  |  77.47%  |           0.13          |    7515.17     |         0.09         |   10953.97  |
+--------+----------+-------------------------+----------------+----------------------+-------------+
```

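As expected, increasing the input size increased the latency: end-to-end latency roughly doubled on Groq (from 0.06 ms to 0.13 ms) and roughly quadrupled on the CPU (from 2.25 ms to 9.26 ms). The accuracy was the same as in the previous configuration.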
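The two tables can be summarized with a short sketch that computes the Groq-over-CPU speedup and the latency growth when the sequence length goes from 128 to 512. The data layout and the helper names `groq_speedup` and `latency_growth` are illustrative. The numbers are the end-to-end columns from the tables above.

```python
# End-to-end results copied from the two tables, keyed by sequence length.
results = {
    128: {
        "cpu": {"latency_ms": 2.25, "ips": 445.29},
        "groq": {"latency_ms": 0.06, "ips": 16788.58},
    },
    512: {
        "cpu": {"latency_ms": 9.26, "ips": 108.03},
        "groq": {"latency_ms": 0.13, "ips": 7515.17},
    },
}


def groq_speedup(seq_len):
    """End-to-end inferences per second on Groq relative to the CPU."""
    return results[seq_len]["groq"]["ips"] / results[seq_len]["cpu"]["ips"]


def latency_growth(source):
    """End-to-end latency at length 512 relative to length 128."""
    return results[512][source]["latency_ms"] / results[128][source]["latency_ms"]


for seq_len in results:
    print(f"seq_len={seq_len}: groq is {groq_speedup(seq_len):.1f}x the cpu IPS")
for source in ("cpu", "groq"):
    print(f"{source}: latency grows {latency_growth(source):.2f}x")
```

The reported latencies have only two decimal places, so the Groq latency ratio in particular is coarse.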
