<a href="https://colab.research.google.com/github/Despicable-bee/PatternFlow/blob/s4484282/StyleGAN2_ADA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training OASIS brain using StyleGAN2-ADA
By Harry Nowakowski



# Verify that our runtime is a GPU
The great thing about Google Colaboratory is that you don't need to set up your own compute cluster (Google is nice enough to provide one for you via the web browser :D ).

In the menu, select Runtime -> Change Runtime Type

Here you'll be able to verify if you're using a **GPU** (or even a **TPU** if you're feeling fancy).

To verify that you're all set up, run the following command:

In [1]:
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-2acacab1-d15a-ce5f-228f-d07d4c7d3323)


You should get something like:

`GPU 0: Tesla K80 (UUID: GPU-.....)`

This means it's working :D
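If you'd rather do the same check from Python, here's a minimal sketch. The function names `parse_gpu_list` and `list_gpus` are made up for this example, and the regular expression only assumes the `GPU N: name (UUID: ...)` line format shown in the output above.

```python
import re
import subprocess

# Matches lines such as "GPU 0: Tesla K80 (UUID: GPU-.....)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+?)\)$")


def parse_gpu_list(text):
    """Turn the text printed by `nvidia-smi -L` into (index, name) pairs."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus


def list_gpus():
    """Run `nvidia-smi -L`; return [] if the tool is missing or fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_list(result.stdout)
```

An empty list from `list_gpus()` means the runtime has no visible Nvidia GPU (or no driver), so go back to Runtime -> Change Runtime Type.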

# Mount your Google Drive
We'll be storing the training models and progress images on Google Drive (because this is a university report and it needs to get marked off).

It also means that if you come from a place like Australia and your Colab notebook gets disconnected, we won't lose the model (wow, they really do think of everything).

# EDIT:
Google is also not made of money, so after 12 hours they'll probably take your GPU away (:c). To get around this, we can use our local GPU by running a [local runtime](https://research.google.com/colaboratory/local-runtimes.html).

I'm using an Nvidia GeForce GTX 980, which only has 4GB of video memory (which is fine for games, but rubbish for ML).
So training will take longer, but since Colab Pro isn't technically available in Australia, I have to make do.

Doing this, however, will prevent you from using the `google.colab` library (so mounting your Google Drive will be a bit harder).

However, now we have access to the RAM and Disk-space rich resource known as **Your own computer**.

So you can just write to your local storage and everything will be fine :)
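Here's a minimal sketch of how a notebook could pick its output folder depending on where it runs. The helper names are hypothetical; the two paths are the ones used for `--outdir` further down in this notebook.

```python
DRIVE_ROOT = "/content/drive/MyDrive/COMP3710_report/OASIS_training_data"
LOCAL_ROOT = "OASIS_training_data"


def running_on_colab():
    """True when the hosted-Colab-only `google.colab` package can be imported."""
    try:
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


def choose_output_root(on_colab=None):
    """Return the Drive folder on hosted Colab, otherwise a local folder."""
    if on_colab is None:
        on_colab = running_on_colab()
    return DRIVE_ROOT if on_colab else LOCAL_ROOT
```

On hosted Colab you still need to mount the drive first (next cell); on a local runtime the folder is simply created next to the training script.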

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Local dependencies
If you're not using Google Colab's online features anymore, you'll need to install PyTorch and a bunch of other libs locally.

PyTorch has a really nice UI for doing this as well; check out their website [here](https://pytorch.org/get-started/locally/). It allows you to pick your installation options from the UI! (nifty!)

In [15]:
!pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio===0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.9.1+cu111
  Downloading https://download.pytorch.org/whl/cu111/torch-1.9.1%2Bcu111-cp37-cp37m-linux_x86_64.whl (2041.3 MB)

# Install StyleGAN2-ada pytorch prerequisites
The black magic that makes the wizz bizz happen ;)

In [1]:
import torch

Check that PyTorch has recognised our connected graphics card.

To do this, run the following code, and we should get "1" in the console output.

In [2]:
torch.cuda.device_count()

1
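If you want a bit more detail than a bare count, here's a small sketch that also reports the device names. The function name `describe_cuda_devices` is made up for this example; the `torch.cuda` calls are standard PyTorch.

```python
def describe_cuda_devices():
    """Return a one-line summary of the CUDA devices PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "PyTorch is installed but sees no CUDA device"
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return f"{len(names)} CUDA device(s): " + ", ".join(names)


print(describe_cuda_devices())
```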

In [3]:
import torchvision

In [4]:
!pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3

Collecting pyspng
  Downloading pyspng-0.1.0-cp37-cp37m-manylinux2010_x86_64.whl (195 kB)
Collecting ninja
  Downloading ninja-1.10.2.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB)
Collecting imageio-ffmpeg==0.4.3
  Downloading imageio_ffmpeg-0.4.3-py3-none-manylinux2010_x86_64.whl (26.9 MB)
Installing collected packages: pyspng, ninja, imageio-ffmpeg
Successfully installed imageio-ffmpeg-0.4.3 ninja-1.10.2.2 pyspng-0.1.0


You're probably wondering what we just installed, so let me explain:
- **torchvision**: A package that contains a bunch of popular datasets, model architectures, and image transformations for computer vision (so things like the `EMNIST` dataset and so on).
- **click**: "Command Line Interface Creation Kit", or "CLICK", is a package that enables the creation of command line interfaces (more beautifully and more easily).
- **requests**: The requests library. It allows us to send HTTP requests and receive the responses.
- **tqdm**: One of my favourites, tqdm is a smart progress meter package. You can include these in your loops to show how things are progressing in your application (which you'll definitely need for StyleGAN).
- **pyspng**: Fast (and efficient) PNG decoder. It quickly and efficiently loads PNG files into numpy arrays.
- **ninja**: A small build system (with a focus on speed). Where other build systems are high-level languages, ninja aims to be an "assembler" that higher-level tools generate input for. Here, PyTorch uses it to compile StyleGAN2-ADA's custom CUDA ops the first time they're needed.
- **imageio-ffmpeg**: FFMPEG wrapper for python (necessary when we want to make videos from a bunch of images). 
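To double-check that everything in the list above actually landed in the current environment, here's a minimal sketch using only the standard library. The `REQUIRED` mapping and the function name are made up for this example; note that the pip name and the import name differ for `imageio-ffmpeg`.

```python
import importlib.util

# pip package name -> name used in `import ...`
REQUIRED = {
    "torch": "torch",
    "torchvision": "torchvision",
    "click": "click",
    "requests": "requests",
    "tqdm": "tqdm",
    "pyspng": "pyspng",
    "ninja": "ninja",
    "imageio-ffmpeg": "imageio_ffmpeg",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

An empty list from `missing_packages()` means you're good to go; anything else is what you still need to `pip install`.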


# Getting the StyleGAN code

Even though this is technically the hardest project on the projects sheet of paper, the worst of the demon magic is mostly done for us with the styleGAN2 package.

In [11]:
!git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git

Cloning into 'stylegan2-ada-pytorch'...
remote: Enumerating objects: 125, done.
remote: Total 125 (delta 0), reused 0 (delta 0), pack-reused 125
Receiving objects: 100% (125/125), 1.12 MiB | 21.68 MiB/s, done.
Resolving deltas: 100% (55/55), done.


In [None]:
mkdir /content/stylegan2-ada-pytorch/datasets

In [None]:
cd /content/stylegan2-ada-pytorch/datasets

/content/stylegan2-ada-pytorch/datasets


Check we're in the right directory

In [None]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is 429D-C529

 Directory of C:\Users\shado\Documents\COMP3710\content\stylegan2-ada-pytorch\datasets

05/10/2021  01:07 PM    <DIR>          .
05/10/2021  01:07 PM    <DIR>          ..
               0 File(s)              0 bytes
               2 Dir(s)  626,752,319,488 bytes free


# Getting the OASIS brain dataset of images
Let's get the dataset from Blackboard

In [None]:
!wget -c https://cloudstor.aarnet.edu.au/plus/s/tByzSZzvvVh0hZA/download -O oasis-preproc.zip

--2021-10-05 23:32:27--  https://cloudstor.aarnet.edu.au/plus/s/tByzSZzvvVh0hZA/download
Resolving cloudstor.aarnet.edu.au (cloudstor.aarnet.edu.au)... 202.158.207.20
Connecting to cloudstor.aarnet.edu.au (cloudstor.aarnet.edu.au)|202.158.207.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Syntax error in Set-Cookie: 5230042dc1897=5af2217a18275a9da6ff8734c9fb4753; path=/plus;; Secure at position 59.
Syntax error in Set-Cookie: oc_sessionPassphrase=yiBCz0ZZmzhgJ%2F5EP527a3uVg77yoKY%2FUtRh2hWxpY8inReb0RK2SLIsHAo3hS3MnmJ7JqKcBcs5EVLTnsuYtNDW%2Bs7JwTG2XKQaFqM8INgMF5lpRKRVKgVmB6FNgCTk; path=/plus;; Secure at position 168.
Length: 269958788 (257M) [application/zip]
Saving to: ‘oasis-preproc.zip’


2021-10-05 23:33:03 (8.99 MB/s) - ‘oasis-preproc.zip’ saved [269958788/269958788]



Or, if you're running this locally on Windows, you can use the following:

In [None]:
!curl https://cloudstor.aarnet.edu.au/plus/s/tByzSZzvvVh0hZA/download -o oasis-preproc.zip


Now unzip the file

In [None]:
!unzip oasis-preproc.zip

Streaming output truncated to the last 5000 lines.
  inflating: keras_png_slices_data/keras_png_slices_train/case_269_slice_31.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_269_slice_4.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_269_slice_5.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_269_slice_6.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_269_slice_7.nii.png  
 extracting: keras_png_slices_data/keras_png_slices_train/case_269_slice_8.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_269_slice_9.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_270_slice_0.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_270_slice_1.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_270_slice_10.nii.png  
  inflating: keras_png_slices_data/keras_png_slices_train/case_270_slice_11.nii.png

Or, if you're running this on Windows, use the following (for Windows 10 build 17063 or later):

In [None]:
!tar -xf oasis-preproc.zip

# Prepare OASIS dataset for use by styleGAN
While UQ was nice enough to preprocess the image data for us, it still isn't in the standard form that styleGAN expects (000001.png, 000002.png, etc)

Luckily, the (genius) folks at Nvidia have thought of this already, and included a nice `dataset_tool.py` file to do this formatting and conversion for us :)

In [None]:
cd ..

/content/stylegan2-ada-pytorch


Check where we are (windows)

In [None]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is 429D-C529

 Directory of C:\Users\shado\Documents\COMP3710\content\stylegan2-ada-pytorch

05/10/2021  01:07 PM    <DIR>          .
05/10/2021  01:07 PM    <DIR>          ..
05/10/2021  01:05 PM    <DIR>          .github
05/10/2021  01:05 PM                23 .gitignore
05/10/2021  01:05 PM             8,526 calc_metrics.py
05/10/2021  01:23 PM    <DIR>          datasets
05/10/2021  01:05 PM            18,320 dataset_tool.py
05/10/2021  01:05 PM    <DIR>          dnnlib
05/10/2021  01:05 PM               919 Dockerfile
05/10/2021  01:05 PM             1,234 docker_run.sh
05/10/2021  01:05 PM    <DIR>          docs
05/10/2021  01:05 PM             5,467 generate.py
05/10/2021  01:05 PM            16,824 legacy.py
05/10/2021  01:05 PM             4,518 LICENSE.txt
05/10/2021  01:05 PM    <DIR>          metrics
05/10/2021  01:05 PM             9,202 projector.py
05/10/2021  01:05 PM            25,927 README.md
05/10/2021  01:05 PM  

In [None]:
!python dataset_tool.py --source=./datasets/keras_png_slices_data/keras_png_slices_train --dest=./datasets/oasis-stylegan-dataset.zip

100% 9664/9664 [00:19<00:00, 485.40it/s]


You may be wondering:

"Harry, what are you doing!? You've saved all those images into a `.zip` file?!"

Yes.

If you check the **Compatibility** section on styleGAN2-ada-pytorch's [Github page](https://github.com/NVlabs/stylegan2-ada-pytorch), you'll come across the following:

"*New ZIP/PNG based dataset format for maximal interoperability with existing 3rd party tools*"

As well as

"*TFRecords datasets are no longer supported — they need to be converted to the new format.*"

What this means is we no longer have to convert images to the TensorFlow-specific `.tfrecords` format in order to load them into styleGAN.

This is great because it means I can be MORE lazy (thank you giga chads at Nvidia).
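Since the dataset is now just a plain ZIP of PNGs, you can sanity-check it with the standard library. A minimal sketch follows; the function name `count_pngs` is made up, and it makes no assumption about how `dataset_tool.py` names the files inside the archive.

```python
import zipfile


def count_pngs(zip_path):
    """Count the PNG entries in a ZIP archive (a path or a file-like object)."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for name in archive.namelist() if name.lower().endswith(".png"))
```

For the archive built above, `count_pngs("./datasets/oasis-stylegan-dataset.zip")` should match the number of images the progress bar reported (9664 in my run).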


# Create (progress) Folders on Google Drive
In the event that we accidentally close our browser, or the Colab runtime disconnects (because we were idle for too long), we will lose all of our training models and progress images :v .

To prevent this, the Giga chads at Nvidia have done it again.

Periodically, styleGAN2 will **pickle** our model.

We can save this pickle to Google Drive and resume training at a later date if we want.



In [None]:
mkdir /content/drive/MyDrive/COMP3710_report

mkdir: cannot create directory ‘/content/drive/MyDrive/COMP3710_report’: File exists


In [None]:
mkdir /content/drive/MyDrive/COMP3710_report/OASIS_training_data

# Time to T R A I N (the ride never ends...)
Here we summon the demons from Nvidia's basement, make a deal with them, and then have them train our model.

These demons don't speak English though, so we have to communicate with them using the following incantation:

In [None]:
!python train.py --outdir=/content/drive/MyDrive/COMP3710_report/OASIS_training_data --data=./datasets/oasis-stylegan-dataset.zip --gpus=1 --augpipe=bg --gamma=10 --cfg=paper256 --mirror=1 --snap=10 --metrics=none


Training options:
{
  "num_gpus": 1,
  "image_snapshot_ticks": 10,
  "network_snapshot_ticks": 10,
  "metrics": [],
  "random_seed": 0,
  "training_set_kwargs": {
    "class_name": "training.dataset.ImageFolderDataset",
    "path": "./datasets/oasis-stylegan-dataset.zip",
    "use_labels": false,
    "max_size": 9664,
    "xflip": true,
    "resolution": 256
  },
  "data_loader_kwargs": {
    "pin_memory": true,
    "num_workers": 3,
    "prefetch_factor": 2
  },
  "G_kwargs": {
    "class_name": "training.networks.Generator",
    "z_dim": 512,
    "w_dim": 512,
    "mapping_kwargs": {
      "num_layers": 8
    },
    "synthesis_kwargs": {
      "channel_base": 16384,
      "channel_max": 512,
      "num_fp16_res": 4,
      "conv_clamp": 256
    }
  },
  "D_kwargs": {
    "class_name": "training.networks.Discriminator",
    "block_kwargs": {},
    "mapping_kwargs": {},
    "epilogue_kwargs": {
      "mbstd_group_size": 8
    },
    "channel_base": 16384,
    "channel_max": 512,
    "n

(And use this in the event the demons speak windows)

In [None]:
!python train.py --outdir=OASIS_training_data --data=./datasets/oasis-stylegan-dataset.zip --gpus=1 --augpipe=bg --gamma=10 --cfg=paper256 --mirror=1 --snap=10 --metrics=none

2021-10-05 13:42:01.481219: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll



Training options:
{
  "num_gpus": 1,
  "image_snapshot_ticks": 10,
  "network_snapshot_ticks": 10,
  "metrics": [],
  "random_seed": 0,
  "training_set_kwargs": {
    "class_name": "training.dataset.ImageFolderDataset",
    "path": "./datasets/oasis-stylegan-dataset.zip",
    "use_labels": false,
    "max_size": 9664,
    "xflip": true,
    "resolution": 256
  },
  "data_loader_kwargs": {
    "pin_memory": true,
    "num_workers": 3,
    "prefetch_factor": 2
  },
  "G_kwargs": {
    "class_name": "training.networks.Generator",
    "z_dim": 512,
    "w_dim": 512,
    "mapping_kwargs": {
      "num_layers": 8
    },
    "synthesis_kwargs": {
      "channel_base": 16384,
      "channel_max": 512,
      "num_fp16_res": 4,
      "conv_clamp": 256
    }
  },
  "D_kwargs": {
    "class_name": "training.networks.Discriminator",
    "block_kwargs": {},
    "mapping_kwargs": {},
    "epilogue_kwargs": {
      "mbstd_group_size": 8
    },
    "channel_base": 16384,
    "channel_max": 512,
    "n

Some of these parameters might look a little confusing, so I'll explain:
- `--gpus=1`: The number of GPUs we're using (default is 1)
- `--augpipe=bg`: Augmentation pipeline. This parameter has a lot of subtleties, as it dictates which augmentations are applied to the images the discriminator sees (hence the ADA part of the styleGAN2 package). The option we're using is `bg`, which enables pixel blitting and geometric augmentations, but disables colour, filter, noise, and cutout.
- `--gamma=10`: Overrides the R1 regularisation weight (gamma) that the chosen config would otherwise set.
- `--cfg=paper256`: Selects the base training configuration. `paper256` is Nvidia's preset for 256x256 datasets, which matches our brain slices (the resolution itself comes from the dataset, see `"resolution": 256` in the training options above).
- `--mirror=1`: Amplifies the dataset with x-flips (`1` turns this on). Often beneficial, even with ADA, as it introduces more variation.
- `--snap=10`: Snapshot interval. Controls how many ticks pass between saving snapshots to `--outdir` (Google Drive in our case).
- `--metrics=none`: By default, Frechet Inception Distance (FID) is evaluated for each pickle and the score is logged in `metric-fid...json`. Since we don't really care how 'good' the model is, we turn that off.
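If you end up tweaking these options a lot, it can be easier to keep them in Python and build the command from there. Here's a minimal sketch; the function name `build_train_command` is made up, and it only uses the flags already shown above.

```python
import shlex


def build_train_command(outdir, data, **options):
    """Build the `python train.py ...` argument list from keyword options."""
    args = ["python", "train.py", f"--outdir={outdir}", f"--data={data}"]
    args += [f"--{name}={value}" for name, value in options.items()]
    return args


command = build_train_command(
    outdir="OASIS_training_data",
    data="./datasets/oasis-stylegan-dataset.zip",
    gpus=1,
    augpipe="bg",
    gamma=10,
    cfg="paper256",
    mirror=1,
    snap=10,
    metrics="none",
)
print(shlex.join(command))
```

The printed line is the same incantation as the Windows cell above, and the list form can be handed straight to `subprocess.run(command)`.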

# And now, we wait... for hours :v
Training can take DAYS, WEEKS, or even MONTHS (imagine trying to train a discriminator for self driving cars lol)

Results will be stored in the `drive/MyDrive/COMP3710_report/OASIS_training_data` folder (images and pickles).

Each time you run the above code, it will store the results in a new directory
(e.g. the first time you run it, the results go in `00000-whateverYouNamedThisThing...`, the next time in `00001-whateverYouNamedThisThing...`, and so on).

Inside these directories, you'll see a bunch of files.
- `reals.png`: Shows a sample of the training dataset (in a nice mosaic layout).
- `fakes000000.png`: Shows a sample of the generated images produced by the generator (also in a nice mosaic layout).
- `network-snapshot-X.pkl` is the pickled model which we use to generate all those 'fake' images.  
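To pick training back up (or to generate images) you usually want the newest pickle from the newest run. Here's a minimal sketch, assuming the zero-padded `00000-...` run folders and `network-snapshot-X.pkl` names described above, so that plain alphabetical sorting puts the latest one last. The function name `latest_snapshot` is made up for this example.

```python
from pathlib import Path


def latest_snapshot(outdir):
    """Return the newest network snapshot under `outdir`, or None if there is none.

    Relies on zero-padded run and snapshot numbers, so sorted order == age order.
    """
    snapshots = sorted(Path(outdir).glob("*/network-snapshot-*.pkl"))
    return snapshots[-1] if snapshots else None
```

The resulting path can then be passed to `train.py` via its `--resume` option to continue training from that pickle.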

# This is taking too long!
You'll probably reach a point where you've been waiting for a million years and the job still isn't done.

That's OK! We can make the computer go faster by leveraging UQ's **GOLIATH HPC cluster**.

In order to communicate with the cluster, we need to write what's called a `SLURM` script.

A `SLURM` script is just a bash file with a bunch of instructions that the cluster reads in order to set up the environment.

If you're lazy like me, you can use [this](https://www.hpc.iastate.edu/guides/classroom-hpc-cluster/slurm-job-script-generator) nifty website in order to generate a script for you :)

Now, you might be wondering.

Where are all my files going to go?

The answer to that is Google Drive.



# A bit about SLURM
Slurm consists of a daemon (called `slurmd`) running on each compute node, and a central daemon (called `slurmctld`) running on a management node (with an optional fail-over twin, similar to webservers).

The `slurmd` daemons provide fault-tolerant hierarchical communications.

Information can be queried using several user commands:
- `sacct`: Displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database ([more info](https://slurm.schedmd.com/sacct.html))
- `salloc`: Obtains a Slurm job allocation (a set of nodes), executes a command, and then releases the allocation when the command is finished ([more info](https://slurm.schedmd.com/salloc.html)).
- `sattach`: Attaches to a running Slurm job step. By attaching, it makes available the IO streams of all the tasks of a running Slurm job step. It is also suitable for use with a parallel debugger like `TotalView` ([more info](https://slurm.schedmd.com/sattach.html)).
- `sbatch`: Submits a batch script to Slurm. The batch script may be given to sbatch through a file name on the command line, or if nothing is provided, sbatch will read in a script from standard input (useful for piping). The batch script is the file with all those `#SBATCH` directives at the top. The script will typically contain one or more `srun` commands to launch parallel tasks ([more info](https://slurm.schedmd.com/sbatch.html)).
- `sbcast`: Used to transmit a file to all nodes allocated to the current active Slurm job. This command should only be executed from within a Slurm batch job or within the shell spawned after a Slurm job's resource allocation. It can also be used to transfer a file from local disk to local disk on the nodes allocated to a job. This can be used to effectively use diskless compute nodes or provide improved performance relative to a shared file system ([more info](https://slurm.schedmd.com/sbcast.html)).
- `scancel`: Used to signal (oooo CSSE2310 signals) or cancel jobs, job arrays or job steps. An arbitrary number of jobs or job steps may be signaled using job-specific filters or a space-separated list of specific job and/or job step IDs ([more info](https://slurm.schedmd.com/scancel.html)).
- `scontrol`: Used to view or modify Slurm configurations including: job, job step, node, partition, reservation, and overall system configuration (NOTE: most commands can only be executed by an administrator) ([more info](https://slurm.schedmd.com/scontrol.html)).
- `sinfo`: Used to view partition and node information for a system running Slurm (so state information) ([more info](https://slurm.schedmd.com/sinfo.html)).
- `smap`: Graphically view information about Slurm jobs, partitions, and set configuration parameters (nice).
- `squeue`: Used to view job and job step information for jobs managed by Slurm. This is different to sinfo as it has a wider variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order ([more info](https://slurm.schedmd.com/squeue.html)).
- `srun`: Runs a parallel job on a cluster managed by Slurm. If necessary, srun will first create a resource allocation in which to run the parallel job. It can be used to submit a job for execution, or initiate job steps in real time. It has a bunch of options for specifying resource requirements. (Also note that a job can contain **multiple job steps** executing **sequentially** or in **parallel** on independent or shared resources within the job's node allocation) ([more info](https://slurm.schedmd.com/srun.html)).
- `strigger`: Used to set, get or view Slurm trigger information. Triggers include events such as a node failing (going down), a job reaching its time limit or a job terminating. These events can cause actions such as the execution of an arbitrary script ([more info](https://slurm.schedmd.com/strigger.html)).
- `sview`: Used to view Slurm configuration, job, step, node and partition state information (all in a nice GUI). Authorized users can also modify select information (cool for debugging). Also note that this requires **GTK** to be installed, which may or may not be available on the system ([more info](https://slurm.schedmd.com/sview.html)).

The entities managed by these Slurm daemons include:
- **nodes**: The compute resource in Slurm.
- **partitions**: Group nodes into logical (possibly overlapping) sets.
- **jobs**: Allocations of resources assigned to a user for a specified amount of time.
- **job steps**: Sets of (possibly parallel) tasks within a job.

Partitions can be considered **job queues**.

Each partition is constrained by a number of factors (oh boy, more Lagrange stuff):
- job size limit
- job time limit
- users permitted to use it
- etc.

**Priority-ordered jobs** are **allocated nodes** within a partition until the resources (nodes, processors, memory, etc) within that partition are exhausted.

Once a job is assigned a set of nodes, the user is able to initiate parallel work in the form of job steps in any configuration within the allocation.

For example:
- A single job step may be started that utilizes all nodes allocated to the job.
- Several job steps may independently use a portion of the allocation.

# CREATING A JOB
A job is formed of two sections: **resource request** and **job steps**.

**Resource requests** involve specifying:
- The required number of CPUs/GPUs
- Expected job duration
- Amounts of RAM
- Disk space
- etc.

**Job steps** involve describing what needs to be done (i.e. computing steps, which software to run, parameter space, etc).

Typically a job is created via a submission script (e.g. a `.sh` script).

The very first line of the submission file has to be the shebang (e.g. `#!/bin/bash`). Next come the `#SBATCH` directives. After those, you can add any other lines.

**Comments** (lines starting with `#`) **prefixed with SBATCH** (i.e. `#SBATCH`) at the beginning of a bash script are understood by SLURM as **parameters describing resource requests and other submission options**.

The script itself is a job step. Other job steps are created with the srun command.

Example (we'll call this script `submit.sh`):
```
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#
#SBATCH --ntasks=1
#SBATCH --time=60:00
#SBATCH --mem-per-cpu=200

srun hostname
srun sleep 60
```
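If you'd rather keep the resource request in Python (handy when you want to tweak it between runs), here's a minimal sketch that renders the same kind of script. The function name `render_sbatch_script` is made up; the directive names and values are the ones from the `submit.sh` example.

```python
def render_sbatch_script(directives, commands):
    """Render a submission script: shebang, #SBATCH lines, then the job steps."""
    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH --{name}={value}" for name, value in directives.items()]
    lines.append("")
    lines += commands
    return "\n".join(lines) + "\n"


script = render_sbatch_script(
    {"job-name": "test", "output": "res.txt", "ntasks": 1, "time": "60:00", "mem-per-cpu": 200},
    ["srun hostname", "srun sleep 60"],
)
print(script)
```

Writing `script` to a file named `submit.sh` gives a script equivalent to the one above, minus the decorative `#` lines.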
Now we submit this job to the queue.
When we hit enter, we'll get a message saying the job has been submitted, along with a job id.
```
> sbatch submit.sh
sbatch:  Submitted batch job 99999999
```
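That job id is what you'll need later for commands like `squeue` or `scancel`, so it can be handy to capture it from a script. Here's a minimal sketch; the function names are made up, and the parser only assumes the `Submitted batch job <id>` message shown above.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Pull the numeric job id out of sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job id found in: {sbatch_output!r}")
    return int(match.group(1))


def submit(script_path):
    """Submit a script with sbatch and return its job id (needs Slurm installed)."""
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)
```

On the cluster, `submit("submit.sh")` returns the id as an integer.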

Once a job has been submitted to a queue with `sbatch`, execution will follow these steps/states:
- **PENDING**: The job first enters the queue in the PENDING state.
- **RUNNING**: Once resources become available, and the job has highest priority, an allocation is created for it, and it goes to the RUNNING state.
- If the job completes correctly, it goes to the **COMPLETED** state, otherwise, it is set to the **FAILED** state.

# PARALLEL JOBS
Parallel jobs (i.e. tasks run simultaneously) can be created in several ways (this isn't relevant for this section, but is cool to know).

Examples of parallel jobs include:
- **A multi-process program** (Single Program, Multiple Data (SPMD) paradigm, e.g. with MPI)
- **A multi-threaded program** (shared memory paradigm, e.g. with OpenMP or pthreads)
- **Several instances of a single-threaded program** (embarrassingly parallel paradigm or a job array)
- **One master program controlling several slave programs** (master/slave paradigm)

In the context of SLURM,
- A task represents a process
- A multi-process program is made of several tasks
- By contrast, a multi-threaded program is composed of only one task, which uses several CPUs.

Tasks are requested/created with the `--ntasks` option, while CPUs, for the multithreaded programs, are requested with the `--cpus-per-task` option.
- Tasks cannot be split across several compute nodes, so requesting several CPUs with the `--cpus-per-task` option will ensure all CPUs are allocated on the same compute node.
- By contrast, requesting the same number of CPUs with the `--ntasks` option may lead to several CPUs being allocated on several distinct compute nodes.
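To make the difference concrete, here's a tiny sketch that maps "how many processes, how many threads each" onto the two options. The function name `parallel_directives` is made up for this example.

```python
def parallel_directives(processes=1, threads_per_process=1):
    """Return the #SBATCH lines for a given process/thread layout."""
    return [
        f"#SBATCH --ntasks={processes}",
        f"#SBATCH --cpus-per-task={threads_per_process}",
    ]


# Multi-process program (e.g. MPI): 4 tasks, which may land on different nodes.
print(parallel_directives(processes=4))

# Multi-threaded program (e.g. OpenMP): 1 task with 4 CPUs on the same node.
print(parallel_directives(threads_per_process=4))
```

Both layouts ask for 4 CPUs in total; only the second guarantees they all sit on one compute node.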

# SCRIPT EXAMPLES
As none of us 100% understand what's going on unless we can see patterns in the examples, here are some cool examples of SLURM scripts.

# EXAMPLE 1. MPI
Let's begin with a simple MPI example: Hello world.

Wikipedia has a nice [example](https://en.wikipedia.org/wiki/Message_Passing_Interface#Example_program) of a working MPI program, so let's just copy that.
Save this code as `wiki_mpi_example.c`

Next we need to make an `sbatch` script.
We can either compile the program in advance (possibly better) or compile it in the sbatch script just before running it.

Because I usually forget how to write `Makefile`s, we're going to do this in the sbatch script; in practice, however, you'd write, debug and compile your program on your own PC first.

We can send this script to the queue by executing the command
```
sbatch example_mpi.sbatch
```
Our `sbatch` script will look like the following (call the script `example_mpi.sbatch`):

```
#!/bin/bash

#SBATCH --job-name=test_mpi
#SBATCH --output=res_mpi.txt

# Request 4 tasks (one CPU each)
#SBATCH --ntasks=4

# Request 10 minutes of compute time
#SBATCH --time=10:00

# Request 100 MiB of memory per CPU
#SBATCH --mem-per-cpu=100

# Modules that are installed on the node (similar to the modules you get via sudo apt-get install ...)
module load gcc/6.4.0
module load openmpi/3.0.0

# Compile the C program
mpicc wiki_mpi_example.c -o hello.mpi

# Launch the mpi program
srun hello.mpi
```

Then we can submit this to the queue using the above command.

# EXAMPLE 2. GPU Job
Similar idea as example 1, but the script will look like this

```
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --output=res_gpu.out
#SBATCH --error=res_gpu.err

#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@address.com
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=8

#SBATCH --distribution=cyclic:cyclic
#SBATCH --mem-per-cpu=7000mb
#SBATCH --partition=gpu
#SBATCH --gpus=tesla:4
#SBATCH --time=00:30:00

module purge
module load cuda/10.0.130 intel/2018 openmpi/4.0.0 vasp/5.4.4

srun --mpi=pmix_v3 vasp_gpu
```



# Enough Talk, show us some brains.
For that, you're going to have to run the `test_script.py` file.

This file uses the generator from our latest pre-trained network snapshot.

This allows us to generate nice pictures of brains on pretty normal hardware.

You can find the script at `recognition\stylegan2-ada-python\test_script.py`.
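The script itself isn't reproduced here, so the following is only a minimal sketch of the idea, following the usage pattern from the NVlabs README rather than the actual `test_script.py`. The function name, snapshot path and output filename are placeholders, and it assumes you run it from inside the `stylegan2-ada-pytorch` folder (unpickling needs the repo's `dnnlib` and `torch_utils` modules) on a machine with a CUDA GPU.

```python
import pickle


def generate_brain(snapshot_path, out_path="brain.png", seed=0):
    """Generate one image from a pickled snapshot and save it as a PNG."""
    # Heavy imports stay inside the function so this file can be imported anywhere.
    import torch
    import PIL.Image

    with open(snapshot_path, "rb") as f:
        G = pickle.load(f)["G_ema"].cuda()  # moving-average copy of the generator

    torch.manual_seed(seed)
    z = torch.randn([1, G.z_dim]).cuda()    # random latent code
    img = G(z, None)                        # None = no class labels (unconditional)

    # Generator output is NCHW and roughly in [-1, 1]; convert to HWC uint8.
    img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)
    pixels = img[0].cpu().numpy()
    if pixels.shape[-1] == 1:               # single-channel (grayscale) output
        pixels = pixels[:, :, 0]
    PIL.Image.fromarray(pixels).save(out_path)
    return out_path
```

Calling `generate_brain("path/to/network-snapshot-X.pkl")` with one of your own pickles writes a single image; changing `seed` gives a different brain.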




# Closing Notes

Hopefully you now have some cool looking brains.

Thanks for reading :)