
High-Performance Computing at the University of Sheffield

Will Furnass

Research Software Engineering team, University of Sheffield

https://rse.shef.ac.uk/rse-dcs-pres-on-hpc/


Outline

  1. What is HPC? Why bother?
  2. HPC at TUOS
  3. Using HPC
  4. Further resources

Personal machines for research

Lenovo Thinkpad

  • Edit and run code in one place
  • ~4 cores - some parallelism
  • Full control!

but...


Limitations of personal machines

  • Limited RAM
  • Basic CPU
  • Limited number of cores
  • Limited, fragile storage
  • Modest GPU
  • Limited network connection

Limitations of personal machines

  • How to distribute work between laptops?
  • How to run a series of tasks overnight?
    • "1am: Check if run 3 finished & start run 4"
    • "3am: Check if run 4 finished & start run 5"
    • "5am: Check if run 5 finished & start run 6"

HPC: how is it different?

'HPC': a computer cluster with:

  • Computing resources
    • many nodes
    • each with many cores, much RAM, maybe GPUs
    • connected by fast networking
  • Storage
    • both shared and per-node
    • resilient and fast
  • Job management
    • Queue up jobs to run over a week
  • Command-line as default interface
  • Linux OS + optimised research software

TUOS Clusters

  • Bessemer:
    • Newest hardware
    • Best for single-node jobs
  • ShARC:
    • Good for multi-node jobs as it has high-bandwidth, low-latency interconnects

DCS HPC resources

  • Majority of nodes public (free at point of use)
  • But DCS has some private nodes in Bessemer and ShARC:
    • Non-standard hardware specs
    • Less contention (sometimes!)

Bessemer DCS node specs


ShARC DCS group nodes


Cluster structure

Cluster diagram


Jupyter

Can also run Jupyter Notebooks on the cluster!

Ask for more info


Moving data to/from HPC

  • SSH-based methods are your friends here (rsync, scp, sftp); see the sketch after this list
  • Or
    • Use a storage area directly accessible to both your local machine and HPC?
    • Just use HPC?
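For example, a minimal rsync sketch (the te1st username and the results paths are placeholders; bessemer.sheffield.ac.uk is the login host used in the examples later in these slides):

# Push a local results directory to your home area on Bessemer
[me@mylaptop ~]$ rsync -av ./results/ te1st@bessemer.sheffield.ac.uk:results/

# Pull it back again later
[me@mylaptop ~]$ rsync -av te1st@bessemer.sheffield.ac.uk:results/ ./results/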

Storage

| Location           | Shared?      | Quota  | Backups? | Multi-HPC? |
|--------------------|--------------|--------|----------|------------|
| /home/$USER        | ✓            | 10 GB  | ✓        | ✓          |
| /data/$USER        | ✓            | 100 GB | ✓        | ✓          |
| /fastdata/$USER    | ✓            | -      | ✗        | ✗          |
| /scratch           | ✗ (per-node) | -      | ✗        | ✗          |
| /shared/$PROJNAME  | ✓            | 10 TB  | ✓        | ✓          |

Storage

| Location           | Remote access? | Speed | Suited to             |
|--------------------|----------------|-------|-----------------------|
| /home/$USER        | SSH            | >     | Personal data         |
| /data/$USER        | SSH            | >     | Personal data         |
| /fastdata/$USER    | SSH            | >>>   | Temporary big files   |
| /scratch           | -              | >>>   | Temporary small files |
| /shared/$PROJNAME  | SSH + CIFS     | >     | Project files         |

Running jobs

  • Users submit jobs to a job scheduler
    • e.g. Slurm (Bessemer) or SGE (ShARC)
    • A distributed resource manager
    • Not intuitive!
    • Very powerful
  • Request
    • Interactive or batch job
    • Run time (e.g. 2h or 4d)
    • Computational resources (cores, RAM, GPUs)
    • Access to private resources
    • Notifications
  • Type of job
    • Interactive sessions (if resources are available)
    • Batch jobs (submit job to a queue)

Example interactive session

[me@mylaptop ~]$ ssh te1st@bessemer.sheffield.ac.uk
...
[te1st@bessemer-login1 ~]$ srun \
  --partition=dcs-gpu-test \
  --account=dcs-res \
  --cpus-per-gpu=4 \
  --mem-per-cpu=2G \
  --gpus=1 \
  --pty \
  /bin/bash

[te1st@bessemer-node030 ~]$ ./my_simulation_program --num-cores=4
...

Example batch job script (Bessemer/Slurm)

Create a shell script, my-job-script.slurm:

#!/bin/bash
#SBATCH --partition=dcs-gpu
#SBATCH --account=dcs-res
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=2G
#SBATCH --gpus=4
#SBATCH --mail-user=me@sheffield.ac.uk

./my_simulation_program --num-cores=16

Copy this file to Bessemer then log on to Bessemer and submit this to Slurm:

[me@mylaptop ~]$ ssh te1st@bessemer.sheffield.ac.uk
[te1st@bessemer-login1 ~]$ sbatch my-job-script.slurm

Now go home for dinner!


After submitting a job

You can then:

  • Wait for an email notification
  • Check status (running/queueing)
  • Cancel/amend job
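For example, with Slurm on Bessemer (the job ID 123456 is a placeholder):

# List your queued and running jobs
[te1st@bessemer-login1 ~]$ squeue -u $USER

# Cancel a job you no longer need
[te1st@bessemer-login1 ~]$ scancel 123456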

Resource estimation

  1. Run short test jobs
  2. View resource utilisation
  3. Extrapolate
  4. Submit larger jobs
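As a sketch, Slurm's accounting tools help with step 2 (the job ID is a placeholder; seff is a contributed utility that is only present if installed on the cluster):

# Elapsed time, peak memory and CPU time of a finished test job
[te1st@bessemer-login1 ~]$ sacct -j 123456 --format=JobID,Elapsed,MaxRSS,TotalCPU

# One-screen efficiency summary, if available
[te1st@bessemer-login1 ~]$ seff 123456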

Software on TUOS HPC

Centrally-installed, optimised software

  • Compilers, libraries, apps, dev tools etc

  • Activate a package by loading a modulefile, e.g.

    module load $MODULENAME

For cuDNN, for example, $MODULENAME could be one of:

libs/cudnn/4.0/binary-cuda-7.5.18
libs/cudnn/5.1/binary-cuda-7.5.18
libs/cudnn/5.1/binary-cuda-8.0.44
libs/cudnn/6.0/binary-cuda-8.0.44
libs/cudnn/7.0/binary-cuda-8.0.44
libs/cudnn/7.0/binary-cuda-9.1.85
...
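For instance, to load one of the cuDNN builds above and confirm it is active (module load and module list are standard environment-modules commands):

module load libs/cudnn/7.0/binary-cuda-9.1.85
module list   # shows all currently-loaded modulefiles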

Software on TUOS HPC

Manage your own software

Several options:

  • Install non-optimised binary packages in e.g. your home directory
    • Conda
  • Build optimised software stacks from source
    • Spack, EasyBuild
  • Run containers
    • Singularity - similar to Docker

All useful for e.g. provisioning/using complex Deep Learning software stacks!
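As a hedged conda sketch (the Anaconda3 modulefile name and the environment contents are assumptions; check your cluster's documentation for the exact module name):

# Load a centrally-installed conda (modulefile name is cluster-specific)
module load Anaconda3

# Create and activate a personal environment in your home directory
conda create -n deeplearn python=3.9 numpy -y
source activate deeplearn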


Optimisation and parallelisation

  • Laptop may be faster than single-core job on HPC:

    • CPUs in servers run at lower clock speeds
    • Stock executables may not exploit advanced CPU features
  • Performance often comes from >=1 of:

    • Optimising for CPU architecture
    • CPU parallelism (multiple cores, multiple nodes)
    • Accelerators (GPUs, TPUs, Xeon Phi etc)

Optimising for CPU architecture

  • In Bessemer and ShARC the CPUs support
    • hardware vectorisation (same instruction applied to multiple elements in memory)
    • fused multiply-add (useful for matrix multiplication)
  • Either use pre-compiled libraries that can dynamically use these
    • e.g. Intel Math Kernel Library (MKL)
  • or compile to produce builds optimised for those CPUs
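A minimal GCC sketch of the second option (the flags are illustrative; what is appropriate depends on your compiler and the CPUs of the nodes you will run on):

# -O3 -march=native optimises for the build host's full instruction set
# (vectorisation, fused multiply-add, ...)
gcc -O3 -march=native -o my_simulation_program my_simulation_program.c

Note that a binary built with -march=native may fail on nodes with older CPUs than the build host, so compile on (or target) the node type you will run on.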

CPU parallelism

(At least) 5 flavours:

Shared memory

  • Single node
  • Multiple CPU cores
  • Typically 1 thread per core
  • Thread-local and shared variables
  • Many applications do this via OpenMP and/or Intel MKL
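A minimal Slurm sketch for a shared-memory job (my_openmp_program is a placeholder and is assumed to honour OMP_NUM_THREADS):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8

# Run one OpenMP thread per allocated core
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program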

Distributed memory (single node)

  • Multiple CPU cores
  • Typically 1 process per core
  • Separate address spaces
  • Data (and code?) passed between processes
  • e.g. joblib w/ multiprocessing or ipyparallel; MATLAB parfor; R parallel/foreach

Distributed memory (multiple nodes)

  • Multiple CPU cores per node (symmetric?)
  • Typically 1 process per core
  • Separate address spaces per process
  • Data (and code?) passed between processes
    • within a node
    • between nodes
  • V. fast interconnects between nodes
  • Facilitated by
    • MPI (API + software for exploiting fast interconnects)
    • Apps/libs that understand MPI (ipyparallel, PETSc, MATLAB DCE)
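A hedged multi-node sketch, written in Slurm syntax for consistency with the earlier examples (my_mpi_program is a placeholder; ShARC itself uses SGE, so its directives differ):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# srun starts one MPI rank per allocated task, spread across both nodes
srun ./my_mpi_program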

Task arrays

  • Set of near identical tasks
  • Embarrassingly parallel
  • Each task is scheduled separately by the job scheduler
  • Which then packs out its schedule with them!
  • Great for sensitivity analyses
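A minimal Slurm task-array sketch (my_task_program and the 0-99 index range are placeholders; each array element runs as its own job):

#!/bin/bash
#SBATCH --array=0-99

# $SLURM_ARRAY_TASK_ID gives each element a unique index
./my_task_program --run-id=$SLURM_ARRAY_TASK_ID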

Accelerators, specifically CUDA

  • Massive data parallelism
  • Very effective for linear-algebra-heavy ops
    • ML, DL
  • Can either write low-level code in CUDA
  • Or use higher-level libs that speak CUDA
    • TensorFlow, PyTorch etc for DL

Other options

High-level APIs for working with large datasets, possibly out of core:

  • Spark
  • Dask

Going bigger!

  • Potential issues
    • Jobs too big / queue times too long for ShARC?
    • Want newer GPUs/processors?

  • Options
    • JADE/JADE2: Tier 2 HPC facilities for Deep Learning
      • JADE: 22x DGX-1 systems, each with 8x NVIDIA V100 GPUs (NVLink between GPUs within a node)
      • JADE2 (pilot phase): similar to JADE but with 63 nodes instead of 22
    • Bede: new N8 Tier 2 HPC facility for distributed DL/ML
      • 32x IBM AC922 nodes (2x POWER9 CPUs; 4x V100 GPUs; NVLink between GPUs and CPUs)
      • 4x IBM IC922 'inference' nodes with T4 GPUs
      • 100Gbps Infiniband EDR interconnects
      • Better suited to hybrid CPU+GPU codes and scaling to multiple nodes than JADE/JADE2


Learning more / getting help

  • IT Services' helpdesk
  • Talks
    • LunchBytes talks
  • Code Clinic
    • Book an appointment to get help with a coding issue
  • Hire an RSE to help with your project(s)!
    • Either as part of a grant proposal
    • Or just for a few days

For more info (inc. mailing list and events schedule) see https://rse.shef.ac.uk/.


The RSE team

  • 13.5 RSEs
  • Team kick-started by 2x EPSRC RSE fellowships
  • Based in Computer Science
  • but work closely with IT Services
  • Some current and recent projects:
    • High-performance agent-based modelling (CUDA)
    • Deep learning and workflows for NLP
    • MRI image alignment (registration) software (C++/PETSc)
    • Agile web apps for visualising datasets (R/Shiny)
    • Augmenting cell modelling software (C++)

Getting in touch

 https://rse.shef.ac.uk

 rse@sheffield.ac.uk

 @RSE-Sheffield

 @RSE_Sheffield