# R in Triton

## Running R in an interactive session

R is installed as a module on Triton, so you need to load it with

```sh
module load r
```

before running `R`. The base `r` module does not include, for example, the `tidyverse` packages. To load the environment used in this course, use

```sh
module load r-triton
```

It contains some of the most popular packages.

After loading the environment, you can start the R interpreter with

```sh
R
```


## Installing packages in Triton

By default, R tries to install packages into the system library paths and, when that fails, falls back to your home folder. This is problematic because the quota of your home folder is quite low. To avoid this, you need to specify your own library path.

You can do it by creating a folder in your work folder (replace `tuomiss1` with your own username) with

```sh
mkdir /scratch/work/tuomiss1/Rlibs
```

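and adding the following lines to two configuration files in your home folder `~/`. The `%V` in the path expands to the R version number, so each R version gets its own library folder.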

`~/.Renviron`:
```sh
R_LIBS_USER=/scratch/work/tuomiss1/Rlibs/%V
```

`~/.Rprofile`:
```r
.libPaths(c(Sys.getenv("R_LIBS_USER"), .libPaths()))
```
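
R only keeps library directories that actually exist, so it can be worth checking which paths are active after changing the configuration. The following is a minimal sketch of such a check. It uses a temporary directory as a hypothetical stand-in for `/scratch/work/<username>/Rlibs/<R version>`, so it can be run anywhere without touching your real configuration.

```r
# Hypothetical stand-in for the versioned library folder on Triton
lib_dir <- file.path(tempdir(), "Rlibs", as.character(getRversion()))

# The directory must exist, otherwise .libPaths() ignores it
dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)

# Same call as in ~/.Rprofile, but with the stand-in directory
.libPaths(c(lib_dir, .libPaths()))

# The first entry is where install.packages() installs by default
first_lib <- .libPaths()[1]
print(first_lib)
```

On Triton you would simply run `.libPaths()` in a fresh R session and check that the folder under `/scratch/work` is listed first.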

## Running R in a Slurm job

Running R through Slurm is quite easy. One simply needs to write a wrapper script like `R_submit.slrm`

```sh
#!/bin/bash
#SBATCH -p short
#SBATCH -t 00:05:00
#SBATCH -n 1
#SBATCH --mem=100
# Write job output to its own file (not to the R script itself)
#SBATCH -o script.out
module load r-triton
srun Rscript --vanilla script.R
```

and submit it with `sbatch R_submit.slrm`.

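`Rscript` is a scripting front-end for R. Unlike `R CMD BATCH`, which writes output to an `.Rout` file, it prints output to stdout, so the output ends up in the file given with `-o`. Note that `--vanilla` also makes R skip `~/.Renviron` and `~/.Rprofile`, so a library path configured there is not used. Drop the flag if the script needs packages from your own library.

To minimize loading time, `Rscript` may not load all base packages, depending on the R version (older versions skipped `methods`). One can set the list of packages to load at startup with `--default-packages=list`, where `list` is a comma-separated list of package names.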

# Parallel runs with Triton

## Big datasets that take long to analyze

Check out [this listing](https://cran.r-project.org/web/views/HighPerformanceComputing.html) of various packages that can be used to run R programs with multiple cores.

Some packages worth looking up:

1. `parallel` - parallel versions of the apply functions.
2. `multidplyr` - Tidyverse-style mapping functions.
3. `data.table` - has its own mapping-style functions with OpenMP parallelism.
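
As an illustration of the first item, here is a minimal sketch using `mclapply` from the `parallel` package. It assumes the job was submitted with `--cpus-per-task` (Slurm only sets `SLURM_CPUS_PER_TASK` in that case) and falls back to one core otherwise. The function `slow_square` is a made-up placeholder for a real analysis step.

```r
library(parallel)

# Number of cores granted by Slurm; "1" when the variable is not set
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

# Placeholder for an expensive, independent computation
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Fork-based parallel lapply; with mc.cores = 1 it runs sequentially
results <- mclapply(1:8, slow_square, mc.cores = n_cores)

print(unlist(results))
```

Because `mclapply` relies on forking, it works on Linux systems such as Triton but cannot use more than one core on Windows.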

## Lots of independent datasets

If you have lots of data that needs to go through the same analysis program, the recommended workflow is to:

1. Pre-process the data close to where it is stored (interactively/with Slurm job)
2. Store the data in reasonably sized chunks as .Rds or .feather files
3. Create a Slurm array job that goes through the data based on `SLURM_ARRAY_TASK_ID`
4. Use `Sys.getenv("SLURM_ARRAY_TASK_ID")` to get the task ID during program runtime
5. Let each worker analyze part of the data and write output as .Rds or .feather
6. Post-process the data (interactively/with Slurm job)
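
Steps 4 and 5 could look roughly like the following inside `script.R`. This is only a sketch: the file names `chunk_<id>.Rds` and `result_<id>.Rds`, the temporary directory and the toy data are assumptions made for illustration, not part of the course material.

```r
# Task ID set by Slurm; defaults to "1" so the sketch also runs outside a job
task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

# Hypothetical data folder; on Triton this would be under your work folder
data_dir <- file.path(tempdir(), "chunks")
dir.create(data_dir, showWarnings = FALSE)

# Stand-in for the pre-processing step: create a toy chunk if it is missing
in_file <- file.path(data_dir, sprintf("chunk_%d.Rds", task_id))
if (!file.exists(in_file)) {
  saveRDS(data.frame(x = (1:10) * task_id), in_file)
}

# Each array task reads only its own chunk ...
chunk <- readRDS(in_file)

# ... analyzes it (here: a simple mean) ...
result <- data.frame(task_id = task_id, mean_x = mean(chunk$x))

# ... and writes its own output file, so tasks never overwrite each other
out_file <- file.path(data_dir, sprintf("result_%d.Rds", task_id))
saveRDS(result, out_file)
```

The post-processing step can then read all `result_*.Rds` files and combine them into one data frame.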

Example `R_submit_array.slrm` would be something like

```sh
#!/bin/bash
#SBATCH -p short
#SBATCH -t 00:05:00
#SBATCH -n 1
#SBATCH --mem=100
# One output file per array task (%a is the array task ID)
#SBATCH -o script_%a.out
#SBATCH --array=1-5
module load r-triton
srun Rscript --vanilla script.R
```

One can use `slice` to separate a dataset into chunks [[slice]](http://dplyr.tidyverse.org/reference/slice.html).
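
The chunking itself can be sketched as follows. To stay free of package dependencies, this example uses base R's `split` instead of dplyr's `slice`. The data frame, the number of chunks and the file names are hypothetical.

```r
# Number of chunks; should match the range given to --array (here 1-5)
n_chunks <- 5L

# Toy dataset standing in for the real data
df <- data.frame(id = 1:23, value = sqrt(1:23))

# Assign each row to one of n_chunks contiguous, roughly equal-sized chunks
chunk_id <- ceiling(seq_len(nrow(df)) * n_chunks / nrow(df))

# Split into a list of data frames, one per chunk
chunks <- split(df, chunk_id)

# Save each chunk under a name that an array task can derive from its task ID
out_dir <- file.path(tempdir(), "chunks_demo")
dir.create(out_dir, showWarnings = FALSE)
for (i in names(chunks)) {
  saveRDS(chunks[[i]], file.path(out_dir, sprintf("chunk_%s.Rds", i)))
}
```

Each array task can then load `chunk_<SLURM_ARRAY_TASK_ID>.Rds` without needing to know about the other chunks.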

# Exercise time:

Do the exercise described below as a Slurm array job, following the approach described above. Split the data into two groups. Remember that .feather does not support nested columns, so unnest before saving if you use it.

The exercise uses the `storms` dataset, initialized below, which is a subset of the NOAA Atlantic hurricane database [[storms]](http://dplyr.tidyverse.org/reference/storms.html).

1. Group the dataset based on `name`. Nest the data.
2. Use map to calculate the minimum pressure, maximum wind speed and maximum category for each storm. Store these to the object. Unwind them into variables.
3. Plot a scatterplot with x-axis showing minimum pressure, y-axis showing maximum wind speed and colour showing maximum category.

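```r
library(tidyverse)

data(storms)

str(storms)

storms_nested <- storms %>%
    mutate_at(vars(name), as.factor) %>%
    arrange(name) %>%
    group_by(name) %>%
    nest()

# ungroup() so that slice() picks the first two rows overall,
# also when nest() keeps the grouping (as newer tidyr versions do)
storms_nested %>%
    ungroup() %>%
    slice(1:2)
```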

```
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	10010 obs. of  13 variables:
 $ name       : chr  "Amy" "Amy" "Amy" "Amy" ...
 $ year       : num  1975 1975 1975 1975 1975 ...
 $ month      : num  6 6 6 6 6 6 6 6 6 6 ...
 $ day        : int  27 27 27 27 28 28 28 28 29 29 ...
 $ hour       : num  0 6 12 18 0 6 12 18 0 6 ...
 $ lat        : num  27.5 28.5 29.5 30.5 31.5 32.4 33.3 34 34.4 34 ...
 $ long       : num  -79 -79 -79 -79 -78.8 -78.7 -78 -77 -75.8 -74.8 ...
 $ status     : chr  "tropical depression" "tropical depression" "tropical depression" "tropical depression" ...
 $ category   : Ord.factor w/ 7 levels "-1"<"0"<"1"<"2"<..: 1 1 1 1 1 1 1 1 2 2 ...
 $ wind       : int  25 25 25 25 25 25 25 30 35 40 ...
 $ pressure   : int  1013 1013 1013 1013 1012 1012 1011 1006 1004 1002 ...
 $ ts_diameter: num  NA NA NA NA NA NA NA NA NA NA ...
 $ hu_diameter: num  NA NA NA NA NA NA NA NA NA NA ...
```


The first two rows of `storms_nested`. The `data` column holds one nested data frame per storm, with the remaining 12 variables as columns:

| name | data |
|---|---|
| AL011993 | nested data frame, 8 rows × 12 columns (tropical depression, 1993-05-31 to 1993-06-02, wind 25 to 30, pressure 999 to 1003) |
| AL012000 | nested data frame, 4 rows × 12 columns (tropical depression, 2000-06-07 to 2000-06-08, wind 25, pressure 1008 to 1010) |
