# Farming Stats RStudio Cluster

## Notebook overview
This notebook automates turning the Farming Stats Shared RStudio cluster on and off. It is run by two jobs in Databricks Workflows. The first job turns the cluster on at 06:00 every Monday, which should ensure the cluster is up and running by 07:00 when staff begin to arrive. The second job turns the cluster off again at 19:00 every Friday, when all staff *should* have finished their work.

### Key components of the notebook

#### Personal Access Token
To manage a cluster, you must have a Databricks Personal Access Token (PAT). The PAT is stored separately in a ".env" file and then pulled into the notebook (see chunk 8). For this notebook, the .env file is stored in Josh Moatt's personal workspace. The .env file specifies the PAT as "DATABRICKS_TOKEN = my_token". Keeping the .env file separate from this repo helps keep the PAT secure: it is not accessible to everyone and it is not stored in the git repo. This should not affect the ability of other admins to run the jobs that turn the cluster on and off, because once a job is created admins can run it as the creator (i.e. Josh), so the job will have the necessary permissions.
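
If the .env file is moved or the variable is misnamed, the token silently comes back empty and the problem only surfaces later. As a minimal sketch (the helper name `require_env` is hypothetical and not part of this notebook), the value could be checked straight after the .env file is loaded:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is missing or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that the .env file exists and has been loaded"
        )
    return value
```

In the notebook this would be called as `token = require_env("DATABRICKS_TOKEN")` once `load_dotenv` has run.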

#### Cluster info
For the code to work, you must specify which cluster is to be turned on and off. To do this you specify two variables: 

* `workspace_url` - the URL for the workspace containing the cluster. This is likely to be the same as the one used here, as we are all using the UC workspace.

* `cluster_id` - the ID of the cluster you want to manage. This can be found in the URL of the cluster page: it is the text that comes after "clusters/" and before any "?". E.g. if the URL is "https://adb-2353967604677522.2.azuredatabricks.net/compute/clusters/0000-123456-abcdefg1?o=2353967604677522" then the cluster ID is "0000-123456-abcdefg1".
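
The cluster ID can also be pulled out of the URL programmatically rather than by eye. The following is a small sketch using only the standard library; the function name `cluster_id_from_url` is hypothetical, and it assumes the ID is always the path segment directly after "clusters", as in the example URL above:

```python
from urllib.parse import urlparse


def cluster_id_from_url(url: str) -> str:
    """Return the path segment that follows 'clusters' in a cluster page URL."""
    # urlparse separates the path from the query string (the "?o=..." part)
    segments = [s for s in urlparse(url).path.split("/") if s]
    try:
        return segments[segments.index("clusters") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"No cluster ID found in URL: {url}") from None


example_url = (
    "https://adb-2353967604677522.2.azuredatabricks.net"
    "/compute/clusters/0000-123456-abcdefg1?o=2353967604677522"
)
print(cluster_id_from_url(example_url))  # 0000-123456-abcdefg1
```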

#### Parameters
The parameter controlling the code is `task`, which is set in block 10. The default value for `task` is "turn_on", which causes the function `w.clusters.start` to be run, turning on the cluster (see block 12). When `task` is set to "turn_off", the function `w.clusters.delete` is run, which turns the cluster off (see block 12). Despite its name, `w.clusters.delete` terminates the cluster rather than permanently removing it, so the cluster can be started again.
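
One way to express this mapping from `task` to an action is a lookup table, which also makes it easy to reject unexpected values. This is only a sketch of the idea, not the code the notebook uses: the function name `run_task` is hypothetical, and `clusters` is assumed to be any object with `start` and `delete` methods that accept `cluster_id`, as `w.clusters` does.

```python
def run_task(clusters, cluster_id: str, task: str = "turn_on"):
    """Start or turn off the cluster according to `task`.

    `clusters` is assumed to expose `start` and `delete` methods that
    accept a `cluster_id` keyword argument.
    """
    actions = {
        "turn_on": clusters.start,    # turns the cluster on
        "turn_off": clusters.delete,  # turns the cluster off
    }
    if task not in actions:
        # raising makes the job run fail instead of finishing silently
        raise ValueError(f"Invalid task {task!r}; expected one of {sorted(actions)}")
    return actions[task](cluster_id=cluster_id)
```

In the notebook this would be called as `run_task(w.clusters, cluster_id, task)`.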

### How the notebook runs
The notebook uses Python and is parameterised to execute code depending on which job is being run. The steps of the notebook are as follows:

1. install dotenv
2. import packages/functions
3. read in PAT and assign to `token`
4. set `workspace_url` and `cluster_id`
5. set parameters to `task`. Default is "turn_on", alternative is "turn_off"
6. access workspace/cluster
7. execute code depending on `task`.
 


### Automating notebook through a job
To automate the execution of this code, the notebook is run through jobs in Databricks Workflows. It uses two separate jobs, one to turn the cluster on and one to turn it off. How the code is executed depends on the parameter `task`, which can be set in the job.

To set up the job to run this notebook, you do the following:

1. In Databricks, go to Workflows in the left-hand menu.
2. Click the blue "Create job" button in the top right corner.
3. At the top right of the screen, where it says "New Job....", click to edit the job title - this is the name that will appear in the jobs list.
4. In the "Task name" box, add your task name (best practice is to use snake case e.g. "turn_on_cluster").
5. In the "Path" box, insert the full path to your notebook (e.g. "/Workspace/farming_stats/config/farming-stats-cluster-management/farming-stats-rstudio-cluster").
6. Ensure "compute" is set to Serverless.
7. If creating a job to turn the cluster on, the default parameter value is "turn_on" so you can skip this step. If creating a job to turn the cluster off, click the blue "+ add" button next to parameters. In "Key" enter the name of your parameter variable - "task". In value enter your parameter value - "turn_off".
8. Click the blue "create task" button.
9. Under "Schedules & Triggers" on the right-hand panel, click "Add trigger". If prompted to save the job, do so.
10. In the drop down that appears, select "Scheduled" then "Advanced". You can now set this to execute the code whenever you desire, e.g. every Week on Monday at 06:00 etc. Once happy click the blue "Save" button. 
11. Under "Job notification" click edit notifications. You can use this to set up the email alerts you want to receive - I went for alerts **if** the code fails to run.
12. Save the job. If you want to test it, you can always click the blue "run now" button in the top right (it's a good idea to test these when convenient).
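
Databricks job schedules can also be written as Quartz cron expressions, which is useful if you want to record or double check the exact schedule. The sketch below builds the expressions for the two schedules described above; the helper name `weekly_quartz_cron` is hypothetical, and it assumes the six-field Quartz layout (seconds, minutes, hours, day of month, month, day of week).

```python
def weekly_quartz_cron(day: str, hour: int, minute: int = 0) -> str:
    """Build a Quartz cron expression that fires once a week.

    Field order: seconds minutes hours day-of-month month day-of-week.
    """
    days = {"MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"}
    day = day.upper()
    if day not in days:
        raise ValueError(f"Unknown day: {day}")
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # "?" means "no specific value" for day-of-month, "*" means every month
    return f"0 {minute} {hour} ? * {day}"


print(weekly_quartz_cron("MON", 6))   # 0 0 6 ? * MON
print(weekly_quartz_cron("FRI", 19))  # 0 0 19 ? * FRI
```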



In [0]:
%pip install python-dotenv

In [0]:
import os
from databricks.sdk import WorkspaceClient
from dotenv import load_dotenv

In [0]:
# read .env file
_ = load_dotenv("/Workspace/Users/joshua.moatt@defra.gov.uk/admin/.env")

# pull my token 
token = os.getenv('DATABRICKS_TOKEN')

In [0]:
# workspace URL
workspace_url = "https://adb-2353967604677522.2.azuredatabricks.net"

# cluster ID
cluster_id = "0313-132909-hsaqokf5"

In [0]:
# create parameter
dbutils.widgets.text("task", "turn_on")

# assign parameter to task
task = dbutils.widgets.get("task")

In [0]:
w = WorkspaceClient(
    host = workspace_url,
    token = token
)

In [0]:
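if task == "turn_on":
    # start the cluster
    w.clusters.start(cluster_id = cluster_id)
elif task == "turn_off":
    # terminate the cluster (it is not permanently deleted)
    w.clusters.delete(cluster_id = cluster_id)
else:
    # fail the run so the job's failure notification is triggered
    raise ValueError(f"Invalid task: {task!r}")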