![dash_wide_logo.png](attachment:dash_wide_logo.png)

## **Data Analytics & Science Hub** 
### **VS Code in Posit Workbench**

## **Learning Objectives**

1. How to link VS Code with GitHub
2. How to manage Python packages and dependencies using Pixi
3. How to pull data into Jupyter Notebooks from the DASH datalake.
4. How to connect to Databricks clusters

## **What is Posit Workbench?**

Posit Workbench is a cloud computing platform which brings together many popular Integrated Development Environments (IDEs).

Microsoft Visual Studio Code (VS Code) is one such IDE available on Posit Workbench.

![Databricks_Diagram.png](attachment:Databricks_Diagram.png)

**Unity Catalog**

Unity Catalog is Databricks’ built-in data governance layer. It decides:

- Who can read or write data
- Which compute can be used
- Keeps an audit trail of access

![UC_diag.png](attachment:UC_diag.png)

**Datalake**

Databricks hosts data in a 'data lakehouse', where permissions and access are governed using the 'medallion' architecture.

![datalake.png](attachment:datalake.png)

[Link](https://adb-7393756451346106.6.azuredatabricks.net/) to Databricks (ensure you are on the correct workspace)

## **Why use VS Code on Posit Workbench?**

| Functionality                    | VS Code in Posit Workbench | VS Code in AVD | Databricks Notebooks |
|----------------------------------|-----------------------------|----------------|------------------------|
| VS Code Extensions Supported     | ✓                           | ✓              | ✗                      |
| Full Access to CLI Tools       | ✓                           | ✓              | ✗                      |
| Databricks Connection Preconfigured | ✓                    | ✗              | ✓                        |
| Environment & Package Management  | ✓                    | ✓                | ✗                        |

## **Getting started on Posit Workbench**

- Sign into Posit Workbench [here](https://dash-workbench-prd.azure.defra.cloud/)

![1.png](attachment:1.png)

**Creating a VS Code session**
- Click **```New Session```** (either button will work)

![2.png](attachment:2.png)

- Select **```VS Code```**
- Click the down arrow on the right-hand side of the popup and select the appropriate workspace:
    - **```PRDDAPINFDB2415```** for Environment Agency
    - **```PRDDAPINFDB2416```** for Core DEFRA
    - **```PRDDAPINFDB2417```** for Natural England
    - **```PRDDAPINFDB2418```** for other arms-length bodies
- Ensure that there is a tick on the left of the popup window under **```Session Credentials```**. If not, click the padlock. A tick should appear shortly after

![3.png](attachment:3.png)

- Click **```Start Session```** and wait until the session becomes active

![4.png](attachment:4.png)
![5.png](attachment:5.png)

- Close the **```Getting Started```** tab (the tab within VS Code, not your browser tab)

![github_verify.png](attachment:github_verify.png)


## **Using GitHub**

- Navigate to the **```Explorer```** on the left-hand side (the button which looks like two sheets of paper)
- Click on **```Clone Repository```**
- **IMPORTANT: Only clone a repo if it is your first time using that repo. If you have already cloned it, choose ```Open Folder``` instead**

![6.png](attachment:6.png)

- Select **```Clone from GitHub```** from the popup at the top of the screen
- Click **```Allow```**

![github_verify2.png](attachment:github_verify2.png)

- Click **```Copy & Continue```**

![github_verify3.png](attachment:github_verify3.png)

- Click **```Open```**

![github_verify5.png](attachment:github_verify5.png)

- Paste the 8-character code in using **```Ctrl + V```** and click **```Continue```**
- Click **```Authorize```**

![github_verify4.png](attachment:github_verify4.png)

- Confirm access to GitHub using your password
- After authentication, close the GitHub browser tab and return to VS Code
- Paste the following address in the top bar in VS Code:

**```https://github.com/Defra-Data-Science-Centre-of-Excellence/DASH-Training-VSCode```**

![7.png](attachment:7.png)

- Then set the repository destination 

**IMPORTANT: Ensure that the destination is ```/mnt/workbench/home/<YOUR.NAME>```**

![8.png](attachment:8.png)

- Then open the cloned repo

![9.png](attachment:9.png)

- Click **```Yes, I trust the authors```** 

![trust_author.png](attachment:trust_author.png)

**Using the terminal**

By default, VS Code sets the recently cloned repo as the current working directory. To check this, open a terminal with the shortcut **```Ctrl + Shift + '```**; the current working directory is shown in blue text.

![terminal_cwd.png](attachment:terminal_cwd.png)

- If you want to use the terminal to clone git repos rather than the method above, open a new terminal using **```Ctrl + Shift + '```**

- Check your current working directory with **```pwd```**

- Navigate to **```/mnt/workbench/home/<YOUR.NAME>```** and paste the following:

**```git clone https://github.com/Defra-Data-Science-Centre-of-Excellence/DASH-Training-VSCode.git```**
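Before cloning from the terminal, it can help to confirm that the shell is inside your home folder. The sketch below is one way to express that check in Python. `is_inside_home` is a hypothetical helper written for this tutorial, and the home path simply follows the `/mnt/workbench/home/<YOUR.NAME>` pattern given above.

```python
from pathlib import PurePosixPath

# Home root taken from the destination path used in this tutorial
WORKBENCH_HOME = PurePosixPath("/mnt/workbench/home")


def is_inside_home(cwd: str, your_name: str) -> bool:
    """Return True if `cwd` is the user's home folder or a folder beneath it."""
    home = WORKBENCH_HOME / your_name
    path = PurePosixPath(cwd)
    return path == home or home in path.parents


# Example paths; replace "Joe.Bloggs" with your own name
print(is_inside_home("/mnt/workbench/home/Joe.Bloggs/DASH-Training-VSCode", "Joe.Bloggs"))
print(is_inside_home("/tmp", "Joe.Bloggs"))
```

In a notebook, the output of `pwd` (or `os.getcwd()`) can be passed as the first argument.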

Any changes made to the repo are automatically tracked in the **```Source Control```** tab on the left

### **Now, double click ```VSCode.ipynb``` in the ```Explorer``` to open it**

## **Python package management**

**Q: What is a Python environment?**

*A: Python environments are isolated Python versions and packages*

**Q: What is a Package Manager?**

*A: A tool which installs Python packages into your Python environment - like an app store for Python*

**Pixi**

This tutorial uses Pixi as the environment and package manager of choice

| Functionality                                        | Pixi | pip | uv  | Poetry | venv |
|-------------------------------------------------|------|-----|-----|--------|------|
| Manages Python Environments                   | ✓    | ✗   | ✗   | ✗      | ✓    |
| Manages Packages                              | ✓    | ✓   | ✓   | ✓      | ✗    |
| Multi-language Support (e.g. Python, R, Julia) | ✓    | ✗   | ✗   | ✗      | ✗    |
| Reproducible Environments                     | ✓    | ✗   | ✓   | ✓      | ✗    |
| Runs Scripts/Tasks (Task Runner)              | ✓    | ✗   | ✗   | ✓      | ✗    |

Further information on Pixi can be found [here](https://pixi.prefix.dev/latest/first_workspace/).
Other options for a Python package manager include [uv](https://docs.astral.sh/uv/) and [Poetry](https://python-poetry.org/docs/)

**Installation**
- To install Pixi, open the terminal using **```Ctrl + Shift + '```** and run the code below
- To paste code into the terminal, right click

**IMPORTANT: Ensure that you have clipboard permissions. Check your browser settings**



![12.png](attachment:12.png)

- After installation, restart the shell by clicking on the bin icon on the right, then reload the terminal with **```Ctrl + Shift + '```**

![restart_terminal.png](attachment:restart_terminal.png)

- Update Pixi:
**```pixi self-update```**

![13.png](attachment:13.png)

- Initialise Pixi:
**```pixi init```**

This will create some files and folders in your working directory, including **```pixi.toml```**. Unlike conda, which by default keeps environments in a central location, Pixi creates a *localised* Python environment inside the project folder.

Pixi does this by detecting the **```pixi.toml```** file generated by **```pixi init```**, and installs the dependencies listed in that file into that environment.

*N.B. .toml files are configuration files which contain information about packages and other dependencies*

![14.png](attachment:14.png)
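To illustrate what "detecting the `pixi.toml` file" can mean in practice, here is a minimal sketch that looks for the file in a folder and then in each of its parent folders. It illustrates the idea only and is not Pixi's actual implementation; `find_manifest` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def find_manifest(start: Path, name: str = "pixi.toml") -> Optional[Path]:
    """Search `start` and each of its parent folders for a file called `name`."""
    start = start.resolve()
    for folder in (start, *start.parents):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    return None


manifest = find_manifest(Path.cwd())
print(manifest if manifest else "No pixi.toml found; run `pixi init` first")
```

Running this from the cloned repo after `pixi init` should print the path to the manifest, which is a quick way to check that the terminal was in the intended folder.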

**Adding packages and dependencies**
- With the **```Explorer```** open on the left, double click on the **```pixi.toml```** file
- Copy the text below 
- Paste into your **```pixi.toml```** file replacing **```[dependencies]```** onwards

*N.B. Install packages as PyPI dependencies, as these pass through DASH's custom [package manager](https://dash-pkmgr-prd.azure.defra.cloud/client/#/). Pixi's default channel is conda-forge, which bypasses the package manager*


- **IMPORTANT: Do not change the [workspace] or [tasks] sections**
- Here is what the **```pixi.toml```** file should now look like:

![pixi_toml.png](attachment:pixi_toml.png)

- Now close pixi.toml (no need to save)
- Run **```pixi install```** in the terminal

![16.png](attachment:16.png)

- And that's it! We now have a fully working Python environment via Pixi!
- We can view our dependencies by running **```pixi list```**

![17.png](attachment:17.png)

## **Setting up Python and other extensions**

VS Code comes with a built-in *app store* of add-ons called extensions. These are tools that users can integrate into their workflow to help with development, e.g. debuggers, AI tools and language support. To code in Python in VS Code, the Python extension needs to be installed.

**Allowing permissions for installing extensions**

- Click on **```Manage```** (the gear icon in the bottom left)
- Click on **```Settings```** 
- Click on the **```Remote [dash-workbench-prd.azure.defra.cloud]```** path
- Search **```ssl```**
- Click on **```Proxy```** which can be found under **```Application```**
- Untick **```Http: Proxy Strict SSL```**
- Now close the settings tab

![configure_extensions.png](attachment:configure_extensions.png)

**Installing Python & Jupyter to start a Python kernel**

A kernel is the engine that runs the code in a Jupyter Notebook. When a package is installed into a Python environment and that environment is selected as the kernel, the package becomes part of the *code engine*, so its full functionality is available in the Jupyter Notebook.

- Run the Python cell below


In [None]:
# Press Shift + Enter to run this cell
print("Hello World!")

- At the top of the screen, click on **```Install/Enable suggested extensions```**
- Select **```Python Environments```** (*N.B. If you see **```Select Another Kernel```** instead, click that first*)
- Select **```default (Python 3.11.11)```**
- Python cells in your Jupyter Notebook can now be run!

*N.B. If **```default (Python 3.11.11)```** is not visible, point the Python interpreter to the right location by pressing **```Ctrl + Shift + P```** and then navigating to **```.pixi/envs/default/bin/python3.11```***

In [None]:
print("Now we have a working Python kernel!")

In [None]:
2+2
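To confirm that the notebook is running on the Pixi environment rather than another interpreter, the path of the running Python executable can be inspected. This is a minimal sketch; `uses_pixi_env` is a hypothetical helper, and it assumes the environment lives under a `.pixi/envs` folder as in the interpreter path mentioned above.

```python
import sys
from pathlib import PurePosixPath


def uses_pixi_env(executable: str) -> bool:
    """Return True if the interpreter path sits inside a `.pixi/envs` folder."""
    parts = PurePosixPath(executable).parts
    # Look for the consecutive folders ".pixi" and "envs" anywhere in the path
    return any(a == ".pixi" and b == "envs" for a, b in zip(parts, parts[1:]))


print("Interpreter:", sys.executable)
print("Running from a Pixi environment:", uses_pixi_env(sys.executable))
```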

A local Python environment has now been configured. Packages listed [here](https://dash-pkmgr-prd.azure.defra.cloud/client/#/) can be installed via the PyPI channel. Consult the [Pixi documentation](https://pixi.prefix.dev/dev/python/tutorial/#managing-both-conda-and-pypi-dependencies) for how to install packages via the terminal. 
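One way to check that the packages needed later in this notebook are present in the selected kernel is to query their installed versions. The sketch below uses the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper, and the package list is taken from the imports used further down.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["databricks-sdk", "geopandas", "folium"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package is reported as missing, add it to `pixi.toml` and run `pixi install` again.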

**Configuring Databricks Extension**
- Click on the **```Extensions```** tab on the left or press **```Ctrl + Shift + X```**

![extensions_icon.png](attachment:extensions_icon.png)

- Search for **```databricks```**
- Click **```Install```** on the top result, then click **```Trust Publisher and Install```**
- If the tabs for these new extensions aren't visible, click on the ellipsis on the left and select Databricks

![db_extension.png](attachment:db_extension.png)

- Click **```Create configuration```**
- Select **```Profiles: workbench```**
- Select **```workbench```**
- Select the cluster **```RStudio_15.4LTSML```**
- **IMPORTANT: You may see a prompt in the bottom right to restart Jupyter. If you do, click ```Restart All Jupyter Kernels```**

![22.png](attachment:22.png)

- In the top right of the Jupyter Notebook there will now be a red Databricks button, which runs the entire notebook on the selected Databricks cluster

![databricks_button.png](attachment:databricks_button.png)

## **Getting data from Unity Catalog into VS Code**

In [None]:
# Connect to workspace
from databricks.sdk import WorkspaceClient
import geopandas as gpd
import folium

from folium.plugins import Fullscreen
from io import BytesIO

w = WorkspaceClient(profile="workbench")

If the cell above raises an error, reload the page and run the cell again

![reload.png](attachment:reload.png)
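Creating a `WorkspaceClient` does not by itself show which identity is being used. A simple way to confirm that the connection works is to ask the workspace who you are signed in as. The sketch below wraps the SDK's `current_user.me()` call in a hypothetical helper, `describe_connection`, and assumes `w` was created in the cell above.

```python
def describe_connection(client) -> str:
    """Return a one-line summary of the user the client is authenticated as."""
    me = client.current_user.me()  # makes a request to the workspace
    return f"Connected to Databricks as {me.user_name}"


# In the notebook, after creating `w` in the cell above:
# print(describe_connection(w))
```

If this call raises an authentication error, the session credentials may not have been picked up; reloading the page as described above is the first thing to try.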

In [None]:
# Path to Marine Conservation Zones dataset
path = "/Volumes/prd_dash_bronze/natural_england_open_data_geoportal_unrestricted/marine_conservation_zones/format_GEOPARQUET_marine_conservation_zones/LATEST_marine_conservation_zones/Marine_Conservation_Zones___Natural_England_and_JNCC.parquet"

raw = w.files.download(path).contents.read()

gdf = gpd.read_parquet(BytesIO(raw))
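Unity Catalog volume paths follow the pattern `/Volumes/<catalog>/<schema>/<volume>/<path to file>`, so the long path above can be read as a catalog (here a bronze-layer catalog), a schema and a volume. The sketch below splits a path into those parts; `parse_volume_path` and `VolumePath` are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class VolumePath(NamedTuple):
    catalog: str
    schema: str
    volume: str
    relative_path: str


def parse_volume_path(path: str) -> VolumePath:
    """Split a `/Volumes/<catalog>/<schema>/<volume>/...` path into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5 or parts[:2] != ("/", "Volumes"):
        raise ValueError(f"Not a Unity Catalog volume path: {path}")
    return VolumePath(parts[2], parts[3], parts[4], "/".join(parts[5:]))


info = parse_volume_path(
    "/Volumes/prd_dash_bronze/natural_england_open_data_geoportal_unrestricted/"
    "marine_conservation_zones/format_GEOPARQUET_marine_conservation_zones/"
    "LATEST_marine_conservation_zones/"
    "Marine_Conservation_Zones___Natural_England_and_JNCC.parquet"
)
print("Catalog:", info.catalog)
print("Schema: ", info.schema)
print("Volume: ", info.volume)
```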

In [None]:
# Render parquet file using Folium
gdf_wgs84 = gdf.to_crs(epsg=4326)

gdf_wgs84["geometry"] = gdf_wgs84.geometry.simplify(0.0005, preserve_topology=True)

pts = gdf_wgs84.geometry.representative_point()
center = [pts.y.mean(), pts.x.mean()]

m = folium.Map(location=center, zoom_start=6, tiles="CartoDB positron")

folium.GeoJson(
    gdf_wgs84,
    name="MCZ",
    tooltip=folium.GeoJsonTooltip(
        fields=["MCZ_NAME", "MCZ_CODE", "STATUS"],
        aliases=["Name:", "Code:", "Status:"]
    ),
    style_function=lambda feat: {
        "weight": 1,
        "fillOpacity": 0.4,
    },
).add_to(m)

Fullscreen(
    position="topright",
    title="Full Screen",
    title_cancel="Exit Full Screen",
    force_separate_button=True
).add_to(m)

folium.LayerControl().add_to(m)

m

If the code cell gives the error: **```Make this Notebook Trusted to load map: File -> Trust Notebook```**, close VSCode.ipynb and reopen it, then rerun the cell above
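Because the geometries are simplified after reprojecting to EPSG:4326, the tolerance of `0.0005` passed to `simplify` is measured in degrees rather than metres. The sketch below gives a rough feel for that distance. It assumes roughly 111,320 m per degree of latitude and uses an illustrative latitude of 54 degrees, neither of which comes from the dataset itself.

```python
import math

METRES_PER_DEGREE_LATITUDE = 111_320  # approximate value, assumed for illustration


def tolerance_in_metres(tolerance_deg: float, latitude_deg: float) -> tuple[float, float]:
    """Return the approximate (north-south, east-west) size of a tolerance in metres."""
    north_south = tolerance_deg * METRES_PER_DEGREE_LATITUDE
    # A degree of longitude shrinks with the cosine of the latitude
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns, ew = tolerance_in_metres(0.0005, latitude_deg=54.0)
print(f"0.0005 degrees is roughly {ns:.0f} m north-south and {ew:.0f} m east-west")
```

A larger tolerance produces lighter geometries and a faster map at the cost of boundary detail.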

In [None]:
# Enter your name
YOUR_NAME = "" # Use the format "Joe.Bloggs"
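The cells that follow build Unity Catalog paths from `YOUR_NAME`, so an empty value would make them point at the shared `VSCode` folder rather than a personal one. A small check before continuing can catch this. The sketch below is one possible approach; `training_folder` is a hypothetical helper, and the pattern for a valid name (letters, apostrophes and hyphens separated by full stops) is an assumption.

```python
import re

# Parent folder taken from the paths used in this notebook
TRAINING_ROOT = "/Volumes/prd_dash_lab/databricks_training_unrestricted/training/VSCode"

# Assumed name format: two or more parts separated by full stops, e.g. "Joe.Bloggs"
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z'-]*(\.[A-Za-z][A-Za-z'-]*)+")


def training_folder(your_name: str) -> str:
    """Validate a "First.Last" style name and return the user's training folder."""
    if not NAME_PATTERN.fullmatch(your_name):
        raise ValueError(f'Expected a name like "Joe.Bloggs", got {your_name!r}')
    return f"{TRAINING_ROOT}/{your_name}"


print(training_folder("Joe.Bloggs"))
```

Calling `training_folder(YOUR_NAME)` raises a `ValueError` if the name has been left blank.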

In [None]:
# Create a folder in Unity Catalog
w.files.create_directory(f"/Volumes/prd_dash_lab/databricks_training_unrestricted/training/VSCode/{YOUR_NAME}")

In [None]:
# Write file to Unity Catalog Volume
buf = BytesIO()
gdf.to_parquet(buf, index=False)
buf.seek(0)

w.files.upload(f"/Volumes/prd_dash_lab/databricks_training_unrestricted/training/VSCode/{YOUR_NAME}/Marine_Conservation_Zones___Natural_England_and_JNCC.parquet", buf, overwrite=True)
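The `buf.seek(0)` call in the cell above matters because writing to a `BytesIO` buffer leaves its position at the end of the data, so reading it without rewinding would return nothing. The sketch below shows this with plain bytes standing in for the Parquet output.

```python
from io import BytesIO

buf = BytesIO()
buf.write(b"example bytes")  # stands in for gdf.to_parquet(buf, index=False)

unread = buf.read()   # the position is at the end, so nothing is returned
buf.seek(0)           # move the position back to the start
content = buf.read()  # now the full content is returned

print("Before seek:", unread)
print("After seek: ", content)
```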

In [None]:
# Deleting the file from Unity Catalog
w.files.delete(f"/Volumes/prd_dash_lab/databricks_training_unrestricted/training/VSCode/{YOUR_NAME}/Marine_Conservation_Zones___Natural_England_and_JNCC.parquet")

In [None]:
# Deleting the directory from Unity Catalog
w.files.delete_directory(f"/Volumes/prd_dash_lab/databricks_training_unrestricted/training/VSCode/{YOUR_NAME}")
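To confirm that the clean-up worked, the contents of the parent folder can be listed. The sketch below wraps the SDK's `files.list_directory_contents` call in a hypothetical helper, `folder_names`; it assumes the `w` client and `YOUR_NAME` variable from the cells above and that you have permission to list the parent folder.

```python
def folder_names(client, directory_path: str) -> list[str]:
    """Return the names of the entries directly inside a volume directory."""
    entries = client.files.list_directory_contents(directory_path)
    return sorted(entry.name for entry in entries)


# In the notebook, after the cells above:
# parent = "/Volumes/prd_dash_lab/databricks_training_unrestricted/training/VSCode"
# print(YOUR_NAME in folder_names(w, parent))
```

The same helper can be pointed at your own folder after the upload step to check that the Parquet file arrived.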

Further documentation on how to use Unity Catalog with Python can be found [here](https://docs.databricks.com/aws/en/dev-tools/sdk-python#manage-files-in-unity-catalog-volumes)

**Configuring Databricks Power Tools Extension**
- On the extensions tab, search for **```databricks power tools```**
- Click **```Install```** on the top result, then click **```Trust Publisher and Install```**
- Reload the webpage

![reload.png](attachment:reload.png)

- Click on the Jupyter Notebook kernel which currently reads **```default (Python 3.11.11)```** on the top right and click **```Select Another Kernel```**
- Select **```Databricks RStudio_15.4LTSML```**
- The kernel *is* now the Databricks cluster, rather than the local Python environment created with Pixi