# 🎼 **Lab 2 - Orchestrating Spark**
In this module, we will explore how to orchestrate Spark workloads using Data Factory, the Fabric Scheduler, and built-in orchestrator functions. We will also explore how to use resource files to make code more modular.

## 🎯 What You'll Learn 

By the end of this lab, you'll gain insights into:  

- Reference Notebook via ```%run```
- Reference Notebook via ```notebookutils.notebook.run```
- Reference multiple Notebooks via ```notebookutils.notebook.runMultiple```
- How to use Notebook resources
- How to add Notebooks into pipelines
- Running Notebooks in a High Concurrency (HC) Session
- Scheduling Notebooks with the Fabric Scheduler
---

**Get Ready to Code!**
Now that you have an overview, let's get started with hands-on exercises! 🚀


## 🚧 **2.1 Create and Prepare Notebooks**
### **2.1.2 Create Child Notebooks**
Create two notebooks: `childNotebook1` and `childNotebook2`. You can create them manually or download and upload the prebuilt versions:
- [childNotebook1.ipynb](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_lab_materials/childNotebook1.ipynb)
- [childNotebook2.ipynb](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_lab_materials/childNotebook2.ipynb)

<details> <summary><strong>🔑 childNotebook1:</strong> Click to reveal code</summary>

```python
# Code cell 1, marked as parameters cell
parameter1 = ''
parameter2 = ''

# Code cell 2
print(f'This is child notebook with parameter1 = {parameter1}, parameter2 = {parameter2}')

# Code cell 3
# Exit the notebook and return an exit value to the caller
notebookutils.notebook.exit(f'Exit with current Notebook Name: {notebookutils.runtime.context["currentNotebookName"]}')

```
</details>

<details> <summary><strong>🔑 childNotebook2:</strong> Click to reveal code</summary>

```python
# Code cell 1, marked as parameters cell
input1 = ''
input2 = ''

# Code cell 2
print("cell1 in childNotebook2")
print(f'input1 = {input1}\ninput2 = {input2}')

# Code cell 3
# Exit the notebook and return an exit value to the caller
notebookutils.notebook.exit(f'Exit with current Notebook Name: {notebookutils.runtime.context["currentNotebookName"]}')
```
</details>


## 🔗 **2.2 Run and Chain Notebooks**
### **2.2.1 Inject a Notebook with _%run_**
Use [`%run`](https://learn.microsoft.com/en-us/fabric/data-engineering/author-execute-notebook#reference-run) to inject another Notebook's code into the current session:

```python
%run childNotebook1 { 'parameter1': 'value1', 'parameter2': 'value2' }
```
You can also reference Python or SQL files from Notebook or Environment resource folders:

```python
%run [-b/--builtin | -e/--environment | -c/--current] script_file.py/.sql [variables ...]
```

`%run` options:
- `-b` / `--builtin`: Built-in notebook resources
- `-e` / `--environment`: Environment resources
- `-c` / `--current`: Always uses the current Notebook's resources, even if the current Notebook is referenced by other Notebooks

📌 **Challenge:** Use `%run` to run the code from **childNotebook2**:


```python
%run
```

<details>
  <summary><strong>🔑 Answer:</strong> Click to reveal</summary>

```python
%run childNotebook2 { 'input1': 'foo', 'input2': 'bar' }
```

</details>

### **2.2.2 Run a Notebook Programmatically with _notebookutils.notebook.run_**
The [```notebookutils.notebook.run```](https://learn.microsoft.com/en-us/fabric/data-engineering/notebook-utilities#reference-a-notebook) function runs a referenced notebook and returns its exit value. You can nest these calls, either interactively in a notebook or in a pipeline. The referenced notebook runs on the Spark pool of the notebook that calls the function. Unlike `%run`, each call shows up as a distinct job in the Monitoring hub, with a Notebook snapshot available.

```python
notebookutils.notebook.run("notebook name", <timeoutSeconds>, <parameterMap>, <workspaceId>)
```

![nbutils.run](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/Reference%20notebook%20via%20nbutils.jpg?raw=true)
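As a minimal sketch of the signature above, the helper below runs a child notebook with parameters and returns its exit value. The name `run_child` and its `run` argument are hypothetical, added only so the logic can be exercised outside Fabric; the 90-second timeout is an illustrative choice. Inside a Fabric notebook, `notebookutils` is already available and no import is needed.

```python
def run_child(notebook_name, params, timeout_seconds=90, run=None):
    """Run a child notebook and return its exit value.

    `run` is a hypothetical injection point for trying the helper outside
    Fabric; by default the call goes to notebookutils.notebook.run.
    """
    if run is None:
        # notebookutils is predefined in Fabric notebooks.
        run = notebookutils.notebook.run
    # Positional order follows the signature shown above:
    # notebook name, timeout in seconds, parameter map.
    return run(notebook_name, timeout_seconds, params)


# In a Fabric notebook:
# exit_value = run_child("childNotebook1", {"parameter1": "foo", "parameter2": "bar"})
# print(exit_value)
```

The keys of the parameter map should match the variable names in the child notebook's parameters cell (`parameter1` and `parameter2` for **childNotebook1**).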

### **2.2.3 Reference Multiple Notebooks with _notebookutils.notebook.runMultiple_**
The [`notebookutils.notebook.runMultiple`](https://learn.microsoft.com/en-us/fabric/data-engineering/notebook-utilities#reference-run-multiple-notebooks-in-parallel) function runs multiple notebooks in parallel or according to a predefined DAG (directed acyclic graph). The child notebooks execute much as they would in high-concurrency mode: they all use the same Spark session, so compute resources are shared.

```python
notebookutils.notebook.runMultiple(["NotebookSimple", "NotebookSimple2"])
```

📌 **Challenge:** Use `runMultiple` to run both **childNotebook1** and **childNotebook2**:

<details>
  <summary><strong>🔑 Answer:</strong> Click to reveal</summary>

~~~python
exitValues = notebookutils.notebook.runMultiple(["childNotebook1", "childNotebook2"])
print(exitValues)
~~~
</details>

💡 **Tip:** you can use the `json` Python module to parse and format the exit values:

<br>

```python
import json
print(json.dumps(exitValues, indent=4))
```

![nbutils.multirun](https://github.com/voidfunction/FabCon25SparkWorkshop/blob/main/module-2-orchestrating-spark/_media/Reference%20multi%20notebooks%20via%20nbutils.jpg?raw=true)
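The exit values can also be condensed into a per-notebook summary. The sketch below assumes the result is a dictionary keyed by notebook name whose entries carry `exitVal` and `exception` fields; check that shape against your own printed output before relying on it. The helper name `summarize_exit_values` and the `sample` data are hypothetical, written by hand for illustration rather than captured from a real run.

```python
import json


def summarize_exit_values(results):
    """Reduce each notebook's result to its exit value and a failure flag."""
    summary = {}
    for name, outcome in results.items():
        summary[name] = {
            "exitVal": outcome.get("exitVal"),
            # Assumption: a missing or None "exception" means the run succeeded.
            "failed": outcome.get("exception") is not None,
        }
    return summary


# Hand-written sample in the assumed shape (not real output).
sample = {
    "childNotebook1": {"exitVal": "Exit with current Notebook Name: childNotebook1", "exception": None},
    "childNotebook2": {"exitVal": "Exit with current Notebook Name: childNotebook2", "exception": None},
}
print(json.dumps(summarize_exit_values(sample), indent=4))
```

In a Fabric notebook, you would pass `exitValues` from the challenge above instead of `sample`.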


<br>

#### **2.2.3.1 Specifying a DAG for Additional Control**
Run the code below to see how a DAG can control the exact sequencing and Notebook-level configuration options:

```python
DAG = {
    "activities": [
        {
            "name": "step1", # activity name, must be unique
            "path": "childNotebook1", # notebook path
            "timeoutPerCellInSeconds": 90, # max timeout for each cell, default to 90 seconds
            "args": {"parameter1": "foo", "parameter2": "bar"}, # notebook parameters
        },
        {
            "name": "step2",
            "path": "childNotebook2",
            "timeoutPerCellInSeconds": 120,
            "args": {"input1": "foo", "input2": "bar"}
        }
    ],
    "timeoutInSeconds": 43200, # max timeout for the entire DAG, default to 12 hours
    "concurrency": 50 # max number of notebooks to run concurrently, default to 50, this is limited by the number of executors in your Spark Pool.
}
notebookutils.notebook.runMultiple(DAG, {"displayDAGViaGraphviz": True})
```
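The two activities in the DAG above have no ordering between them, so they can run in parallel. To force a sequence, an activity can list the names it must wait for under a `dependencies` key. The sketch below is one possible way to assemble such a DAG; the helper `build_dag` is hypothetical and only performs basic checks (unique names, known dependencies) before you hand the dictionary to `runMultiple`.

```python
def build_dag(activities, concurrency=50, timeout_in_seconds=43200):
    """Assemble a runMultiple-style DAG dict after basic sanity checks."""
    names = [activity["name"] for activity in activities]
    if len(names) != len(set(names)):
        raise ValueError("activity names must be unique")
    for activity in activities:
        for dependency in activity.get("dependencies", []):
            if dependency not in names:
                raise ValueError(f"unknown dependency: {dependency}")
    return {
        "activities": activities,
        "timeoutInSeconds": timeout_in_seconds,
        "concurrency": concurrency,
    }


sequential_dag = build_dag([
    {
        "name": "step1",
        "path": "childNotebook1",
        "args": {"parameter1": "foo", "parameter2": "bar"},
    },
    {
        "name": "step2",
        "path": "childNotebook2",
        "args": {"input1": "foo", "input2": "bar"},
        "dependencies": ["step1"],  # step2 starts only after step1 finishes
    },
])

# In a Fabric notebook:
# notebookutils.notebook.runMultiple(sequential_dag, {"displayDAGViaGraphviz": True})
```

Validating the dictionary up front is a design choice of this sketch: it surfaces a misspelled activity name before any notebook starts running.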