# Set up a Prefect working environment

To work with Prefect, you need three things:
- Prefect server: coordinates with workers to run the registered workflows.
- Prefect worker: the compute component that runs the registered workflows.
- Prefect workflow: the task logic, defined by users.

The `Prefect server` is managed by CASD inside each bulle and runs as a daemon service.

> Users need to manage their own workers and workflows.


## 1. Prepare your python virtual env for prefect

```powershell
# open a conda shell with Python 3.11

# create a dedicated virtual env if you don't have one
conda create --name <username>-prefect python=<python-version> --offline

# for example
# change pengfei to your name
conda create --name pengfei-prefect python=3.11 --offline

# activate your virtual env
conda activate pengfei-prefect

# install prefect package
pip install prefect

# check the installed prefect version
prefect version

# expected output
Version:              3.4.25
API version:          0.8.4
Python version:       3.11.8
...
```

> This virtual env will be used by your Prefect worker. If your workflow requires a Python package, you need to install that package inside this virtual env. For example, if a workflow uses pyspark, pyspark must be installed in this virtual env.
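
Before starting a worker, it can help to confirm that the packages your workflow imports are actually visible in the activated virtual env. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` and the package list are illustrative assumptions, not part of Prefect.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found in this env."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "pyspark" is only an example; list the import names your workflow uses.
    required = ["prefect", "pyspark"]
    missing = missing_packages(required)
    if missing:
        print("Install these in the worker env:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it with the Python interpreter of the virtual env that your worker uses, since a different interpreter may see a different set of packages.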


## 2. Register a prefect worker in the prefect server

Why does `CASD not provide a general worker` for everyone? A workflow may call any Python package, and `CASD cannot pre-install every required Python package in a shared worker`; doing so would also slow the worker down. We therefore ask users to create their own worker and manage their own Python packages in it.

> The instructions below are the minimum configuration for creating a worker. A dedicated chapter will give more details on how to create and manage workers (e.g. queue, priority, etc.).

### 2.1 Check prefect server availability

Normally, the Prefect server runs on `localhost:4200`. To check that it is available, open `http://localhost:4200` in your browser.
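
The same availability check can be scripted. The sketch below assumes that the server answers on a `/health` route under the API URL (`http://localhost:4200/api/health`); treat that route, and the helper name `server_is_up`, as assumptions to verify against your Prefect version.

```python
import urllib.request


def server_is_up(api_url="http://localhost:4200/api", timeout=3.0):
    """Return True if GET <api_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{api_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, connection refused and timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    print("Prefect server reachable:", server_is_up())
```

If this prints `False`, the server is probably not running or listens on another port, and the later steps will fail to connect.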

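### 2.2 Configure environment variables for the Prefect server

Open a new PowerShell terminal and run the commands below. `setx` stores the variables permanently for your user account, but they are only visible in terminals opened afterwards.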

```powershell
setx PREFECT_HOME "C:\Users\Public\Documents\prefect_home"
setx PREFECT_API_URL "http://localhost:4200/api"
```
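
Because variables written with `setx` are not visible in terminals that were already open, it is worth checking what the current session actually sees. The following is a small sketch; the helper name `missing_settings` is hypothetical, and only the two variable names come from the commands above.

```python
import os

# The two variables configured with setx in this section.
REQUIRED = ("PREFECT_HOME", "PREFECT_API_URL")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Not set in this terminal:", ", ".join(missing))
    else:
        print("PREFECT_API_URL =", os.environ["PREFECT_API_URL"])
```

If a variable is reported as missing, close the terminal, open a new one, and activate your virtual env again.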

### 2.3 Create and register a new work pool

Open a new conda shell so that the variables set in the previous step are picked up, and activate your virtual env.

```powershell
# create a new work pool named after your user name
# (PowerShell syntax; in a cmd-based conda prompt, use "%USERNAME%-pool" instead)
prefect work-pool create "${Env:UserName}-pool" --type process

# then start a worker that polls your work pool
prefect worker start --pool "${Env:UserName}-pool"

# expected output
Discovered type 'process' for work pool 'pliu-pool'.
Worker 'ProcessWorker a8986b12-18f9-40d6-863f-5e08681e63b4' started!
```
> The worker runs as a foreground process: if you close the terminal, the process is killed and your worker stops.

> Go back to your browser and open `http://localhost:4200`. You should see something like the figure below.

![prefect_work_pool.png](../assets/prefect_work_pool.png)

## 3. Create and run a simple workflow to test your worker

We recommend saving your workflows in a directory that can be shared with other users; you can still keep some workflows private if you want.
Open a new PowerShell terminal and run the commands below.

```powershell
# create a personal dir to store your workflow
New-Item -Path "C:\Users\Public\Documents\prefect_home\flows\$Env:UserName" -Type Directory

# for example, if my username is pliu, a directory pliu will be created under flows. The full path will be
cd C:\Users\Public\Documents\prefect_home\flows\pliu
```

Create a file called `simple_etl.py` under `C:\Users\Public\Documents\prefect_home\flows\pliu` (replace `pliu` with your own user name), and paste the code below into it.

```python
from typing import List
from prefect import flow, task

@task
def extract()->List[int]:
    data = [1, 2, 3]
    msg = f"Task1: Extracting data: {data}"
    print(msg)
    return data

@task
def transform(data:List[int])->List[int]:
    print(f"Task2: Transforming data: {data}")
    return [x * 10 for x in data]

@task
def load(data:List[int]):
    print(f"Task3: Loading result {data}")

@flow
def etl_flow():
    # task 1
    data = extract()
    # task 2
    clean = transform(data)
    # task 3
    load(clean)


if __name__ == "__main__":
    etl_flow()
```

### 3.1 Run your workflow

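Open a new conda shell and activate your virtual env (the previous terminal is busy running the worker), then run your workflow once by hand. This mode is for testing only: the flow runs directly in your terminal's Python process and reports its state to the server.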


```powershell
# replace pliu by your own user name
cd C:\Users\Public\Documents\prefect_home\flows\pliu

# start your workflow as a single run
python simple_etl.py
```

> Go back to your browser and open `http://localhost:4200`. Click on the `Runs` tab. You should see something like the figure below.

![simple_run.png](../assets/simple_run.png)


> The run name starts with the flow name, followed by an auto-generated string. Click on it to see the output of your workflow.

The figure below shows an example:

![flow_output_details.png](../assets/flow_output_details.png)

## 4. Register your workflow as a deployment

In the above example, we launched a workflow manually. We could say that this covers the automation part. Now we need to do the orchestration part:
we need to specify and manage `when, where, and how a workflow should run`.

In Prefect, this concept is implemented by an object called a `deployment`. In this section, we will create a deployment from the existing workflow.

> There is a breaking change between Prefect 2.x and 3.x: the way a deployment is built has changed completely. This tutorial
> only shows the new approach used in 3.x.

Deployment specifications are defined declaratively in a file called `prefect.yaml` in your flow folder. For example, my flow folder is
`C:\Users\Public\Documents\prefect_home\flows\pliu`, so I need to create `prefect.yaml` in that folder.

```yaml
# prefect.yaml
name: pengfei_prefect_quick_start
prefect-version: 3.4.25

# it can have multiple deployments
deployments:
  # Deployment 1: simple ETL flow
  - name: pengfei_simple_etl
    # entrypoint format: <path to python file>:<flow function name>
    entrypoint: simple_etl.py:etl_flow
    description: "This simple workflow shows how Prefect works"
    # tags belong to the deployment, not to the work pool
    tags: [test, local]
    work_pool:
      name: pliu-pool
      type: process
```

The directory layout should look like this:
```text
C:\Users\Public\Documents\prefect_home\flows\pliu
│
├── simple_etl.py
└── prefect.yaml
```
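
A deployment entrypoint has the form `path/to/file.py:function_name`, and a mismatch between that string and the function actually defined in the file is an easy mistake to make. The sketch below checks an entrypoint string against a source file with the standard `ast` module, without importing Prefect; the helper name `entrypoint_exists` is hypothetical and the check only looks at top-level functions.

```python
import ast
from pathlib import Path


def entrypoint_exists(entrypoint, base_dir="."):
    """Check that 'file.py:function' names a top-level function in that file."""
    # rpartition keeps a Windows drive colon (C:\...) in the file part.
    file_part, sep, func_name = entrypoint.rpartition(":")
    if not sep or not file_part or not func_name:
        return False
    source_file = Path(base_dir) / file_part
    if not source_file.is_file():
        return False
    tree = ast.parse(source_file.read_text(encoding="utf-8"))
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == func_name
        for node in tree.body
    )
```

For the flow file shown earlier, `entrypoint_exists("simple_etl.py:etl_flow", r"C:\Users\Public\Documents\prefect_home\flows\pliu")` should return `True`, while an entrypoint that names a function not defined in `simple_etl.py` returns `False`.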

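> Deployment names should be unique: registering a deployment whose name already exists for the same flow updates the existing deployment instead of creating a new one. Replace `pengfei_simple_etl` with something like `<username>_simple_etl`.

Now we need to register the deployment. Use the command below to start an interactive prompt.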

```powershell
# go to your flow folder
# for example, if my username is pliu, the full path is
cd C:\Users\Public\Documents\prefect_home\flows\pliu

# start a deployment management Prompt
prefect deploy

# you should see output similar to the one below (the name and entrypoint shown come from your own prefect.yaml)
Would you like to use an existing deployment configuration? [Use arrows to move; enter to select; n to select none]
┏━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name               ┃ Entrypoint               ┃ Description                                  ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ >  │ pengfei_simple_elt │ simple_etl.py:start_flow │ This simple workflow shows how Prefect works │
│    │                    │                          │ No, configure a new deployment               │
└────┴────────────────────┴──────────────────────────┴──────────────────────────────────────────────┘

# press Enter to select the existing configuration, then answer n to both prompt questions. You should get output like the following
Your Prefect workers will need access to this flow's code in order to run it. Would you like your workers to pull your flow code from a remote storage location when running this flow? [y/n] (y): n
Your Prefect workers will attempt to load your flow from: C:\Users\PLIU\Documents\git\Seminar_workflow_automation\flows\01_quick_start\simple_etl.py. To see more options for managing your flow's code, run:

        $ prefect init

? Would you like to configure schedules for this deployment? [y/n] (y): n
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Deployment 'start-flow/pengfei_simple_elt' successfully created with id 'f950cdcb-1f39-4380-b114-8656322ef64e'.                                                                                                                                                                                                 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

View Deployment in UI: http://localhost:4200/deployments/deployment/f950cdcb-1f39-4380-b114-8656322ef64e
```



> Go back to your browser and open `http://localhost:4200`. Click on the `Deployments` tab. You should see something like the figure below.

![deployments_example.png](../assets/deployments_example.png)

> Click on the deployment, then click the `Run` button. A new flow run should be submitted to your work pool. Make sure the worker started in section 2.3 is still running, otherwise the run will wait without starting.

In the `Runs` tab, you can check the output of your workflow.
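
A run can also be requested without the UI. The sketch below posts to the server's REST API with the standard library only. It assumes that the route `POST /api/deployments/{id}/create_flow_run` is available on your server version and that an empty JSON body is accepted; both are assumptions to verify, and the helper names `build_run_request` and `trigger_run` are hypothetical. The deployment id is the one printed by `prefect deploy`.

```python
import json
import urllib.request


def build_run_request(deployment_id, api_url="http://localhost:4200/api"):
    """Build the POST request that asks the server for a new flow run."""
    url = f"{api_url}/deployments/{deployment_id}/create_flow_run"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),  # empty body: use deployment defaults
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_run(deployment_id, api_url="http://localhost:4200/api", timeout=10.0):
    """Send the request and return the server's JSON description of the run."""
    request = build_run_request(deployment_id, api_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `trigger_run("<your deployment id>")` should make a new entry appear in the `Runs` tab, which your worker then picks up from the work pool.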

> More details will be provided in a dedicated Deployment section.