# Run Spark with Prefect

At CASD, we recommend that our clients split responsibilities between the two tools: Prefect orchestrates tasks and flows (e.g. scheduling, retries, dependencies, observability), while PySpark executes distributed computations (e.g. data processing and transformation).

> That's why we do not recommend using Prefect parallelism for data processing.

## 1. Integrate Spark into a Prefect task

There are two options:
- Create an in-process `SparkSession` inside the task
- Call `spark-submit` from the command line (not recommended on Windows servers)


### 1.1 A simple example

In the example below, we define a workflow with two tasks:
- task 1: read a text file, count the occurrences of each unique word, then write the result to a CSV file
- task 2: read the CSV output of task 1, then keep only the words that appear in a given list

```python
import logging

from prefect import flow, task, get_run_logger
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# The relative path works because the Prefect worker and Spark both run in local mode.
# If the Prefect worker or the Spark workers run on a remote server, this path won't work.
data_dir = "../../data"
# data_dir = "C:/Users/PLIU/Documents/git/Seminar_workflow_automation/data"

# Change user_name to your own, so that spark.local.dir points to a directory dedicated to you and avoids file access conflicts.
user_name = "pengfei"


@task(name="task_1",
      description="task 1 read a text file, count numbers of unique words, then write result in a csv file")
def wordcount_task(source_file: str, out_file: str):
    spark = (
        SparkSession.builder
        .appName("prefect_wordcount")
        .master("local[4]")  # Limit CPU usage
        .config("spark.local.dir", f"{data_dir}/spark_temp/{user_name}")
        .getOrCreate()
    )

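    try:
        # Split each line into words, then count the occurrences of each word
        df = spark.read.text(source_file)
        counts = (
            df.rdd.flatMap(lambda row: row[0].split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b)
        )
        counts.toDF(["word", "count"]).write.mode("overwrite").csv(out_file)
    finally:
        # Always release the session, even if the job fails
        spark.stop()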


@task(name="task_2", description="task 2 reads the output csv file of task 1, filter the results with a given list")
def filter_task(source_file: str, out_file: str, target_words: list[str]):
    spark = (
        SparkSession.builder
        .appName("prefect_word_filter")
        .master("local[4]")  # Limit CPU usage
        .config("spark.local.dir", f"{data_dir}/spark_temp/{user_name}")
        .getOrCreate()
    )
    schema = StructType([
        StructField("word", StringType(), True),
        StructField("count", IntegerType(), True)])

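    try:
        # Keep only the rows whose word is in the target list
        df = spark.read.csv(source_file, header=False, schema=schema)
        result = df.filter(col("word").isin(target_words))
        result.write.mode("overwrite").csv(out_file)
    finally:
        # Always release the session, even if the job fails
        spark.stop()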


@flow(name="spark_wordcount_flow",
      description="This workflow reads a plain text file and counts words; errors are handled with the task state",
      version="1.0.0")
def main_flow(target_words: list[str]):
    # set up logger
    logger = get_run_logger()
    logger.setLevel(logging.INFO)
    # run task 1
    src_file1 = f"{data_dir}/source/word_raw.txt"
    out_file1 = f"{data_dir}/out/wc_out"
    t1_state = wordcount_task(src_file1, out_file1, return_state=True)
    # check task 1 state, before start task2
    if t1_state.is_completed():
        # run task 2
        out_file2 = f"{data_dir}/out/flow_out"
        filter_task(out_file1, out_file2, target_words)
    else:
        logger.error("Task 1 did not complete successfully, so task 2 will not run.")


if __name__ == "__main__":
    target_words = ["data", "file"]
    main_flow(target_words)

```

Pay attention to the following points:
- How are the file paths configured?
- Why does each task define its own Spark session?
- How is error handling done?
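
On the file path point, one way to avoid depending on the current working directory is to resolve every path to an absolute one before handing it to Spark. The sketch below is only an illustration: `resolve_paths` is a hypothetical helper, and the directory layout simply mirrors the one used in the example above. Absolute paths do not solve the remote-server case; there the files still have to be reachable from the machine that runs Spark.

```python
from pathlib import Path


def resolve_paths(data_dir: str, user_name: str) -> dict[str, str]:
    """Build absolute, forward-slash paths for the word count example.

    The layout (source/, out/, spark_temp/<user>) is assumed to match
    the example flow; adapt it to your own project.
    """
    root = Path(data_dir).resolve()  # absolute path based on the current working directory
    return {
        "source": (root / "source" / "word_raw.txt").as_posix(),
        "wordcount_out": (root / "out" / "wc_out").as_posix(),
        "filter_out": (root / "out" / "flow_out").as_posix(),
        "spark_local_dir": (root / "spark_temp" / user_name).as_posix(),
    }


paths = resolve_paths("../../data", "pengfei")
for name, path in paths.items():
    print(f"{name}: {path}")
```

The returned strings can then replace the f-string paths built from `data_dir` in the flow and in the `spark.local.dir` setting.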

## 2. Resource concurrency

Prefect allows users to run workflows in parallel, so if we don't set a `concurrency limit`, the server may become overloaded.

For example, suppose the server has `8 vCores and 32 GB of memory`, and three workflows each need 4 vCores and 16 GB of memory (i.e. a typical Spark job). The server does not have enough resources to run all three workflows properly.
The server will still try to finish every workflow by sharing its resources with `fairness`: for example 10 seconds for workflow 1, then 10 seconds for workflow 2, and so on, because it cannot serve every workflow at the same time. In this scenario, you will experience lag everywhere, even in the Windows file explorer.

To avoid this situation, CASD proposes the setup below (here we assume the server has 8 vCores and 32 GB of memory):
- Global concurrency limit = 2 (only two workflows can run simultaneously)
- One work pool per user (`work-pool=1`) with an attached `worker concurrency limit=1` (each user can run only one workflow at a time)
- The Spark session must set two options: `.master("local[4]")` (use 4 cores) and `.config("spark.driver.memory", "8g")`

> Do not use `local[*]` when creating the Spark session. It would use all the CPU cores of the server.
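
The limit of two concurrent workflows follows from dividing the server capacity by what one job needs. The snippet below is a small sketch of that arithmetic; `max_concurrent_jobs` is a hypothetical helper, and the numbers are the ones from the example in this section (8 vCores and 32 GB on the server, 4 vCores and 16 GB per workflow).

```python
def max_concurrent_jobs(server_cores: int, server_memory_gb: int,
                        job_cores: int, job_memory_gb: int) -> int:
    """Return how many identical jobs fit on the server at the same time."""
    if job_cores <= 0 or job_memory_gb <= 0:
        raise ValueError("job requirements must be positive")
    # The scarcer resource (CPU or memory) decides the limit
    return min(server_cores // job_cores, server_memory_gb // job_memory_gb)


limit = max_concurrent_jobs(server_cores=8, server_memory_gb=32,
                            job_cores=4, job_memory_gb=16)
print(limit)  # 2

# With four users submitting at once, the remaining workflows have to wait
waiting = max(0, 4 - limit)
print(waiting)  # 2
```

If the per-job requirements change, rerunning the same calculation gives the value to use for the global concurrency limit.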

The figure below shows an example. Four users launch a workflow at the same time, so 16 cores are required in total. With our setup, only two workflows can run at the same time; the other two wait in the work-pool queue, so the server is never overloaded.

```text
                ┌────────────────────────────────────────────┐
                │              Prefect Server                │
                └────────────────────────────────────────────┘
                                  │
                        (global concurrency limit, limit=2)
                                  │
      ┌──────────────────┬──────────────────┬──────────────────┬──────────────────┐
      │ User A Worker    │ User B Worker    │ User C Worker    │ User D Worker    │
      │ --limit 1        │ --limit 1        │ --limit 1        │ --limit 1        │
      └──────────────────┴──────────────────┴──────────────────┴──────────────────┘
        Spark  │  local[4]  Spark │  local[4]  Spark │ local[4]  Spark │ local[4]
               ▼                  ▼                  ▼                 ▼
        4 cores required     4 cores required   4 cores required   4 cores required

```



## 3. Use spark-submit in Prefect (Advanced)

You can use the command below to run a Spark job via `spark-submit`:

```powershell
spark-submit --master "local[4]" --conf "spark.driver.memory=4g" .\projects\05_run_spark_with_prefect\spark_jobs\word_count.py ".\data\source\word_raw.txt" ".\data\out\spark_submit_out"
```
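
To launch the same command from Python, for instance from inside the body of a Prefect task, one option is the standard `subprocess` module. This is a minimal sketch under assumptions: `build_spark_submit_cmd` and `run_spark_submit` are hypothetical helper names, and `spark-submit` is assumed to be on the `PATH` of the machine that runs the worker.

```python
import shutil
import subprocess


def build_spark_submit_cmd(script: str, args: list[str], cores: int = 4,
                           driver_memory: str = "4g",
                           executable: str = "spark-submit") -> list[str]:
    """Return the spark-submit call as an argument list (no shell quoting needed)."""
    return [
        executable,
        "--master", f"local[{cores}]",
        "--conf", f"spark.driver.memory={driver_memory}",
        script,
        *args,
    ]


def run_spark_submit(script: str, args: list[str], cores: int = 4,
                     driver_memory: str = "4g") -> str:
    """Run the Spark job and return its standard output."""
    # shutil.which also finds spark-submit.cmd on Windows
    executable = shutil.which("spark-submit")
    if executable is None:
        raise FileNotFoundError("spark-submit was not found on PATH")
    cmd = build_spark_submit_cmd(script, args, cores, driver_memory, executable)
    # check=True raises CalledProcessError when spark-submit exits with a non-zero code
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return completed.stdout


cmd = build_spark_submit_cmd(
    "projects/05_run_spark_with_prefect/spark_jobs/word_count.py",
    ["data/source/word_raw.txt", "data/out/spark_submit_out"],
)
print(" ".join(cmd))
```

Because `run_spark_submit` raises an exception when the job fails, a Prefect task that calls it would end in a failed state, which the flow can check in the same way as in the example of section 1.1.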

