# Use Prefect Cloud with Saturn Cloud




**NOTE**: Prerequisites. Before running this notebook, you should have:
* created a Prefect Cloud account
* set up the appropriate credentials in Saturn
* set up a Prefect Cloud agent in Saturn Cloud

Details on these prerequisites can be found in ["Using Prefect Cloud with Saturn Cloud"](https://saturncloud.io/docs/examples/prefect/prefect_cloud/).



<hr>

## Environment Setup

The code in this notebook uses `prefect` for orchestration *(figuring out what to do, and in what order)* and `dask` for execution *(doing the things)*.

It relies on the following additional non-builtin libraries:

* `pyspark`: read in data from the server and manipulate it
* `dask-saturn`: create and interact with Saturn Cloud `Dask` clusters ([link](https://github.com/saturncloud/dask-saturn))
* `prefect-saturn`: register Prefect flows with Prefect Cloud and have them run on Saturn Cloud Dask clusters ([link](https://github.com/saturncloud/prefect-saturn))

In [11]:
import json
import os
import uuid
from datetime import datetime, timedelta

import pandas as pd
import prefect
import pyspark.sql.functions as F
from prefect import task, Flow, Parameter
from prefect.schedules import IntervalSchedule
from prefect_saturn import PrefectCloudIntegration
from pyspark.context import SparkContext
from pyspark.sql.functions import col, lit
from pyspark.sql.session import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType, FloatType

# Create a local Spark session for reading and transforming the data
sc = SparkContext('local')
spark = SparkSession(sc)

PREFECT_CLOUD_PROJECT_NAME = 'TLCData' #os.environ["TLCData"]
SATURN_USERNAME = os.environ["SATURN_USERNAME"]

Authenticate with Prefect Cloud.

In [21]:
# Read the token from an environment variable instead of hardcoding it in the notebook
!prefect auth login -t ${PREFECT_USER_TOKEN}

Logged in to Prefect Cloud tenant "basangarisoujanya@gmail.com's Account" (basangarisoujanya-gmail-com-s-account)


<hr>

## Create a Prefect Cloud Project

Prefect Cloud organizes flows within workspaces called "projects". Before you can register a flow with Prefect Cloud, it's necessary to create a project if you don't have one yet.

The code below will create a new project in whatever Prefect Cloud tenant you're authenticated with. If that project already exists, this code does not have any side effects.

In [71]:
client = prefect.Client()
client.create_project(project_name=PREFECT_CLOUD_PROJECT_NAME)

'121c084a-5c03-498d-9a5f-fcc5042716f1'

<hr>

## Define Tasks

`prefect` refers to a workload as a "flow", which comprises multiple individual things to do called "tasks" (see [the Prefect docs](https://docs.prefect.io/core/concepts/tasks.html)). The flow in this notebook is broken down into the following tasks:

* `get_trial_id()`: assign a unique ID to each run
* `extract()`: read the yellow and green taxi CSV files (three years of data) from cloud storage
* `transform()`: add hourly columns, combine the two datasets, and write the result in a column-oriented format (Parquet) and a row-oriented format (Avro)
* `load()`: write the combined dataset to CSV and load it into a database
* `get_trial_summary()`: collect summary information about the run in one object
* `write_trial_summary()`: write trial results somewhere
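
The bookkeeping tasks in this list are small enough to try out without Prefect or Spark. The sketch below uses only the standard library; `make_trial_id` and `make_trial_summary` are hypothetical names chosen for illustration, and the row count is passed in as a plain integer rather than computed from a DataFrame.

```python
import json
import uuid


def make_trial_id() -> str:
    """Return a random UUID4 string to identify one run."""
    return str(uuid.uuid4())


def make_trial_summary(trial_id: str, num_obs: int) -> dict:
    """Collect run metadata in a single JSON-serializable object."""
    return {"id": trial_id, "data": {"num_obs": num_obs}}


# Build a summary for a pretend run that processed 3 rows
summary = make_trial_summary(make_trial_id(), num_obs=3)
print(json.dumps(summary))
```

Keeping the summary as a plain dictionary means it can be logged, as in this notebook, or written to a file or database later without changing the tasks that produce it.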

In [72]:
import json
import requests
import pandas as pd
from datetime import datetime, timedelta
from prefect import task, Flow, Parameter

# nout=2 lets the flow unpack the two DataFrames this task returns
@task(max_retries=10, retry_delay=timedelta(seconds=10), nout=2)
def extract(url: str):
    try:
        ypath = url+"yellow_tripdata_*.csv"
        gpath = url+"green_tripdata_*.csv"
      
        taxi_schema = StructType([StructField("VendorID", IntegerType(), False),
                                  StructField("pickup_datetime", TimestampType(), False),
                                  StructField("dropoff_datetime", TimestampType(), False),
                                  StructField("store_and_fwd_flag", StringType(), False),
                                  StructField("RatecodeID", IntegerType(), False),
                                  StructField("PULocationID", IntegerType(), False),
                                  StructField("DOLocationID", IntegerType(), False),
                                  StructField("passenger_count", IntegerType(), False),
                                  StructField("trip_distance", FloatType(), False),
                                  StructField("fare_amount", FloatType(), False),
                                  StructField("extra", FloatType(), False),
                                  StructField("mta_tax", FloatType(), False),
                                  StructField("tip_amount", FloatType(), False),
                                  StructField("tolls_amount", FloatType(), False),
                                  StructField("ehail_fee", FloatType(), False),
                                  StructField("improvement_surcharge", FloatType(), False),
                                  StructField("total_amount", FloatType(), False),
                                  StructField("payment_type", IntegerType(), False),
                                  StructField("trip_type", IntegerType(), False)])

        yellow_df = spark.read.option("header", True)\
        .schema(taxi_schema) \
        .csv(ypath)\
        .withColumnRenamed("tpep_pickup_datetime", "pickup_datetime") \
        .withColumnRenamed("tpep_dropoff_datetime", "dropoff_datetime")\
        .withColumn("taxi_type", lit("yellow")) \
        .withColumn("ehail_fee", lit(0.0).cast(FloatType()))  # keep the type consistent with taxi_schema
   
    
        green_df = spark.read.option("header", True)\
        .schema(taxi_schema) \
        .csv(gpath) \
        .withColumnRenamed("lpep_pickup_datetime", "pickup_datetime") \
        .withColumnRenamed("lpep_dropoff_datetime", "dropoff_datetime")\
        .withColumn("taxi_type", lit("green"))

    except Exception as err:
        # Keep the original error attached so the cause is visible in the logs
        raise Exception('No data fetched!') from err
    
    return yellow_df,green_df


@task
def transform(yellow_df, green_df):
    # Both inputs are Spark DataFrames. Add hour columns by truncating the timestamps;
    # "HH" is the 24-hour clock ("hh" would fold afternoon hours onto morning ones).
    yellow_df = yellow_df.withColumn("pickup_hour", F.from_unixtime(F.unix_timestamp(F.col("pickup_datetime"), "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:00:00"))
    green_df = green_df.withColumn("pickup_hour", F.from_unixtime(F.unix_timestamp(F.col("pickup_datetime"), "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:00:00"))
    yellow_df = yellow_df.withColumn("dropoff_hour", F.from_unixtime(F.unix_timestamp(F.col("dropoff_datetime"), "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:00:00"))
    green_df = green_df.withColumn("dropoff_hour", F.from_unixtime(F.unix_timestamp(F.col("dropoff_datetime"), "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:00:00"))

    
    
    taxi_df = yellow_df.union(green_df)
    taxi_schema = StructType(
      [StructField("VendorID", IntegerType(), False),
      StructField("pickup_datetime", TimestampType(), False),
      StructField("dropoff_datetime", TimestampType(), False),
      StructField("store_and_fwd_flag", StringType(), False),
      StructField("RatecodeID", IntegerType(), False),
      StructField("PULocationID", IntegerType(), False),
      StructField("DOLocationID", IntegerType(), False),
      StructField("passenger_count", IntegerType(), False),
      StructField("trip_distance", FloatType(), False),
      StructField("fare_amount", FloatType(), False),
      StructField("extra", FloatType(), False),
      StructField("mta_tax", FloatType(), False),
      StructField("tip_amount", FloatType(), False),
      StructField("tolls_amount", FloatType(), False),
      StructField("ehail_fee", FloatType(), False),
      StructField("improvement_surcharge", FloatType(), False),
      StructField("total_amount", FloatType(), False),
      StructField("payment_type", IntegerType(), False),
      StructField("trip_type", IntegerType(), False),
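      StructField("taxi_type", StringType(), False),
      StructField("pickup_hour", StringType(), False),
      StructField("dropoff_hour", StringType(), False)])

    # The Parquet writer uses the DataFrame's own schema; taxi_schema above documents the expected columns.
    # NOTE: Spark writes through Hadoop filesystems, so a plain https:// URL may not be writable;
    # this path may need to point to local, HDFS, or object storage instead.
    taxi_df.write.mode('append').parquet("https://cloud.uni-koblenz.de/s/tTcoPwsBdoXnWcG/parquet/taxi_df.parquet")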

    # Avro schema for the row-oriented copy of the data
    avro_schema = {
        "type": "record",
        "name": "avro_schema",
        "fields": [
            {"type": "int", "name": "VendorID"},
            {"type": {"type": "long", "logicalType": "timestamp-micros"}, "name": "pickup_datetime"},
            {"type": {"type": "long", "logicalType": "timestamp-micros"}, "name": "dropoff_datetime"},
            {"type": "string", "name": "store_and_fwd_flag"},
            {"type": "int", "name": "RatecodeID"},
            {"type": "int", "name": "PULocationID"},
            {"type": "int", "name": "DOLocationID"},
            {"type": "int", "name": "passenger_count"},
            {"type": "float", "name": "trip_distance"},
            {"type": "float", "name": "fare_amount"},
            {"type": "float", "name": "extra"},
            {"type": "float", "name": "mta_tax"},
            {"type": "float", "name": "tip_amount"},
            {"type": "float", "name": "tolls_amount"},
            {"type": "float", "name": "ehail_fee"},
            {"type": "float", "name": "improvement_surcharge"},
            {"type": "float", "name": "total_amount"},
            {"type": "int", "name": "payment_type"},
            {"type": "int", "name": "trip_type"},
            {"type": "string", "name": "taxi_type"},
            {"type": "string", "name": "pickup_hour"},
            {"type": "string", "name": "dropoff_hour"},
        ]
    }

    # Writing Avro requires the external spark-avro package on the Spark classpath
    taxi_df.write.format("avro").option("avroSchema", json.dumps(avro_schema)).save("https://cloud.uni-koblenz.de/s/tTcoPwsBdoXnWcG/parquet/taxi_df.avro")
   
    # taxi_df_parquet = spark.read.parquet("https://cloud.uni-koblenz.de/s/tTcoPwsBdoXnWcG/parquet/taxi_df.parquet")
    
    # taxi_df_avro = sqlContext.read.format("com.databricks.spark.avro").load("https://cloud.uni-koblenz.de/s/tTcoPwsBdoXnWcG/parquet/taxi_df.avro")
    
    return taxi_df


@task
def load(taxi_df, path: str) -> None:
    # taxi_df is a Spark DataFrame; write a CSV copy to `path`
    taxi_df.write.csv(path)
    # Database connection settings
    database = "TestDB"
    table = "dbo.tbl_spark_df"
    # Read credentials from the environment instead of hardcoding them.
    # DB_USER and DB_PASSWORD are placeholder names (an assumption); use whatever your setup defines.
    user = os.environ["DB_USER"]
    password = os.environ["DB_PASSWORD"]
    # Write the DataFrame into a SQL table
    taxi_df.write.mode("overwrite") \
    .format("jdbc") \
    .option("url", f"jdbc:sqlserver://localhost/SQLEXPRESS;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .save()

    #for updating
    #taxi_df.write.mode(SaveMode.Append).jdbc(JDBCurl,mySqlTable,connectionProperties)
    
@task
def get_trial_id() -> str:
    #Generate a unique identifier for this trial.

    return str(uuid.uuid4())


@task
def get_trial_summary(trial_id: str, taxi_df) -> dict:
    out = {"id": trial_id}
    out["data"] = {
        # Spark DataFrames have no .shape; count() returns the number of rows
        "num_obs": taxi_df.count(),
    }
    return out


@task(log_stdout=True)
def write_trial_summary(trial_summary: dict):
    """
    Write out a summary of the trial. Currently just logs back to the
    Prefect logger.
    """
    logger = prefect.context.get("logger")
    logger.info(json.dumps(trial_summary))
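
The hour-bucketing step in `transform()` can be checked in plain Python. The sketch below uses only the standard library rather than Spark, and `truncate_to_hour` is a hypothetical helper name. It truncates a timestamp to the start of its hour and shows why the clock format matters: a 24-hour pattern (`%H` here, `HH` in Spark date patterns) keeps morning and afternoon apart, while a 12-hour pattern (`%I` here, `hh` in Spark) maps 15:42 and 03:42 to the same label.

```python
from datetime import datetime


def truncate_to_hour(ts: datetime) -> str:
    """Format a timestamp as the start of its hour, using a 24-hour clock."""
    return ts.replace(minute=0, second=0, microsecond=0).strftime("%Y-%m-%d %H:%M:%S")


morning = truncate_to_hour(datetime(2020, 1, 15, 3, 42, 10))
afternoon = truncate_to_hour(datetime(2020, 1, 15, 15, 42, 10))
print(morning)    # 2020-01-15 03:00:00
print(afternoon)  # 2020-01-15 15:00:00

# With a 12-hour pattern the afternoon trip is indistinguishable from the morning one
twelve_hour = datetime(2020, 1, 15, 15, 42, 10).strftime("%Y-%m-%d %I:00:00")
print(twelve_hour)  # 2020-01-15 03:00:00
```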

<hr>

## Construct a Flow

Now that all of the task logic has been defined, the next step is to compose those tasks into a "flow" (see [the Prefect docs](https://docs.prefect.io/core/concepts/flows.html)).

Because we want this job to run on a schedule, the code below provides one additional argument to `Flow()`, a special "schedule" object. In this case, the code below says "run this flow once every 24 hours".

In [73]:
schedule = IntervalSchedule(interval=timedelta(hours=24))

*NOTE: `prefect` flows do not have to be run on a schedule. To test a single run, just omit `schedule` from the code block below.*
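
Conceptually, a fixed-interval schedule amounts to repeatedly adding the interval to a start time. The sketch below illustrates that idea with the standard library only; it is not Prefect's implementation, and `next_runs` and the start date are hypothetical choices made for the example.

```python
from datetime import datetime, timedelta


def next_runs(start: datetime, interval: timedelta, count: int) -> list:
    """Return the first `count` run times of a fixed-interval schedule."""
    return [start + i * interval for i in range(count)]


# Three consecutive runs, 24 hours apart
runs = next_runs(datetime(2021, 1, 1, 9, 0), timedelta(hours=24), 3)
for run in runs:
    print(run.isoformat())
```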

In [80]:


with Flow(f"{SATURN_USERNAME}-tlcdata", schedule=schedule) as flow:
    param_url = Parameter(name='p_url', required=True)

    yellow_df, green_df = extract(url=param_url)
    taxi_df = transform(yellow_df, green_df)
    load(taxi_df=taxi_df, path=f'C:/Users/Soujanya/users_{int(datetime.now().timestamp())}.csv')
    batch_size = Parameter("batch-size", default=1000)
    trial_id = get_trial_id()

    # get trial summary as a dict
    trial_summary = get_trial_summary(
        trial_id=trial_id,
        taxi_df=taxi_df,
    )

    # store trial summary
    trial_complete = write_trial_summary(trial_summary)

    # Do not call sc.stop() here: code inside this block runs when the flow
    # is defined, before any task has executed.

At this point, we have all of the work defined in tasks and arranged within a flow, but none of the tasks have run yet. In the next section, we'll do that using `Dask`.
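
Defining a flow only records which task depends on which; an orchestrator then derives a valid execution order from those dependencies. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+) instead of Prefect. The dependency dictionary is written by hand to mirror the tasks in this notebook and is an illustration, not something Prefect produces.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks whose output it consumes
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "get_trial_id": set(),
    "get_trial_summary": {"get_trial_id", "transform"},
    "write_trial_summary": {"get_trial_summary"},
}

# Any order that respects the dependencies is valid; tasks with no
# dependency on each other could also run in parallel.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```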

<hr>

## Register with Prefect Cloud

Now that the business logic of the flow is complete, we can add information that Saturn will need to know to run it.

In [81]:
integration = PrefectCloudIntegration(prefect_cloud_project_name=PREFECT_CLOUD_PROJECT_NAME)

Next, run `register_flow_with_saturn()`.

`register_flow_with_saturn()` configures the flow so that it can run on a Saturn Cloud Dask cluster. The code below also customizes the Dask cluster used when executing the flow.

* `n_workers = 3`: use 3 workers
* `worker_size = "xlarge"`: each worker has 4 CPU cores and 32 GB RAM
    - **NOTE**: you can find the full list of sizes with `prefect_saturn.describe_sizes()`
* `scheduler_size = "medium"`: size of the Dask scheduler
* `worker_is_spot = False`: don't use spot instances for workers

**NOTE:** Dask clusters associated with Prefect Cloud flows are closed automatically when the flow run completes.

In [84]:
flow = integration.register_flow_with_saturn(
    flow=flow,
    dask_cluster_kwargs={
        "n_workers": 3,
        "worker_size": "xlarge",
        "scheduler_size": "medium",
        "worker_is_spot": False,
    },
)

flow.run(
    parameters={
        'p_url': 'https://jsonplaceholder.typicode.com/DOES_NOT_EXIST'
    }
)

The final step necessary is to "register" the flow with Prefect Cloud. 

In [1]:
flow.register(project_name=PREFECT_CLOUD_PROJECT_NAME)

<hr>

## Run the flow

If you want to run the flow immediately, navigate to the flow in the Prefect Cloud UI and click "Quick Run", or open a terminal and run the code below.

```shell
prefect auth login --key ${PREFECT_USER_TOKEN}
prefect run flow \
    --name ${SATURN_USERNAME}-tlcdata \
    --project ${PREFECT_CLOUD_PROJECT_NAME}
```