## Simple example of Kubeflow pipeline to predict flight delays
Code is borrowed from https://aiinpractice.com/gcp-mlops-vertex-ai-pipeline-scikit-learn/ and
https://aiinpractice.com/gcp-mlops-vertex-ai-feature-store/

It takes around 10 minutes to run the pipeline. The end of the notebook shows how to use the prediction endpoint.



#### Notes:
- The project works, but there are unexplained bugs with the bucket and data file. For now, only the mpg3-temp-data bucket works.
- Creating an endpoint with the same name as a previously created and deleted endpoint in the same region often fails.
- The slowest part of the pipeline is deploying the model to an endpoint. A more powerful instance for the endpoint seems to speed up this step, but instances more powerful than standard-8 seem to deploy more slowly. replica=3 seems to help too. Surprisingly, increasing the replica count further makes model deployment slower.
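
One possible way to sidestep the endpoint-name collisions noted above is to never reuse a display name. The sketch below appends a UTC timestamp to a base name. The helper `unique_display_name` is hypothetical and not part of the original code, and whether this actually avoids the bug has not been verified here.

```python
from datetime import datetime, timezone
from typing import Optional


def unique_display_name(base: str, now: Optional[datetime] = None) -> str:
    """Append a UTC timestamp so that a display name is never reused."""
    # `now` is injectable so the function is easy to test.
    now = now or datetime.now(timezone.utc)
    return f"{base}-{now.strftime('%Y%m%d%H%M%S')}"


# Example: a fresh name for each pipeline run.
print(unique_display_name("flight-delay-endpoint"))
```

The result could be passed as `display_name` when the endpoint is created, instead of a hand-numbered name.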

#### Next steps:
1. Figure out how to use any bucket and any data file. Done
2. Use more powerful instances to speed up all steps. Done, does not help much
3. Go to the next part and add preprocessing pipeline.
4. How to productionalize this pipeline?
- (i) Write a daily cron job that tries to pull new monthly data. When it succeeds, trigger this pipeline to retrain the model and save a new monthly perf-eval artifact.
- (ii) Simulate real-time user requests daily. Use them to populate a dashboard of daily perf-eval results.
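
As a starting point for step 4(i), the cron job needs to know which monthly file to look for next. The sketch below derives it from the last month that was processed, assuming the files keep the `data/processed/<year>/<year>-<month>.csv` layout used later in this notebook. The helper names `next_month` and `monthly_data_uri` are hypothetical.

```python
def next_month(year_month: str) -> str:
    """Return the month after `year_month`, both formatted as YYYY-MM."""
    year, month = map(int, year_month.split("-"))
    if month == 12:
        return f"{year + 1}-01"
    return f"{year}-{month + 1:02d}"


def monthly_data_uri(bucket: str, year_month: str) -> str:
    """Build the GCS URI of the monthly CSV (layout assumed from this notebook)."""
    year = year_month.split("-")[0]
    return f"gs://{bucket}/data/processed/{year}/{year_month}.csv"


# Example: which file should the daily job poll for after 2020-05?
print(monthly_data_uri("my-bucket", next_month("2020-05")))
```

The daily job would check whether that URI exists and, if so, submit the pipeline with it as `training_data_url`.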


#### 1. Setup

In [4]:
# install Python 3.8 kernel into py38 environment
# !bash startup.sh

In [9]:
from platform import python_version
python_version()

'3.8.16'

In [10]:
import time
from google.cloud import aiplatform as aip
PROJECT_ID = "polished-vault-379315"  
REGION = "us-central1"
BUCKET = 'mpg3-testflights-polished-vault-379315'

time0 = time.time()

aip.init(project=PROJECT_ID, staging_bucket=BUCKET, location=REGION)

flight_delays_feature_store = aip.Featurestore.create(
    "flight_delays1", online_store_fixed_node_count=1
)

flight_entity_type = flight_delays_feature_store.create_entity_type(
    entity_type_id="flight",
    description="Flight entity",
)

flight_entity_type.batch_create_features(
    {
        "origin_airport_id": {
            "value_type": "STRING",
            "description": "Airport ID for the origin",
        },
        "is_cancelled": {
            "value_type": "BOOL",
            "description": "Has the flight been cancelled or diverted?",
        },
        "departure_delay_minutes": {
            "value_type": "DOUBLE",
            "description": "Departure delay in minutes",
        },
        "arrival_delay_minutes": {
            "value_type": "DOUBLE",
            "description": "Arrival delay in minutes",
        },
        "taxi_out_minutes": {
            "value_type": "DOUBLE",
            "description": "Taxi out time in minutes",
        },
        "distance_miles": {
            "value_type": "DOUBLE",
            "description": "Total flight distance in miles.",
        },
    }
)

airport_entity_type = flight_delays_feature_store.create_entity_type(
    entity_type_id="airport",
    description="Airport entity",
)

airport_entity_type.create_feature(
    feature_id="average_departure_delay",
    value_type="DOUBLE",
    description="Average departure delay for that airport, calculated over a 4h sliding window that advances every 1h",
)

print(f'Time to create FeatureStore: {time.time()-time0:.2f} sec')

Creating Featurestore
Create Featurestore backing LRO: projects/662390005506/locations/us-central1/featurestores/flight_delays1/operations/9211118873116409856
Featurestore created. Resource name: projects/662390005506/locations/us-central1/featurestores/flight_delays1
To use this Featurestore in another session:
featurestore = aiplatform.Featurestore('projects/662390005506/locations/us-central1/featurestores/flight_delays1')
Creating EntityType
Create EntityType backing LRO: projects/662390005506/locations/us-central1/featurestores/flight_delays1/entityTypes/flight/operations/7159729242849148928
EntityType created. Resource name: projects/662390005506/locations/us-central1/featurestores/flight_delays1/entityTypes/flight
To use this EntityType in another session:
entity_type = aiplatform.EntityType('projects/662390005506/locations/us-central1/featurestores/flight_delays1/entityTypes/flight')
Batch creating features EntityType entityType: projects/662390005506/locations/us-central1/featu

In [11]:
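# The featurestore ID is the last segment of the resource name in the output above
# (the number after "projects/" is the project number, not the featurestore ID).
# If the ID is lost, it may be easier to recreate the feature store under a new name.
FEATURE_STORE_ID = "flight_delays1"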
# ENDPOINT_ID = "xxx"
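
The creation log above prints the full resource name, so the ID does not have to be copied by hand. A minimal sketch, assuming the `projects/<number>/locations/<region>/featurestores/<id>` format shown in that output. The helper name is hypothetical.

```python
def featurestore_id_from_resource_name(resource_name: str) -> str:
    """Return the segment that follows 'featurestores' in a resource name."""
    parts = resource_name.strip("/").split("/")
    # Raises ValueError if the name does not contain a 'featurestores' segment.
    return parts[parts.index("featurestores") + 1]


name = "projects/662390005506/locations/us-central1/featurestores/flight_delays1"
print(featurestore_id_from_resource_name(name))
```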

In [12]:
# feature_pipeline/feature_pipeline/helpers.py

from typing import Union, get_args, get_origin
from datetime import datetime

def map_to_avro_type(field_type):
    if field_type == str:
        return "string"
    elif field_type == bool:
        return "boolean"
    elif field_type == float:
        return "double"
    elif field_type is type(None):
        return "null"
    elif field_type == datetime:
        return {"type": "long", "logicalType": "timestamp-micros"}
    elif get_origin(field_type) == Union:
        return [map_to_avro_type(t) for t in get_args(field_type)]
    else:
        raise NotImplementedError(f"Unsupported type: {field_type}")


def named_tuple_to_avro_fields(named_tuple):
    fields = []
    for field_name, field_type in named_tuple.__annotations__.items():
        fields.append({"name": field_name, "type": map_to_avro_type(field_type)})
    return fields


csv_headers = [
    "Year",
    "Quarter",
    "Month",
    "DayofMonth",
    "DayOfWeek",
    "FlightDate",
    "Reporting_Airline",
    "DOT_ID_Reporting_Airline",
    "IATA_CODE_Reporting_Airline",
    "Tail_Number",
    "Flight_Number_Reporting_Airline",
    "OriginAirportID",
    "OriginAirportSeqID",
    "OriginCityMarketID",
    "Origin",
    "OriginCityName",
    "OriginState",
    "OriginStateFips",
    "OriginStateName",
    "OriginWac",
    "DestAirportID",
    "DestAirportSeqID",
    "DestCityMarketID",
    "Dest",
    "DestCityName",
    "DestState",
    "DestStateFips",
    "DestStateName",
    "DestWac",
    "CRSDepTime",
    "DepTime",
    "DepDelay",
    "DepDelayMinutes",
    "DepDel15",
    "DepartureDelayGroups",
    "DepTimeBlk",
    "TaxiOut",
    "WheelsOff",
    "WheelsOn",
    "TaxiIn",
    "CRSArrTime",
    "ArrTime",
    "ArrDelay",
    "ArrDelayMinutes",
    "ArrDel15",
    "ArrivalDelayGroups",
    "ArrTimeBlk",
    "Cancelled",
    "CancellationCode",
    "Diverted",
    "CRSElapsedTime",
    "ActualElapsedTime",
    "AirTime",
    "Flights",
    "Distance",
    "DistanceGroup",
    "CarrierDelay",
    "WeatherDelay",
    "NASDelay",
    "SecurityDelay",
    "LateAircraftDelay",
    "FirstDepTime",
    "TotalAddGTime",
    "LongestAddGTime",
    "DivAirportLandings",
    "DivReachedDest",
    "DivActualElapsedTime",
    "DivArrDelay",
    "DivDistance",
    "Div1Airport",
    "Div1AirportID",
    "Div1AirportSeqID",
    "Div1WheelsOn",
    "Div1TotalGTime",
    "Div1LongestGTime",
    "Div1WheelsOff",
    "Div1TailNum",
    "Div2Airport",
    "Div2AirportID",
    "Div2AirportSeqID",
    "Div2WheelsOn",
    "Div2TotalGTime",
    "Div2LongestGTime",
    "Div2WheelsOff",
    "Div2TailNum",
    "Div3Airport",
    "Div3AirportID",
    "Div3AirportSeqID",
    "Div3WheelsOn",
    "Div3TotalGTime",
    "Div3LongestGTime",
    "Div3WheelsOff",
    "Div3TailNum",
    "Div4Airport",
    "Div4AirportID",
    "Div4AirportSeqID",
    "Div4WheelsOn",
    "Div4TotalGTime",
    "Div4LongestGTime",
    "Div4WheelsOff",
    "Div4TailNum",
    "Div5Airport",
    "Div5AirportID",
    "Div5AirportSeqID",
    "Div5WheelsOn",
    "Div5TotalGTime",
    "Div5LongestGTime",
    "Div5WheelsOff",
    "Div5TailNum",
]

In [13]:
from typing import NamedTuple, Optional
from datetime import datetime

class Flight(NamedTuple):
    timestamp: Optional[datetime]
    flight_number: str
    origin_airport_id: str
    is_cancelled: bool
    departure_delay_minutes: float
    arrival_delay_minutes: float
    taxi_out_minutes: float
    distance_miles: float


flight_avro_schema = {
    "namespace": "flight_delay_prediction",
    "type": "record",
    "name": "Flight",
    "fields": named_tuple_to_avro_fields(Flight),
}


class AirportFeatures(NamedTuple):
    timestamp: Optional[datetime]
    origin_airport_id: str
    average_departure_delay: float


airport_avro_schema = {
    "namespace": "flight_delay_prediction",
    "type": "record",
    "name": "Airport",
    "fields": named_tuple_to_avro_fields(AirportFeatures),
}
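
The `timestamp: Optional[datetime]` annotation is what makes `map_to_avro_type` take its `Union` branch. The short check below, using only the standard library, shows how `typing` exposes that annotation.

```python
from datetime import datetime
from typing import Optional, Union, get_args, get_origin

field_type = Optional[datetime]

# Optional[X] is shorthand for Union[X, None].
is_union = get_origin(field_type) is Union
members = get_args(field_type)

print(is_union)  # True
print(members)   # the datetime class followed by NoneType
```

Given that order, the helper above would map the field to a two-element Avro union: the `timestamp-micros` long type followed by `"null"`.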

#### 2. Build pipeline

In [14]:
# batch_feature_pipeline.py

import argparse
import logging
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def parse_csv(line: str):
    import csv
    return next(csv.reader([line]))


def parse_line(fields):
    from datetime import datetime
    from apache_beam.utils.timestamp import Timestamp

    data = dict(zip(csv_headers, fields))

    if (
        data["Year"] != "Year"  # skip header row
        and len(data["WheelsOff"]) == 4  # wheels-off time is a 4-digit HHMM string
        and len(data["FlightDate"]) == 10  # row has a flight date
        and data["Distance"] != ""
    ):
        wheels_off_hour = data["WheelsOff"][:2]
        wheels_off_minutes = data["WheelsOff"][2:]
        departure_date_time = (
            f"{data['FlightDate']}T{wheels_off_hour}:{wheels_off_minutes}:00"
        )

        cancelled = (float(data["Cancelled"]) > 0) or (float(data["Diverted"]) > 0)

        try:
            flight = Flight(
                timestamp=datetime.fromisoformat(departure_date_time),
                origin_airport_id=str(data["OriginAirportID"]),
                flight_number=f"{data['Reporting_Airline']}//{data['Flight_Number_Reporting_Airline']}",
                is_cancelled=cancelled,
                departure_delay_minutes=float(data["DepDelay"]),
                arrival_delay_minutes=float(data["ArrDelay"]),
                taxi_out_minutes=float(data["TaxiOut"]),
                distance_miles=float(data["Distance"]),
            )

            yield beam.window.TimestampedValue(
                flight, Timestamp.from_rfc3339(departure_date_time)
            )
        except ValueError:
            # Skip rows with missing or malformed numeric or time fields.
            pass


class BuildTimestampedRecordFn(beam.DoFn):
    def process(self, element, window=beam.DoFn.WindowParam):

        window_start = window.start.to_utc_datetime()
        return [
            AirportFeatures(
                timestamp=window_start,
                origin_airport_id=element.origin_airport_id,
                average_departure_delay=element.average_departure_delay,
            )._asdict()
        ]


class BuildTimestampedFlightRecordFn(beam.DoFn):
    def process(self, element: Flight, window=beam.DoFn.WindowParam):
        return [element._asdict()]


def run(argv=None, save_main_session=False):
    """Main entry point; defines and runs the batch feature pipeline.
    The default argument values are placeholders and are not meant to be used."""

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input",
        dest="input",
        default="/Users/simon/projects/private/gcp_mlops/data/processed/2020/2020-05.csv",
        help="Input file to process.",
    )
    parser.add_argument(
        "--output-airports",
        dest="output_airports",
        default="/Users/simon/projects/private/gcp_mlops/data/output_airports/",
        help="Output path prefix for the airport feature Avro files.",
    )

    parser.add_argument(
        "--output-flights",
        dest="output_flights",
        default="/Users/simon/projects/private/gcp_mlops/data/output_flights/",
        help="Output path prefix for the flight feature Avro files.",
    )

    parser.add_argument(
        "--output-read-instances",
        dest="output_read_instances",
        default="/Users/simon/projects/private/gcp_mlops/data/output_read_instances/",
        help="Output path prefix for the read-instances CSV file.",
    )

    # Parse beam arguments (e.g. --runner=DirectRunner to run the pipeline locally)
    known_args, pipeline_args = parser.parse_known_args(argv)

    # We use the save_main_session option because one or more DoFn's in this
    # workflow rely on global context (e.g., a module imported at module level).
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as pipeline:
        flights = (
            pipeline
            | "read_input" >> beam.io.ReadFromText(known_args.input)
            | "parse_csv" >> beam.Map(parse_csv)
            | "create_flight_obj" >> beam.FlatMap(parse_line).with_output_types(Flight)
        )

        # Create airport data
        (
            flights
            | "window"
            >> beam.WindowInto(
                beam.window.SlidingWindows(4 * 60 * 60, 60 * 60)
            )  # 4h time windows, every 60min
            | "group_by_airport"
            >> beam.GroupBy("origin_airport_id").aggregate_field(
                "departure_delay_minutes",
                beam.combiners.MeanCombineFn(),
                "average_departure_delay",
            )
            | "add_timestamp" >> beam.ParDo(BuildTimestampedRecordFn())
            | "write_airport_data"
            >> beam.io.WriteToAvro(
                known_args.output_airports, schema=airport_avro_schema
            )
        )

        # Create flight data
        (
            flights
            | "format_output" >> beam.ParDo(BuildTimestampedFlightRecordFn())
            | "write_flight_data"
            >> beam.io.WriteToAvro(known_args.output_flights, schema=flight_avro_schema)
        )

        # Create read_instances.csv to retrieve training data from the feature store
        (
            flights
            | "format_read_instances_output"
            >> beam.Map(
                lambda flight: f"{flight.flight_number},{flight.origin_airport_id},{flight.timestamp.isoformat('T') + 'Z'}"
            )
            | "write_read_instances"
            >> beam.io.WriteToText(
                known_args.output_read_instances,
                file_name_suffix=".csv",
                num_shards=1,
                header="flight,airport,timestamp",
            )
        )
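
With `SlidingWindows(4 * 60 * 60, 60 * 60)`, every flight contributes to four overlapping 4h windows, one starting each hour. The pure-Python sketch below reproduces that assignment arithmetic for a timestamp in seconds. It assumes a zero window offset and is an illustration, not Beam's own code. The function name is hypothetical.

```python
from typing import List, Tuple


def sliding_windows(timestamp: int, size: int, period: int) -> List[Tuple[int, int]]:
    """Return the [start, end) windows, in seconds, that contain `timestamp`."""
    # Latest window start that is aligned to `period` and not after the timestamp.
    last_start = timestamp - timestamp % period
    return [
        (start, start + size)
        for start in range(last_start, last_start - size, -period)
    ]


# A flight departing at 05:02:00 (seconds since midnight).
for start, end in sliding_windows(5 * 3600 + 120, size=4 * 3600, period=3600):
    print(start // 3600, "h to", end // 3600, "h")
```

This is why `BuildTimestampedRecordFn` stamps each aggregate with `window.start`: the same flight shows up in several airport records, one per window.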

In [None]:
# run this pipeline

!bash startup.sh

In [None]:
# ingest batch features:

from google.cloud import aiplatform as aip
aip.init(project=PROJECT_ID, staging_bucket=BUCKET, location=REGION)

flight_delays_feature_store = aip.Featurestore(
    FEATURE_STORE_ID,
    project=PROJECT_ID,
    location=REGION,
)

flight_entity_type = flight_delays_feature_store.get_entity_type("flight")
flight_entity_type.ingest_from_gcs(
    feature_ids=[
        "origin_airport_id",
        "is_cancelled",
        "departure_delay_minutes",
        "arrival_delay_minutes",
        "taxi_out_minutes",
        "distance_miles",
    ],
    feature_time="timestamp",
    gcs_source_uris=f"gs://{BUCKET}/features/flight_features/*",
    gcs_source_type="avro",
    entity_id_field="flight_number",
)

airport_entity_type = flight_delays_feature_store.get_entity_type("airport")
airport_entity_type.ingest_from_gcs(
    feature_ids=["average_departure_delay"],
    feature_time="timestamp",
    gcs_source_uris=f"gs://{BUCKET}/features/airport_features/*",
    gcs_source_type="avro",
    entity_id_field="origin_airport_id",
)

In [8]:
# The next step is to run the training pipeline; in the original repo this is main.py.

In [100]:
USER_NAME="oo00011760@gmail.com" 
PROJECT_ID = "polished-vault-379315"  
REGION = "us-central1"
# REGION = "us-east1"
! gcloud config set project $PROJECT_ID

Updated property [core/project].


In [101]:
SERVICE_ACCOUNT = 'vertex-ai-service-account@polished-vault-379315.iam.gserviceaccount.com'
print(f'Service Account: {SERVICE_ACCOUNT}')

Service Account: vertex-ai-service-account@polished-vault-379315.iam.gserviceaccount.com


In [102]:
# BUCKET_NAME = 'training_data_' + PROJECT_ID
BUCKET_NAME = 'mpg3-testflights-polished-vault-379315'
# BUCKET_NAME = 'mpg3-temp-data'

BUCKET_URI = "gs://" + BUCKET_NAME
! gsutil mb -l $REGION $BUCKET_URI
! gsutil ls -al $BUCKET_URI

Creating gs://mpg3-testflights-polished-vault-379315/...
ServiceException: 409 A Cloud Storage bucket named 'mpg3-testflights-polished-vault-379315' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
                                 gs://mpg3-testflights-polished-vault-379315/data/
                                 gs://mpg3-testflights-polished-vault-379315/pipeline-output/


In [103]:
! gsutil ls -al $BUCKET_URI

                                 gs://mpg3-testflights-polished-vault-379315/data/
                                 gs://mpg3-testflights-polished-vault-379315/pipeline-output/


In [104]:
# Most of the commands from setup.sh are still missing; the bash code needs to be translated into Python.

In [105]:
from google.cloud import aiplatform as aip
from kfp.v2.dsl import (
    Artifact,
    Dataset,
    Input,
    Model,
    Output,
    ClassificationMetrics,
    component,
    pipeline,
)
from kfp.v2 import compiler

from google_cloud_pipeline_components.v1.endpoint import EndpointCreateOp, ModelDeployOp
from google_cloud_pipeline_components.v1.model import ModelUploadOp


# BUCKET = f"training_data_{PROJECT_ID}"
BUCKET = BUCKET_NAME
pipeline_root_path = f"gs://{BUCKET}/pipeline-output/"
print(BUCKET)

mpg3-testflights-polished-vault-379315


In [106]:
pipeline_root_path

'gs://mpg3-testflights-polished-vault-379315/pipeline-output/'

In [107]:
@component(
    packages_to_install=['gcsfs', 'fsspec'],
    base_image="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)
def data_download(
    data_url: str,
    split_date: str,
    dataset_train: Output[Dataset],
    dataset_test: Output[Dataset],
):
    import pandas as pd
    import logging

    # logging.warn is deprecated, and extra arguments need a %s placeholder.
    logging.warning("Import file: %s", data_url)

    data = pd.read_csv(data_url, nrows=5000)

    cancelled = (data["Cancelled"] > 0) | (data["Diverted"] > 0)
    completed_flights = data[~cancelled]

    # Copy the slice so that adding the target column does not modify a view.
    training_data = completed_flights[["DepDelay", "TaxiOut", "Distance"]].copy()
    # Consider flights that arrive more than 15 min late as delayed
    training_data["target"] = completed_flights["ArrDelay"] > 15

    test_data = training_data[completed_flights["FlightDate"] >= split_date]
    training_data = training_data[completed_flights["FlightDate"] < split_date]

    training_data.to_csv(dataset_train.path, index=False)
    test_data.to_csv(dataset_test.path, index=False)
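
The labelling and train/test split in `data_download` can be sketched without pandas. The version below works on plain dictionaries and assumes `FlightDate` is a `YYYY-MM-DD` string, so that string comparison matches date order. The function name and sample rows are made up for illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def label_and_split(rows: List[Row], split_date: str) -> Tuple[List[Row], List[Row]]:
    """Label flights as delayed (ArrDelay > 15) and split them on FlightDate."""
    train: List[Row] = []
    test: List[Row] = []
    for row in rows:
        labelled = {
            "DepDelay": row["DepDelay"],
            "TaxiOut": row["TaxiOut"],
            "Distance": row["Distance"],
            "target": row["ArrDelay"] > 15,
        }
        # ISO dates sort lexicographically, so comparing strings is enough.
        (test if row["FlightDate"] >= split_date else train).append(labelled)
    return train, test


rows = [
    {"FlightDate": "2020-05-19", "DepDelay": -4.0, "TaxiOut": 16.0, "Distance": 153.0, "ArrDelay": -10.0},
    {"FlightDate": "2020-05-20", "DepDelay": 30.0, "TaxiOut": 20.0, "Distance": 400.0, "ArrDelay": 25.0},
]
train, test = label_and_split(rows, "2020-05-20")
print(len(train), len(test))
```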

In [108]:
@component(
    base_image="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)
def model_train(
    dataset: Input[Dataset],
    model: Output[Artifact],
):
    import pandas as pd
    import pickle
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    data = pd.read_csv(dataset.path)
    X = data.drop(columns=["target"])
    y = data["target"]

    model_pipeline = Pipeline(
        [
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler()),
            ("clf", LogisticRegression(random_state=42, tol=0.0001, max_iter=100)),
        ]
    )

    model_pipeline.fit(X, y)

    model.metadata["framework"] = "scikit-learn"
    model.metadata["containerSpec"] = {
        "imageUri": "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    }

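    import pathlib

    # Create the artifact directory before writing the pickled model into it.
    pathlib.Path(model.path).mkdir(parents=True, exist_ok=True)
    file_name = model.path + "/model.pkl"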
    with open(file_name, "wb") as file:
        pickle.dump(model_pipeline, file)

In [109]:
@component(
    base_image="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
)
def model_evaluate(
    test_set: Input[Dataset],
    model: Input[Model],
    metrics: Output[ClassificationMetrics],
):
    import pandas as pd
    import pickle
    from sklearn.metrics import roc_curve, confusion_matrix

    data = pd.read_csv(test_set.path)[:1000]
    file_name = model.path + "/model.pkl"
    with open(file_name, "rb") as file:
        model_pipeline = pickle.load(file)

    X = data.drop(columns=["target"])
    y = data.target
    y_pred = model_pipeline.predict(X)

    y_scores = model_pipeline.predict_proba(X)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true=y, y_score=y_scores, pos_label=True)
    metrics.log_roc_curve(fpr.tolist(), tpr.tolist(), thresholds.tolist())

    metrics.log_confusion_matrix(
        ["False", "True"],
        confusion_matrix(y, y_pred).tolist(),
    )
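
The labels `["False", "True"]` passed to `log_confusion_matrix` rely on the layout scikit-learn uses: rows are the true class, columns are the predicted class, with `False` before `True`. A dependency-free sketch of that layout for the binary case follows. The function name and sample labels are hypothetical.

```python
from typing import List, Sequence


def confusion_matrix_2x2(y_true: Sequence[bool], y_pred: Sequence[bool]) -> List[List[int]]:
    """Count outcomes with rows = true class and columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for truth, predicted in zip(y_true, y_pred):
        # int(False) == 0 and int(True) == 1, which gives the False/True ordering.
        matrix[int(truth)][int(predicted)] += 1
    return matrix


y_true = [False, False, True, True, True]
y_pred = [False, True, True, True, False]
print(confusion_matrix_2x2(y_true, y_pred))  # [[1, 1], [1, 2]]
```

Here `matrix[1][0]` counts delayed flights that the model predicted as on time.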

In [110]:
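# Define the workflow of the pipeline.
# Note: the function below reuses the name of the imported `pipeline` decorator,
# so re-running this cell requires re-importing `pipeline` from kfp first.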
@pipeline(name="gcp-mlops-v0", pipeline_root=pipeline_root_path)
def pipeline(
    training_data_url: str = f"gs://{BUCKET}/data/processed/2020/2020-05.csv",
    test_split_date: str = "2020-05-20",
):
    data_op = data_download(
        data_url=training_data_url,
        split_date=test_split_date
    )

    from google_cloud_pipeline_components.experimental.custom_job.utils import (
        create_custom_training_job_op_from_component,
    )

    custom_job_distributed_training_op = create_custom_training_job_op_from_component(
        model_train,
        replica_count=1,
        machine_type="n1-standard-8",
    )

    model_train_op = custom_job_distributed_training_op(
        dataset=data_op.outputs["dataset_train"],
        project=PROJECT_ID,
        location=REGION,
    )

    model_evaluate_op = model_evaluate(
        test_set=data_op.outputs["dataset_test"],
        model=model_train_op.outputs["model"],
    )

    model_upload_op = ModelUploadOp(
        project=PROJECT_ID,
        location=REGION,
        display_name="flight-delay-model",
        unmanaged_container_model=model_train_op.outputs["model"],
    ).after(model_evaluate_op)

    endpoint_create_op = EndpointCreateOp(
        project=PROJECT_ID,
        location=REGION,
        display_name="flight-delay-endpoint12",
    )

    ModelDeployOp(
        endpoint=endpoint_create_op.outputs["endpoint"],
        model=model_upload_op.outputs["model"],
        deployed_model_display_name="flight-delay-model",
        dedicated_resources_machine_type="n1-standard-8",
        dedicated_resources_min_replica_count=1,
        dedicated_resources_max_replica_count=3,
    )


In [111]:
compiler.Compiler().compile(pipeline_func=pipeline, package_path="gcp-mlops-v0.json")

aip.init(project=PROJECT_ID, staging_bucket=BUCKET, location=REGION)

job = aip.PipelineJob(
    display_name="gcp-mlops-v0",
    template_path="gcp-mlops-v0.json",
    pipeline_root=pipeline_root_path,
)

job.run(service_account=SERVICE_ACCOUNT)

Creating PipelineJob
PipelineJob created. Resource name: projects/662390005506/locations/us-central1/pipelineJobs/gcp-mlops-v0-20230423181825
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/662390005506/locations/us-central1/pipelineJobs/gcp-mlops-v0-20230423181825')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/gcp-mlops-v0-20230423181825?project=662390005506
PipelineJob projects/662390005506/locations/us-central1/pipelineJobs/gcp-mlops-v0-20230423181825 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/662390005506/locations/us-central1/pipelineJobs/gcp-mlops-v0-20230423181825 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/662390005506/locations/us-central1/pipelineJobs/gcp-mlops-v0-20230423181825 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/662390005506/locations/us-central1/pipelineJobs/gcp-mlops-v0-2023042

In [39]:
# predictions from Python

ENDPOINT_ID = '2557670754392997888'
# Get the ID from `gcloud ai endpoints list --region=us-central1`,
# after running `gcloud config set project polished-vault-379315`.

from google.cloud import aiplatform as aip

aip.init(project=PROJECT_ID, location=REGION)
endpoint = aip.Endpoint(ENDPOINT_ID)
prediction = endpoint.predict(instances=[[-4.0, 16.0, 153.0]])
print(f'Prediction is: {prediction}')

Prediction is: Prediction(predictions=[False], deployed_model_id='2932338137650692096', model_version_id='1', model_resource_name='projects/662390005506/locations/us-central1/models/4506214266021347328', explanations=None)
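
Each instance sent to the endpoint is a list of the three training features in the order `DepDelay`, `TaxiOut`, `Distance`. The sketch below builds the JSON request body from named values so that the order cannot be mixed up. The helper `build_predict_payload` is hypothetical. The same string can be saved to a file and posted to the REST endpoint.

```python
import json
from typing import Dict, List

# Column order used when the model was trained in this notebook.
FEATURE_ORDER = ("DepDelay", "TaxiOut", "Distance")


def build_predict_payload(flights: List[Dict[str, float]]) -> str:
    """Serialize flights into the {"instances": [[...], ...]} request body."""
    instances = [[float(flight[name]) for name in FEATURE_ORDER] for flight in flights]
    return json.dumps({"instances": instances})


payload = build_predict_payload([{"DepDelay": -4, "TaxiOut": 16, "Distance": 153}])
print(payload)
```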


In [14]:
# Run the commands below in a shell, not in a notebook cell, to test the endpoint.

# gcloud auth application-default login
nano INPUT.json

{
  "instances": [[1, 15, 400]]
}

ENDPOINT_ID="3891580669024796672"
PROJECT_ID="polished-vault-379315"
INPUT_DATA_FILE="INPUT.json"

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/endpoints/${ENDPOINT_ID}:predict \
-d "@${INPUT_DATA_FILE}"


SyntaxError: invalid syntax (3469231046.py, line 4)

In [None]:
training_data_url: str = f"gs://{BUCKET}/data/processed/2020/2020-05.csv"
training_data_url

In [None]:
import pandas as pd
data = pd.read_csv(training_data_url, nrows=2000)
data.head(2)

In [None]:
training_data = data[["DepDelay", "TaxiOut", "Distance"]]
training_data.head()