Note: before running this project you need to enable Vertex AI and BigQuery on Google Cloud, and change the project ID to your own.
You also need the pyarrow package, which you can install manually with: pip install google-cloud-aiplatform[datasets] (although it should already be included in requirements.txt)
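
As a quick sanity check before running the cells, the sketch below looks up the modules this notebook relies on without importing them. The list of module names and the helper `find_missing_modules` are our own assumptions for illustration, not part of any Google library.

```python
import importlib.util

# Modules used in this notebook; pyarrow is the one that tends to be missing.
# This list is an assumption, adjust it to your own environment.
REQUIRED_MODULES = [
    "google.cloud.aiplatform",
    "google.cloud.bigquery",
    "pandas",
    "pyarrow",
]


def find_missing_modules(module_names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package (for example "google") is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported missing, installing from requirements.txt or running the pip command above should normally resolve it.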

In [1]:
from google.cloud import aiplatform
from vertexai import preview

First we define our project ID and our region

In [2]:
PROJECT_ID = "automl-408813"  # @param {type:"string"}

# Set the project ID
! gcloud config set project {PROJECT_ID}
REGION = "europe-west1"  # @param {type: "string"}

Updated property [core/project].
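
Since the project ID has to be edited by hand, a small format check can catch typos before gcloud does. This is a minimal sketch based on the Google Cloud naming rules for project IDs as we understand them (6 to 30 characters, lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen); the helper `is_valid_project_id` is hypothetical and does not check that the project actually exists.

```python
import re

# 1 leading letter + 4..28 middle characters + 1 trailing letter/digit = 6..30.
PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id):
    """Return True if the string matches the assumed project ID format."""
    return PROJECT_ID_PATTERN.fullmatch(project_id) is not None


print(is_valid_project_id("automl-408813"))
print(is_valid_project_id("My_Project"))
```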


Then we log into gcloud so we can start performing our actions

In [3]:
! gcloud auth login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=32555940559.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=W3wwDRReIP9YOSLWGr6e3MoNGY7gJ0&access_type=offline&code_challenge=jwJb82v_FHyQeb1FprYxbfjIC8_GlwHz7ZU-ebUEGyo&code_challenge_method=S256


You are now logged in as [max.etman@student.kdg.be].
Your current project is [automl-408813].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


Initialise the aiplatform SDK with our project and region

In [6]:
display_name = "gsod_unique"

aiplatform.init(project=PROJECT_ID, location=REGION)

Then we create a BigQuery dataset and a table in which we will store our data, and load the CSV file into that table

In [7]:
from google.cloud import bigquery

# Replace with your project, dataset, and table names
dataset_id = "dai56_automl_v1"
table_id = "spotify_tracks_dai56_automl_v1"

def create_dataset(project_id, dataset_id):
    client = bigquery.Client(project=project_id, location=REGION)
    dataset_ref = client.dataset(dataset_id)

    dataset = bigquery.Dataset(dataset_ref)
    dataset = client.create_dataset(dataset)
    print(f"Dataset {dataset.dataset_id} created.")

def create_table(project_id, dataset_id, table_id, schema):
    client = bigquery.Client(project=project_id, location=REGION)
    dataset_ref = client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    table = bigquery.Table(table_ref, schema=schema)
    table = client.create_table(table)
    print(f"Table {table.table_id} created.")

# Define the schema for your table
schema = [
    bigquery.SchemaField("popularity", "FLOAT"),
    bigquery.SchemaField("loudness", "FLOAT"),
    bigquery.SchemaField("explicit", "FLOAT"),
    bigquery.SchemaField("danceability", "FLOAT"),
    bigquery.SchemaField("time_signature", "FLOAT"),
    bigquery.SchemaField("tempo", "FLOAT"),
    bigquery.SchemaField("energy", "FLOAT"),
    bigquery.SchemaField("key", "FLOAT"),
    bigquery.SchemaField("liveness", "FLOAT"),
    bigquery.SchemaField("duration_ms", "FLOAT"),
    bigquery.SchemaField("mode", "FLOAT"),
    bigquery.SchemaField("acousticness", "FLOAT"),
    bigquery.SchemaField("valence", "FLOAT"),
    bigquery.SchemaField("speechiness", "FLOAT"),
    bigquery.SchemaField("instrumentalness", "FLOAT"),
]

# Create dataset (comment out the line below if the dataset already exists)
create_dataset(PROJECT_ID, dataset_id)

# Create table
create_table(PROJECT_ID, dataset_id, table_id, schema)

client = bigquery.Client(project=PROJECT_ID)
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1  # Skip header row if present

with open("processed_dataset.csv", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()  # Wait for the job to complete
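
Because the load job skips the header row and uses an explicit schema, the CSV columns are matched to the schema fields by position, not by name. The sketch below compares a CSV header against the schema order before loading. The helper `header_mismatches` is hypothetical, and the in-memory sample stands in for processed_dataset.csv.

```python
import csv
import io

# Column order taken from the schema defined in the cell above.
SCHEMA_COLUMNS = [
    "popularity", "loudness", "explicit", "danceability", "time_signature",
    "tempo", "energy", "key", "liveness", "duration_ms", "mode",
    "acousticness", "valence", "speechiness", "instrumentalness",
]


def header_mismatches(csv_file, expected_columns):
    """Return (position, expected, found) tuples where the header differs."""
    header = next(csv.reader(csv_file), [])
    mismatches = []
    for index in range(max(len(header), len(expected_columns))):
        expected = expected_columns[index] if index < len(expected_columns) else None
        found = header[index] if index < len(header) else None
        if expected != found:
            mismatches.append((index, expected, found))
    return mismatches


# Stand-in for open("processed_dataset.csv", newline=""): two columns swapped.
sample = io.StringIO("loudness,popularity\n0.5,0.3\n")
print(header_mismatches(sample, ["popularity", "loudness"]))
```

An empty list means the header order agrees with the schema; any entry points at a column that would otherwise be loaded into the wrong field.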

Here we create the TabularDataset on Vertex AI from the pandas DataFrame, using the BigQuery table as the staging location

In [9]:
import pandas as pd
df = pd.read_csv('processed_dataset.csv')
dataset_id = "dai56_automl_v1"
table_id = "spotify_tracks_dai56_automl_v1"

dataset = aiplatform.TabularDataset.create_from_dataframe(
    display_name="NOAA historical weather data_unique",
    staging_path=f"bq://{PROJECT_ID}.{dataset_id}.{table_id}",
    df_source=df,
)


print(dataset.resource_name)

Creating TabularDataset
Create TabularDataset backing LRO: projects/999338700669/locations/europe-west1/datasets/2005491616877379584/operations/8966005501850550272
TabularDataset created. Resource name: projects/999338700669/locations/europe-west1/datasets/2005491616877379584
To use this TabularDataset in another session:
ds = aiplatform.TabularDataset('projects/999338700669/locations/europe-west1/datasets/2005491616877379584')
projects/999338700669/locations/europe-west1/datasets/2005491616877379584
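
The two string formats that appear in this step are the `bq://project.dataset.table` staging path and the `projects/.../locations/.../datasets/...` resource name printed in the output. The sketch below builds the first and splits the second; both helper names are our own and purely illustrative. Note that the resource name in the output carries the numeric project number rather than the project ID.

```python
def bq_uri(project_id, dataset_id, table_id):
    """Build the bq:// URI form used for staging_path in the notebook."""
    return f"bq://{project_id}.{dataset_id}.{table_id}"


def parse_resource_name(resource_name):
    """Split a 'projects/P/locations/L/datasets/ID' style name into a dict."""
    parts = resource_name.split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"Unexpected resource name: {resource_name!r}")
    # Even positions are collection names, odd positions are their IDs.
    return dict(zip(parts[0::2], parts[1::2]))


print(bq_uri("automl-408813", "dai56_automl_v1", "spotify_tracks_dai56_automl_v1"))
print(parse_resource_name(
    "projects/999338700669/locations/europe-west1/datasets/2005491616877379584"
))
```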


Now we create the AutoML tabular training job

In [14]:
# ds = aiplatform.TabularDataset('projects/999338700669/locations/europe-west1/datasets/7927725126869581824')
TRANSFORMATIONS = [
    {"auto": {"column_name": "instrumentalness"}},
    {"auto": {"column_name": "loudness"}},
    {"auto": {"column_name": "valence"}},
    {"auto": {"column_name": "time_signature"}},
]
label_column = "danceability"

job = aiplatform.AutoMLTabularTrainingJob(
    display_name=display_name,
    optimization_prediction_type="regression",
    optimization_objective="minimize-rmse",
    column_transformations=TRANSFORMATIONS,
)

print(job)

<google.cloud.aiplatform.training_jobs.AutoMLTabularTrainingJob object at 0x0000027BBA894340>
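
The transformation list above names only four columns. As far as we understand the Vertex AI behaviour, input columns without a transformation are ignored during training, so these four are the only features the model sees. A minimal sketch for generating the same structure from a list of column names, using a hypothetical helper `auto_transformations`:

```python
def auto_transformations(column_names, target_column):
    """Build an 'auto' transformation entry for every non-target column."""
    return [
        {"auto": {"column_name": name}}
        for name in column_names
        if name != target_column
    ]


# The same four features as in the cell above, plus the label column,
# which must not receive a transformation.
columns = ["instrumentalness", "loudness", "valence", "time_signature", "danceability"]
TRANSFORMATIONS = auto_transformations(columns, target_column="danceability")
print(TRANSFORMATIONS)
```

Passing more column names here would be one way to let the model train on more of the table, at the cost of having to supply those columns at prediction time as well.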


This is where we train the model, which takes about 2 hours. We set the budget to 1 node hour (= 1000 milli node hours) because beyond that point the model improves so slowly that extra budget makes almost no difference. Note that, according to the Vertex AI documentation, 1000 milli node hours is also the minimum budget for AutoML tabular training, so the budget cannot be set lower than this for testing purposes

In [15]:
model = job.run(
    dataset=dataset,
    model_display_name=display_name,
    training_fraction_split=0.6,
    validation_fraction_split=0.2,
    test_fraction_split=0.2,
    budget_milli_node_hours=1000,
    disable_early_stopping=False,
    target_column=label_column,
)

View Training:
https://console.cloud.google.com/ai/platform/locations/europe-west1/training/6686081967433187328?project=999338700669
AutoMLTabularTrainingJob projects/999338700669/locations/europe-west1/trainingPipelines/6686081967433187328 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/999338700669/locations/europe-west1/trainingPipelines/6686081967433187328 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/999338700669/locations/europe-west1/trainingPipelines/6686081967433187328 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/999338700669/locations/europe-west1/trainingPipelines/6686081967433187328 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/999338700669/locations/europe-west1/trainingPipelines/6686081967433187328 current state:
PipelineState.PIPELINE_STATE_RUNNING
AutoMLTabularTrainingJob projects/999338700669/locations/europe-wes
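
The budget and split values passed to job.run() can be checked up front, which is cheaper than finding a typo after submitting a long training run. This is a sketch under assumptions: the budget limits are the ones we believe the Vertex AI documentation states for AutoML tabular training, the check that the fractions sum to exactly 1 follows this notebook's values and may be stricter than the service requires, and both helper names are our own.

```python
import math

MILLI_NODE_HOURS_PER_NODE_HOUR = 1000
# Assumed limits for AutoML tabular budgets; check the current Vertex AI docs.
MIN_BUDGET_MILLI_NODE_HOURS = 1000
MAX_BUDGET_MILLI_NODE_HOURS = 72000


def node_hours_to_milli(node_hours):
    """Convert node hours to the milli node hours expected by job.run()."""
    return round(node_hours * MILLI_NODE_HOURS_PER_NODE_HOUR)


def check_training_config(training, validation, test, budget_milli_node_hours):
    """Raise ValueError if the splits do not sum to 1 or the budget is out of range."""
    total = training + validation + test
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Split fractions sum to {total}, expected 1.0")
    if not (MIN_BUDGET_MILLI_NODE_HOURS
            <= budget_milli_node_hours
            <= MAX_BUDGET_MILLI_NODE_HOURS):
        raise ValueError(f"Budget {budget_milli_node_hours} is outside the assumed range")


budget = node_hours_to_milli(1)
check_training_config(0.6, 0.2, 0.2, budget)
print(budget)
```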

Now we fetch the model evaluations and print them. We can see that the model is less effective than our local model by a fairly big margin, although the upside is that all we had to do was provide the data

In [16]:
model_evaluations = model.list_model_evaluations()
if len(model_evaluations) > 0:
    eval_res = model_evaluations[0].to_dict()
    evaluation_metrics = eval_res["metrics"]
    print(evaluation_metrics)

{'rootMeanSquaredError': 0.14260665, 'meanAbsoluteError': 0.11392803, 'meanAbsolutePercentageError': 14580.551, 'rSquared': 0.34267348, 'rootMeanSquaredLogError': 0.0927907}
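
To make the metric names above concrete, here is a plain-Python sketch of the usual textbook definitions of RMSE, MAE, MAPE and R squared. We assume, but have not verified, that Vertex AI computes them the same way. The toy numbers are made up and are not taken from our dataset; they only show how a single target value close to zero can push MAPE into the thousands while the other metrics stay small, which is one plausible explanation for the very large meanAbsolutePercentageError above.

```python
import math


def regression_metrics(actual, predicted):
    """Compute RMSE, MAE, MAPE (in percent) and R squared for two lists."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # Undefined when an actual value is exactly zero.
        "mape": 100 * sum(abs(e / a) for a, e in zip(actual, errors)) / n,
        "r_squared": 1 - ss_res / ss_tot,
    }


# Toy values (not from the notebook's data): one target is close to zero.
actual = [0.5, 0.6, 0.001]
predicted = [0.55, 0.55, 0.101]
print(regression_metrics(actual, predicted))
```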


Now we deploy the model to an endpoint and define the machine type it will run on

In [17]:
endpoint = model.deploy(machine_type="n1-standard-4")

Creating Endpoint
Create Endpoint backing LRO: projects/999338700669/locations/europe-west1/endpoints/3437328435125420032/operations/1435423974933659648
Endpoint created. Resource name: projects/999338700669/locations/europe-west1/endpoints/3437328435125420032
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/999338700669/locations/europe-west1/endpoints/3437328435125420032')
Deploying model to Endpoint : projects/999338700669/locations/europe-west1/endpoints/3437328435125420032
Deploy Endpoint model backing LRO: projects/999338700669/locations/europe-west1/endpoints/3437328435125420032/operations/5394088047392325632
Endpoint model deployed. Resource name: projects/999338700669/locations/europe-west1/endpoints/3437328435125420032


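And then we test it by providing a single row of data and asking for a prediction. The predicted value (0.344) is about 0.074 above the actual value (0.270), and the actual value lies within the returned lower and upper bounds, so the prediction is fairly accurate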

In [21]:
instances_list = [{"instrumentalness": 0.000707, "loudness": 0.5737010524758152, "valence": 0.1437185929648241, "time_signature": 3}]

prediction = endpoint.predict(instances_list)
print(prediction)
print('actual value: 0.2700507614213198')

Prediction(predictions=[{'value': 0.3443959355354309, 'lower_bound': 0.1398083716630936, 'upper_bound': 0.6657944917678833}], deployed_model_id='3472983398190940160', model_version_id='1', model_resource_name='projects/999338700669/locations/europe-west1/models/6610008819491667968', explanations=None)
actual value: 0.2700507614213198
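
The comparison between the prediction and the actual value can also be done in code instead of by eye. The sketch below works on a plain dict shaped like one entry of `predictions` in the output above; the helper `summarize_prediction` is hypothetical and the numbers are copied from that output.

```python
def summarize_prediction(prediction, actual):
    """Return the absolute error and whether the actual value is inside the bounds."""
    value = prediction["value"]
    return {
        "absolute_error": abs(value - actual),
        "within_bounds": prediction["lower_bound"] <= actual <= prediction["upper_bound"],
    }


# Values copied from the notebook output above.
prediction = {
    "value": 0.3443959355354309,
    "lower_bound": 0.1398083716630936,
    "upper_bound": 0.6657944917678833,
}
summary = summarize_prediction(prediction, actual=0.2700507614213198)
print(summary)
```

A single row says little about overall quality, so this is best read together with the evaluation metrics from the previous step.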


In a separate session we can reconnect to a deployed endpoint by its resource name and request a prediction

In [1]:
from google.cloud import aiplatform

instances_list = [{"instrumentalness": 0.000707, "loudness": 0.5737010524758152, "valence": 0.1437185929648241, "time_signature": 3}]

# Full resource name of a deployed endpoint
endpoint_location = "projects/999338700669/locations/europe-west1/endpoints/4367990261247115264"

# Make a prediction request to the endpoint
endpoint = aiplatform.Endpoint(endpoint_location)
endpoint.predict(instances=instances_list)

Prediction(predictions=[{'value': 0.3497030436992645, 'lower_bound': 0.1352253556251526, 'upper_bound': 0.7275293469429016}], deployed_model_id='8996015022463254528', model_version_id='1', model_resource_name='projects/999338700669/locations/europe-west1/models/2298797338702905344', explanations=None)