In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Using Vertex AI Feature Store with Pandas Dataframe

<table align="left">
    <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/feature_store/sdk-feature-store-pandas.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> 
        Run in Colab
    </a>
  </td>
    
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/feature_store/sdk-feature-store-pandas.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/feature_store/sdk-feature-store-pandas.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

This notebook introduces Pandas support for Vertex AI Feature Store in the Vertex AI SDK for Python. For prerequisites and an introduction to the SDK's native Feature Store support, see this [Colab notebook](https://colab.sandbox.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/feature_store/sdk-feature-store.ipynb). 

Learn more about [Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore).

### Objective

In this notebook, you learn how to use `Vertex AI Feature Store` with Pandas DataFrames.

This tutorial uses the following Google Cloud ML services and resources:

- Vertex AI Feature Store

The steps performed include:

- Create a featurestore, entity types, and features.
- Ingest feature values from a Pandas DataFrame into the Feature Store's entity types.
- Read entity feature values from the online Feature Store into a Pandas DataFrame.
- Batch serve feature values from your Feature Store into a Pandas DataFrame.

You also learn how Vertex AI Feature Store can be useful in the below scenarios:

- Online serving with updated feature values.
- Point-in-time correctness to fetch feature values for training.

### Dataset

This tutorial is a part of the Feature Store tutorial notebooks. It uses a movie recommendation dataset as an example for demonstrating various functionalities of Feature Store. The original task is to train a model to predict if a user is going to watch a movie, and serve the model online.

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing),
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 

In [None]:
! pip install --quiet --upgrade google-cloud-aiplatform \
                                google-cloud-bigquery \
                                google-cloud-bigquery-storage \
                                avro \
                                pyarrow \
                                pandas \
                                fsspec \
                                gcsfs

### Colab only: Uncomment the following cell to restart the kernel

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [None]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to serve as a staging bucket for Vertex AI and to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

### Import libraries

In [None]:
import datetime

import pandas as pd
from avro.datafile import DataFileReader
from avro.io import DatumReader
from google.cloud import aiplatform

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and region.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Create a Feature Store

Vertex AI Feature Store is a centralized, organized repository for your ML features. You can store and serve features, and monitor aspects of them such as their distributions and drift. Learn more about [Vertex AI Feature Store data model](https://cloud.google.com/vertex-ai/docs/featurestore/concepts), and [benefits of Vertex AI Feature Store](https://cloud.google.com/vertex-ai/docs/featurestore/overview#benefits).

To begin this tutorial, you create a feature store using the Vertex AI SDK for Python. A feature store is a top-level container for your features and their values. For this, you use the [`Featurestore.create()`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Featurestore) method, which returns an LRO ([long-running operation](https://google.aip.dev/151)). An LRO starts an asynchronous job. LROs are returned for other API methods too, such as updating or deleting a featurestore. 

You pass the below parameters while creating the featurestore:

- `featurestore_id`: A unique name or id for your featurestore
- `online_store_fixed_node_count`: Config for online serving resources. The number of nodes will not scale automatically but can be scaled manually by providing different values when updating.

In [None]:
# set the id or name for the feature store
featurestore_id = "movie_predictions_unique"  # @param {type:"string"}

# Create featurestore
movie_predictions_feature_store = aiplatform.Featurestore.create(
    featurestore_id=featurestore_id, online_store_fixed_node_count=1
)
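
A malformed `featurestore_id` only fails once the create request reaches the service, so a quick local check can save a round trip. The sketch below assumes the naming rule is "up to 60 characters from `[a-z0-9_]`, not starting with a digit"; that rule is recalled from the API reference and should be confirmed there. The helper name `is_valid_featurestore_id` is hypothetical and not part of the SDK.

```python
import re

# Assumed naming rule (confirm against the current API reference):
# 1 to 60 characters from [a-z0-9_], and the first character is not a digit.
FEATURESTORE_ID_PATTERN = re.compile(r"[a-z_][a-z0-9_]{0,59}")


def is_valid_featurestore_id(candidate: str) -> bool:
    """Return True if `candidate` satisfies the assumed ID rule."""
    return FEATURESTORE_ID_PATTERN.fullmatch(candidate) is not None


print(is_valid_featurestore_id("movie_predictions_unique"))  # True
print(is_valid_featurestore_id("1st_store"))  # False: starts with a digit
print(is_valid_featurestore_id("Movie-Predictions"))  # False: uppercase and hyphen
```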

## Create Entity types

Using Vertex AI Feature Store, you can create and manage feature stores, entity types, and features. An entity type is a collection of semantically related features. You define your own entity types, based on the concepts that are relevant to your use case. For example, a movie service might have the entity types movie and user, which group related features that correspond to movies or users.

Learn more about [Entity types](https://cloud.google.com/vertex-ai/docs/featurestore/concepts#entity_type).

Entity types are created from a `Featurestore` object. Below, you create the entity types `users` and `movies` for the movie recommendation dataset.

You pass the following parameters while creating the entity types:

- `entity_type_id`: A unique name or id for your entity type.
- `description`: (Optional) Description for your entity type.

In [None]:
# Create users entity type
users_entity_type = movie_predictions_feature_store.create_entity_type(
    entity_type_id="users",
    description="Users entity",
)

# Create movies entity type
movies_entity_type = movie_predictions_feature_store.create_entity_type(
    entity_type_id="movies",
    description="Movies entity",
)

## Create Features

A feature is a measurable property or attribute of an entity type. For example, the movie entity type has features such as average_rating and title that track various properties of movies. Features are associated with entity types. 

Learn more about [Features](https://cloud.google.com/vertex-ai/docs/featurestore/concepts#feature).

Add the defined features to the entity types `users` and `movies` using the following methods.

### Add features using *`create_feature`* method

You provide the following parameters for creating features:

- `feature_id`: Resource name or an id for the Feature.
- `value_type`: Type of Feature value. One of BOOL, BOOL_ARRAY, DOUBLE, DOUBLE_ARRAY, INT64, INT64_ARRAY, STRING, STRING_ARRAY, BYTES.
- `description`: Description of the Feature.

In [None]:
# Create age feature
users_feature_age = users_entity_type.create_feature(
    feature_id="age",
    value_type="INT64",
    description="User age",
)

# Create gender feature
users_feature_gender = users_entity_type.create_feature(
    feature_id="gender",
    value_type="STRING",
    description="User gender",
)

# Create liked_genres feature
users_feature_liked_genres = users_entity_type.create_feature(
    feature_id="liked_genres",
    value_type="STRING_ARRAY",
    description="An array of genres this user liked",
)

### Add features using *`batch_create_features`* method

You can also add multiple features at a time using a config map in a dictionary format. For this, you use the `batch_create_features` method. 

Below, you define and create the `title`, `genres` and `average_rating` features.

In [None]:
# define the features
movies_feature_configs = {
    "title": {
        "value_type": "STRING",
        "description": "The title of the movie",
    },
    "genres": {
        "value_type": "STRING",
        "description": "The genre of the movie",
    },
    "average_rating": {
        "value_type": "DOUBLE",
        "description": "The average rating for the movie, range is [1.0-5.0]",
    },
}
# create the features
movie_features = movies_entity_type.batch_create_features(
    feature_configs=movies_feature_configs,
)
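
When a source DataFrame has many columns, writing the config map by hand gets tedious. The following is a minimal sketch of drafting a `batch_create_features`-style dictionary from column dtypes. The helper `draft_feature_configs`, the dtype-to-value-type mapping, and the sample data are all illustrative assumptions; in particular, any column that is not boolean, integer, or float is assumed to hold plain strings.

```python
import pandas as pd
from pandas.api import types as ptypes


def draft_feature_configs(df: pd.DataFrame, skip_columns: list) -> dict:
    """Draft a feature config map from column dtypes (hypothetical helper)."""
    configs = {}
    for column in df.columns:
        if column in skip_columns:
            continue  # e.g. the entity ID or timestamp column
        dtype = df[column].dtype
        if ptypes.is_bool_dtype(dtype):
            value_type = "BOOL"
        elif ptypes.is_integer_dtype(dtype):
            value_type = "INT64"
        elif ptypes.is_float_dtype(dtype):
            value_type = "DOUBLE"
        else:
            value_type = "STRING"  # assumption: remaining columns hold strings
        configs[column] = {
            "value_type": value_type,
            "description": f"Drafted from column {column}",
        }
    return configs


# Made-up sample data, only for illustration
sample_movies_df = pd.DataFrame(
    {
        "movie_id": ["movie_a", "movie_b"],
        "title": ["Title A", "Title B"],
        "average_rating": [4.5, 3.9],
        "num_ratings": [120, 98],
    }
)
drafted_configs = draft_feature_configs(sample_movies_df, skip_columns=["movie_id"])
print(drafted_configs)
```

Review the drafted dictionary before passing it as `feature_configs`; array-valued columns, for example, would need `STRING_ARRAY` or a similar type set by hand.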

## Ingest Feature values into Entity types from dataframes

A Feature Store captures feature values for a feature belonging to an entity type at a specific point in time. After ingesting your feature values to feature store, you can later `read` (online) or `batch serve` (offline) the feature values from the entity type. 

In this section, you learn how to ingest feature values from a [Pandas dataframe](https://pandas.pydata.org/) into an entity type. 

Learn more about [Feature values](https://cloud.google.com/vertex-ai/docs/featurestore/concepts#feature_value).

Note: You can also import feature values from BigQuery or Google Cloud Storage.

### Get movie recommendation data from source

Define the data sources for users and movies and copy them locally into **.avro** files.

In [None]:
# set the users file source
GCS_USERS_AVRO_URI = (
    "gs://cloud-samples-data-us-central1/vertex-ai/feature-store/datasets/users.avro"
)
# set the movies file source
GCS_MOVIES_AVRO_URI = (
    "gs://cloud-samples-data-us-central1/vertex-ai/feature-store/datasets/movies.avro"
)
# set the local file names
USERS_AVRO_FN = "users.avro"
MOVIES_AVRO_FN = "movies.avro"
# copy the files using gsutil
! gsutil cp $GCS_USERS_AVRO_URI $USERS_AVRO_FN
! gsutil cp $GCS_MOVIES_AVRO_URI $MOVIES_AVRO_FN

### Load data from avro files 

Load users and movies data from the downloaded avro files into Pandas dataframes.

In [None]:
# Define a class for reading the avro data


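class AvroReader:
    def __init__(self, data_file):
        # Keep only the path; the file is opened when the data is read
        self.data_file = data_file

    def to_dataframe(self):
        reader = DataFileReader(open(self.data_file, "rb"), DatumReader())
        try:
            records = list(reader)
        finally:
            reader.close()  # also closes the underlying file handle
        return pd.DataFrame.from_records(data=records)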

In [None]:
# Load users data from avro file
users_avro_reader = AvroReader(data_file=USERS_AVRO_FN)
users_source_df = users_avro_reader.to_dataframe()
users_source_df.head()

In [None]:
# Load movies data from avro file
movies_avro_reader = AvroReader(data_file=MOVIES_AVRO_FN)
movies_source_df = movies_avro_reader.to_dataframe()
movies_source_df.head()

### Ingest Feature values into Entity types

Load the feature values into the `users` entity type from the dataframe.

You provide the following parameters for ingesting the data:

- `feature_ids`: List of ids of the Feature to import values of. The Features must exist in the target EntityType, or the request will fail.
- `feature_time`: The source column that holds the Feature timestamp for all Feature values in each entity. It can also be a single Feature timestamp for all entities being imported.
- `df_source`: Pandas DataFrame containing the source data for ingestion.
- `entity_id_field`: Source column that holds entity IDs.

Learn more about the [`EntityType.ingest_from_df()`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.EntityType#google_cloud_aiplatform_EntityType_ingest_from_df) method.

In [None]:
# ingest the data for users
users_entity_type.ingest_from_df(
    feature_ids=["age", "gender", "liked_genres"],
    feature_time="update_time",
    df_source=users_source_df,
    entity_id_field="user_id",
)

Similarly, load the feature values for the `movies` entity type.

In [None]:
# ingest the data for movies
movies_entity_type.ingest_from_df(
    feature_ids=["average_rating", "title", "genres"],
    feature_time="update_time",
    df_source=movies_source_df,
    entity_id_field="movie_id",
)
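
Because the ingestion request fails when a requested column is absent, it can help to check the DataFrame locally first. The sketch below is one possible pre-flight check; `missing_ingest_columns` is a hypothetical helper (not part of the SDK) and the sample data is made up.

```python
import pandas as pd


def missing_ingest_columns(df, feature_ids, entity_id_field, feature_time):
    """Return the column names an ingestion call would need but `df` lacks."""
    required = list(feature_ids) + [entity_id_field]
    # feature_time may be a column name or a single datetime for all rows
    if isinstance(feature_time, str):
        required.append(feature_time)
    return [name for name in required if name not in df.columns]


# Made-up sample that deliberately lacks the liked_genres column
sample_users_df = pd.DataFrame(
    {
        "user_id": ["alice", "bob"],
        "age": [55, 34],
        "gender": ["Female", "Male"],
        "update_time": pd.to_datetime(["2021-08-20", "2021-08-20"], utc=True),
    }
)
missing = missing_ingest_columns(
    sample_users_df,
    feature_ids=["age", "gender", "liked_genres"],
    entity_id_field="user_id",
    feature_time="update_time",
)
print(missing)  # ['liked_genres']
```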

## Read Entity's feature values online from Feature Store

Feature Store allows online serving which lets you read feature values for small batches of entities. It is beneficial when you want to read values of selected features from an entity or multiple entities in an entity type.

Note: An entity is an instance of an entity type. For example, movie_01 and movie_02 are entities of the entity type movie.

Learn more about [Feature Store online serving](https://cloud.google.com/vertex-ai/docs/featurestore/serving-online).

### Read feature values for users

Call the [`EntityType.read()`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.EntityType#google_cloud_aiplatform_EntityType_read) method with the required entity ids from the entity type `users`.

In [None]:
# read the data to a dataframe
users_read_df = users_entity_type.read(
    entity_ids=["dave", "alice", "charlie", "bob", "eve"],
)
# display the dataframe
users_read_df.head()

### Read feature values for movies

Call the [`EntityType.read()`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.EntityType#google_cloud_aiplatform_EntityType_read) method with the required entity ids from the entity type `movies`.

In [None]:
# read the data to a dataframe
movies_read_df = movies_entity_type.read(
    entity_ids=["movie_01", "movie_02", "movie_03", "movie_04"],
    feature_ids=["title", "genres", "average_rating"],
)
# display the dataframe
movies_read_df.head()

## Batch serve feature values from Feature Store

Feature Store can also serve the feature values in large batches for high-throughput. Batch serving is typically used for training a model or batch prediction. 

Learn more about [Feature Store batch serving](https://cloud.google.com/vertex-ai/docs/featurestore/serving-batch#batch_serving_inputs).

In this section, you learn how to prepare training examples by using the Feature Store's batch serve function.

### Load read instances from source file

Define the source file path that consists of some samples with feature values. Here, you load some feature values from the `movie_prediction.csv` file in the dataset using Pandas. This data serves as read instances while calling the batch serve function.

In [None]:
# set the gcs source for samples
GCS_READ_INSTANCES_CSV_URI = "gs://cloud-samples-data-us-central1/vertex-ai/feature-store/datasets/movie_prediction.csv"

While loading the data, parse the `timestamp` column as a datetime field. This is because the feature store expects a timestamp field in the read instances when batch serving.

In [None]:
# load the data using pandas
read_instances_df = pd.read_csv(GCS_READ_INSTANCES_CSV_URI, parse_dates=["timestamp"])
# display the dataframe
read_instances_df.head()
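
To see what `parse_dates` changes, the sketch below reads the same small CSV twice from an in-memory string. The rows and the column layout (`users`, `movies`, `timestamp`) are made up for illustration and are only assumed to resemble the real file.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the read instances file
csv_text = (
    "users,movies,timestamp\n"
    "alice,movie_01,2021-04-15T08:28:14Z\n"
    "bob,movie_02,2021-04-15T08:28:14Z\n"
)

# Without parse_dates the timestamp column stays as text
raw_df = pd.read_csv(io.StringIO(csv_text))
# With parse_dates it becomes a timezone-aware datetime column
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

print(raw_df["timestamp"].dtype)
print(parsed_df["timestamp"].dtype)
```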

### Batch serve feature values from Feature Store

Serve the batch response to a dataframe using the `batch_serve_to_df` method by providing the following parameters:

- `serving_feature_ids`: A user defined dictionary to define the entity_types and their features for batch serve/read. The keys of the dictionary are the serving entity_type ids and the values are lists of serving feature ids in each entity_type.
        
- `read_instances_df`: A pandas DataFrame containing the read instances. Each read instance should consist of exactly one read timestamp and one or more entity IDs identifying entities of the corresponding EntityTypes whose Features are requested. Each output instance contains Feature values of requested entities concatenated together as of the read time. 

    An example read_instances_df may be:

    ```
    pd.DataFrame(
        data=[
            {
                "my_entity_type_id_1": "my_entity_type_id_1_entity_1",
                "my_entity_type_id_2": "my_entity_type_id_2_entity_1",
                "timestamp": "2020-01-01T10:00:00.123Z",
            }
        ],
    )
    ```
    An example batch_serve_output_df may be:

    ```
    pd.DataFrame(
        data=[
            {
                "my_entity_type_id_1": "my_entity_type_id_1_entity_1",
                "my_entity_type_id_2": "my_entity_type_id_2_entity_1",
                "foo": "feature_id_1_1_feature_value",
                "feature_id_1_2": "feature_id_1_2_feature_value",
                "feature_id_2_1": "feature_id_2_1_feature_value",
                "bar": "feature_id_2_2_feature_value",
                "timestamp": "2020-01-01T10:00:00.123Z",
            }
        ],
    )
    ```
        

Note: Calling the `batch_serve_to_df` method automatically creates and deletes a temporary BigQuery dataset in the same Google Cloud project, which is used as intermediary storage when batch serving feature values from the featurestore to a dataframe.

Learn more about [Feature Store batch serving to a dataframe](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Featurestore#google_cloud_aiplatform_Featurestore_batch_serve_to_df). 

In [None]:
# call the batch serve method
movie_predictions_df = movie_predictions_feature_store.batch_serve_to_df(
    serving_feature_ids={
        "users": ["age", "gender", "liked_genres"],
        "movies": ["title", "average_rating", "genres"],
    },
    read_instances_df=read_instances_df,
)
# display the dataframe
movie_predictions_df.head()
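
Read instances do not have to come from a CSV file. As a rough sketch, the DataFrame below is built by hand for this notebook's two entity types: one column per entity type ID (`users`, `movies`) plus a `timestamp` column. The entity pairs and the timestamp value are made up for illustration.

```python
import pandas as pd

# One row per read instance: an entity ID for each entity type and a read time
manual_read_instances_df = pd.DataFrame(
    data=[
        {"users": "alice", "movies": "movie_01", "timestamp": "2021-04-15T08:28:14Z"},
        {"users": "bob", "movies": "movie_04", "timestamp": "2021-04-15T08:28:14Z"},
    ]
)
# Convert the text timestamps to timezone-aware datetimes
manual_read_instances_df["timestamp"] = pd.to_datetime(
    manual_read_instances_df["timestamp"], utc=True
)

# Every read instance needs exactly one timestamp and no missing entity IDs
print(manual_read_instances_df.isna().any().any())  # False
```

A DataFrame shaped like this could then be passed as `read_instances_df` in place of the one loaded from Cloud Storage.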

## Read the latest feature values

In Feature Store, you access the latest or the last available feature values unless a specific time is provided. 

Now, you test this feature by ingesting new data to the entity types and reading it from the Feature Store.
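
The "latest value wins" behavior can be pictured locally with plain Pandas. This is only an analogy under made-up data, not how the service is implemented: each entity keeps its most recent value by timestamp.

```python
import pandas as pd

# Made-up ingestion history: movie_03 receives a newer rating later on
rating_history_df = pd.DataFrame(
    {
        "movie_id": ["movie_03", "movie_04", "movie_03"],
        "average_rating": [4.0, 4.1, 4.3],
        "update_time": pd.to_datetime(
            ["2021-01-01", "2021-01-01", "2021-06-01"], utc=True
        ),
    }
)

# Sort by time, then keep the last (most recent) row per entity
latest_ratings_df = (
    rating_history_df.sort_values("update_time")
    .groupby("movie_id", as_index=False)
    .last()
)
print(latest_ratings_df)
```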

### Ingest updated feature values

Now, you update the feature values by running the following cell. 

**Note:** For comparison, you can try printing the feature values read from the entity types earlier (those in `movies_read_df` variable). 

In [None]:
# Create a dataframe for the new data
update_movies_df = pd.DataFrame(
    data=[["movie_03", 4.3], ["movie_04", 4.8]],
    columns=["movie_id", "average_rating"],
)

# Ingest the new data from the dataframe
movies_entity_type.ingest_from_df(
    feature_ids=["average_rating"],
    feature_time=datetime.datetime.now(),  # provide the current timestamp
    df_source=update_movies_df,
    entity_id_field="movie_id",
)

### Fetch the latest feature values

Reading from the entity type gives you the updated feature values from the latest ingestion.

Running the below cell should fetch the latest values for all the requested entities, including `movie_03` and `movie_04`, whose `average_rating` you updated in the last cell.

In [None]:
# read the feature values from the entity type
update_movies_read_df = movies_entity_type.read(
    entity_ids=["movie_01", "movie_02", "movie_03", "movie_04"],
    feature_ids=["title", "genres", "average_rating"],
)
# display the dataframe
update_movies_read_df.head()
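
To compare values read before and after an update, a merge on the entity ID column is usually enough. The sketch below uses two small stand-in DataFrames with made-up ratings; it assumes the online read result exposes the ID in a column named `entity_id`, which you should verify against your own output.

```python
import pandas as pd

# Stand-ins for the earlier and the later online read results (made-up values)
before_df = pd.DataFrame(
    {"entity_id": ["movie_03", "movie_04"], "average_rating": [4.0, 4.1]}
)
after_df = pd.DataFrame(
    {"entity_id": ["movie_03", "movie_04"], "average_rating": [4.3, 4.1]}
)

# Align rows by entity and flag the ones whose rating changed
comparison_df = before_df.merge(
    after_df, on="entity_id", suffixes=("_before", "_after")
)
comparison_df["changed"] = (
    comparison_df["average_rating_before"] != comparison_df["average_rating_after"]
)
print(comparison_df)
```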

## Point-in-time correctness

Vertex AI Feature Store captures feature values for a feature at a [specific point in time](https://cloud.google.com/vertex-ai/docs/featurestore/serving-batch#example_point-in-time_lookup). In case there are missing values in your past data, you can backfill them using batch serving.
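
A point-in-time lookup can be approximated locally with `pandas.merge_asof`, which joins each read instance to the most recent feature value at or before its timestamp. This is a sketch of the idea with made-up data, not the service's implementation.

```python
import pandas as pd

# Made-up feature history for two movies
feature_history_df = pd.DataFrame(
    {
        "movie_id": ["movie_01", "movie_01", "movie_02"],
        "update_time": pd.to_datetime(
            ["2021-01-01", "2021-03-01", "2021-02-01"], utc=True
        ),
        "average_rating": [4.0, 4.5, 3.8],
    }
)
# Made-up read instances: which value was known at each timestamp?
lookup_instances_df = pd.DataFrame(
    {
        "movie_id": ["movie_01", "movie_02", "movie_01"],
        "timestamp": pd.to_datetime(
            ["2021-02-01", "2021-01-15", "2021-04-01"], utc=True
        ),
    }
)

# merge_asof requires both sides to be sorted by their time keys
point_in_time_df = pd.merge_asof(
    lookup_instances_df.sort_values("timestamp"),
    feature_history_df.sort_values("update_time"),
    left_on="timestamp",
    right_on="update_time",
    by="movie_id",
    direction="backward",  # only look at values at or before the read time
)
print(point_in_time_df[["movie_id", "timestamp", "average_rating"]])
```

The `movie_02` row has no rating because its only value was recorded after the requested time. Gaps of this kind are what a backfill addresses.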

### Check missing data
Recall that the batch serving response you obtained earlier has some missing data in it.

In [None]:
# check the missing data
movie_predictions_df.isna().sum()
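
Column totals show how much is missing but not where. The sketch below, run on a made-up stand-in for the batch serving output, selects the rows that contain gaps and lists the affected columns, which indicates the entities that need backfilling.

```python
import pandas as pd

# Made-up stand-in for a batch serving result with gaps
served_sample_df = pd.DataFrame(
    {
        "users": ["alice", "bob"],
        "movies": ["movie_01", "movie_04"],
        "age": [55.0, None],
        "title": ["Title A", None],
    }
)

# Rows with at least one missing value
rows_with_gaps_df = served_sample_df[served_sample_df.isna().any(axis=1)]
# Columns with at least one missing value
columns_with_gaps = served_sample_df.columns[served_sample_df.isna().any()].tolist()

print(rows_with_gaps_df[["users", "movies"]])
print(columns_with_gaps)  # ['age', 'title']
```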

### Backfill / correct point-in-time data

Impute the missing data based on the timestamps.

Note: The timestamp field must use the RFC 3339 format (e.g. 2012-07-30T10:43:17.123Z) or be compatible with the TIMESTAMP data type when loaded to BigQuery. This is because Feature Store loads the data into a temporary BigQuery table as an intermediate step when batch serving. Learn more about [loading data to BigQuery from dataframe](https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe).

In [None]:
# Impute the users data
backfill_users_df = pd.DataFrame(
    data=[["bob", 34, "Male", ["Drama"], "2020-02-13 09:35:15+00:00"]],
    columns=["user_id", "age", "gender", "liked_genres", "update_time"],
)
# convert the timefield to datetime64[ns] (with timezone info)
backfill_users_df["update_time"] = pd.to_datetime(backfill_users_df["update_time"])
# display the dataframe
backfill_users_df.head()

In [None]:
# Impute the movies data
backfill_movies_df = pd.DataFrame(
    data=[["movie_04", 4.2, "The Dark Knight", "Action", "2020-02-13 09:35:15+00:00"]],
    columns=["movie_id", "average_rating", "title", "genres", "update_time"],
)
# convert the timefield to datetime64[ns] (with timezone info)
backfill_movies_df["update_time"] = pd.to_datetime(backfill_movies_df["update_time"])
# display the dataframe
backfill_movies_df.head()
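
If your source timestamps arrive with mixed UTC offsets, normalizing them to UTC first keeps the column in a single timezone-aware dtype. The sketch below also renders RFC 3339 strings with millisecond precision, in case a text representation is needed; the input values are made up.

```python
import pandas as pd

# Made-up timestamps with different UTC offsets that denote the same instant
raw_times = pd.Series(["2020-02-13 09:35:15+00:00", "2020-02-13 11:35:15+02:00"])

# utc=True converts every value to UTC and yields a timezone-aware column
utc_times = pd.to_datetime(raw_times, utc=True)

# Render as RFC 3339 text: trim microseconds to milliseconds and append "Z"
rfc3339_strings = utc_times.dt.strftime("%Y-%m-%dT%H:%M:%S.%f").str[:-3] + "Z"
print(rfc3339_strings.tolist())
# ['2020-02-13T09:35:15.000Z', '2020-02-13T09:35:15.000Z']
```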

### Ingest the backfilled / corrected data

Ingest the imputed point-in-time data from dataframe to the entity types in feature store.

In [None]:
# Ingest the users data
users_entity_type.ingest_from_df(
    feature_ids=["age", "gender", "liked_genres"],
    feature_time="update_time",
    df_source=backfill_users_df,
    entity_id_field="user_id",
)

In [None]:
# Ingest the movies data
movies_entity_type.ingest_from_df(
    feature_ids=["average_rating", "title", "genres"],
    feature_time="update_time",
    df_source=backfill_movies_df,
    entity_id_field="movie_id",
)

### Fetch the latest data
Batch serve the feature values again to a dataframe to confirm that the backfilled data is now in the feature store. 

In [None]:
# batch serve the latest data to a dataframe
backfill_movie_predictions_df = movie_predictions_feature_store.batch_serve_to_df(
    serving_feature_ids={
        "users": ["age", "gender", "liked_genres"],
        "movies": ["title", "average_rating", "genres"],
    },
    read_instances_df=read_instances_df,
)
# display the dataframe
backfill_movie_predictions_df.head()

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Vertex AI Feature Store
- Cloud Storage bucket (set `delete_bucket` to True)

In [None]:
# Delete the feature store
movie_predictions_feature_store.delete(force=True)

# remove the local users and movies avro files
! rm {USERS_AVRO_FN} {MOVIES_AVRO_FN}

# Delete Cloud Storage objects that were created
import os  # needed for os.getenv below

delete_bucket = True
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI