# Centralized Feature Repository with Amazon SageMaker Feature Store

#### Amazon SageMaker Feature Store is a managed repository with capabilities to store, update, retrieve, and share features. SageMaker Feature Store provides the ability to reuse the engineered features in two different scenarios. First, the features can be shared between the training and inference phases of a single ML project, resulting in consistent model inputs and reduced training-serving skew. Second, features from SageMaker Feature Store can be shared across multiple ML projects and teams.

## Creating feature groups

#### In Amazon SageMaker Feature Store, features are stored in a collection called a feature group. A feature group, in turn, is composed of records of features and feature values. Each record is a collection of feature values, identified by a unique RecordIdentifier value. Every record belonging to a feature group will use the same feature as RecordIdentifier. For example, the record identifier for the feature group created for the weather data could be parameter_id or location_id. Think of RecordIdentifier as a primary key for the feature group. Using this primary key, you can query feature groups for the fast lookup of features. It's also important to note that each record of a feature group must, at a minimum, contain a RecordIdentifier and an event time feature. The event time feature is identified by EventTimeFeatureName when a feature group is set up. When a feature record is ingested into a feature group, SageMaker adds three features – is_deleted, api_invocation_time, and write_time – for each feature record. is_deleted is used to manage the deletion of records, api_invocation_time is the time when the API call is invoked to write a record to a feature store, and write_time is the time when the feature record is persisted to the offline store.

#### While each feature group is managed and scaled independently, you can search and discover features from multiple feature groups as long as the appropriate access is in place.

#### When you create a feature group with SageMaker, you can choose to enable an offline store, an online store, or both. When both online and offline stores are enabled, the service replicates the online store contents into the offline store maintained in Amazon S3.

In [None]:
import boto3
import pandas as pd
import numpy as np
import io
import sagemaker
import sys
import json
import time
from time import gmtime, strftime, sleep

from sagemaker.session import Session
from sagemaker import get_execution_role

from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_group import FeatureDefinition
from sagemaker.feature_store.feature_group import FeatureTypeEnum

prefix = 'sagemaker-featurestore-weather'
role = get_execution_role()

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
s3_bucket_name = sagemaker_session.default_bucket()

#Create the service clients
sagemaker_fs_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime')
sagemaker_runtime = sagemaker_session.boto_session.client('sagemaker-runtime')
sagemaker_client = sagemaker_session.boto_session.client('sagemaker')
s3_client = boto3.client('s3', region_name=region)

In [None]:
#Feature group name
location_feature_group_name_offline = 'location-feature-group-offline-' + strftime('%d-%H-%M-%S', gmtime())
location_feature_group_name_online = 'location-feature-group-online-' + strftime('%d-%H-%M-%S', gmtime())
location_feature_group_name_offline_online = 'location-feature-group-offline-online-' + strftime('%d-%H-%M-%S', gmtime())

##Create FeatureDefinitions
fd_location=FeatureDefinition(feature_name='location', feature_type=FeatureTypeEnum('Fractional'))
fd_value=FeatureDefinition(feature_name='city', feature_type=FeatureTypeEnum('Fractional'))
fd_is_mobile=FeatureDefinition(feature_name='ismobile', feature_type=FeatureTypeEnum('Integral'))
fd_source_name=FeatureDefinition(feature_name='sourcename', feature_type=FeatureTypeEnum('Fractional'))
fd_source_type=FeatureDefinition(feature_name='sourcetype', feature_type=FeatureTypeEnum('Fractional'))
fd_event_time=FeatureDefinition(feature_name='EventTime', feature_type=FeatureTypeEnum('Fractional'))

location_feature_definitions = []
location_feature_definitions.append(fd_location)
location_feature_definitions.append(fd_value)
location_feature_definitions.append(fd_is_mobile)
location_feature_definitions.append(fd_source_name)
location_feature_definitions.append(fd_source_type)
location_feature_definitions.append(fd_event_time)

weather_feature_definitions = []
weather_feature_definitions.append(fd_location)
weather_feature_definitions.append(fd_event_time)

##Define unique identifier
record_identifier_feature_name = "location"


#Create offline feature group
location_feature_group_offline = FeatureGroup(name=location_feature_group_name_offline, 
                                     feature_definitions=location_feature_definitions,
                                     sagemaker_session=sagemaker_session)

location_feature_group_offline.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name="location",
    event_time_feature_name="EventTime",
    role_arn=role,
    tags=[{'Key':'project','Value':'weather-prediction'}]
)

#Describe the feature group
location_feature_group_offline.describe()


#Create offline + online feature group
#Note the usage of enable_online_store parameter
location_feature_group_offline_online = FeatureGroup(name=location_feature_group_name_offline_online, 
                                     feature_definitions=location_feature_definitions,
                                     sagemaker_session=sagemaker_session)

location_feature_group_offline_online.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name="location",
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True,
    tags=[{'Key':'project','Value':'weather-prediction'}]
)

#Describe the feature group
location_feature_group_offline_online.describe()

#Create online feature group
#Note s3_uri flag set to False for the online only FG
location_feature_group_online = FeatureGroup(name=location_feature_group_name_online, 
                                     feature_definitions=location_feature_definitions,
                                     sagemaker_session=sagemaker_session)

location_feature_group_online.create(
    s3_uri=False,
    record_identifier_name="location", 
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True,
    tags=[{'Key':'project','Value':'weather-prediction'}]
)

#Describe the feature group
location_feature_group_online.describe()

#  List all featuregroups
sagemaker_client.list_feature_groups()

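##Create a record to ingest into the feature group
##PutRecord writes to the online store; the record is then replicated to the offline store.
##Note: the feature group must have reached the 'Created' status before put_record succeeds.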
record = []

event_time_feature = {'FeatureName': 'EventTime','ValueAsString': str(int(round(time.time())))}
location_feature =   {'FeatureName': 'location','ValueAsString': str('250')}
ismobile_feature =   {'FeatureName': 'ismobile','ValueAsString': str('1')}
city_feature =      {'FeatureName': 'city','ValueAsString': str('12')}
sourcename_feature =      {'FeatureName': 'sourcename','ValueAsString': str('2.0')}
sourcetype_feature =      {'FeatureName': 'sourcetype','ValueAsString': str('2.0')}

record.append(event_time_feature)
record.append(location_feature)
record.append(ismobile_feature)
record.append(city_feature)
record.append(sourcename_feature)
record.append(sourcetype_feature)

response = sagemaker_fs_runtime_client.put_record(FeatureGroupName=location_feature_group_name_offline_online, 
                                                  Record=record)

response
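
#### Feature group creation is asynchronous, so a `put_record` call issued right after `create()` can fail while the group is still in the `Creating` state. The following is a minimal sketch of a polling helper. The function name `wait_for_feature_group_creation` and the polling defaults are illustrative assumptions; the only thing it relies on is the `FeatureGroupStatus` field returned by `describe()`.

```python
import time


def wait_for_feature_group_creation(feature_group, poll_seconds=5, max_attempts=60):
    """Poll describe() until the feature group leaves the 'Creating' state.

    `feature_group` is any object whose describe() returns a dict containing
    'FeatureGroupStatus' (as the SageMaker SDK FeatureGroup does).
    The function name and polling defaults are illustrative assumptions.
    """
    for _ in range(max_attempts):
        status = feature_group.describe().get("FeatureGroupStatus")
        if status == "Created":
            return status
        if status != "Creating":
            # For example 'CreateFailed': stop early and surface the status.
            raise RuntimeError(f"Feature group creation ended with status: {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("Feature group was still being created after polling")
```

#### For example, `wait_for_feature_group_creation(location_feature_group_offline_online)` could be called before the first `put_record`.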

#### The get_record and batch_get_record APIs should be used with online stores. Additionally, since the underlying storage for an offline store is an S3 bucket, you can query the offline store directly using Athena or other ways of accessing S3, for example with an Athena query that retrieves all feature records from the S3 bucket supporting the offline store.
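
#### As a hedged sketch of such a query, the helper below uses the `athena_query()` object that the SageMaker SDK exposes on a `FeatureGroup`. The function name `query_all_offline_records` and the `output_location` argument are assumptions made for illustration, and records may take several minutes to appear in the offline store after ingestion.

```python
def query_all_offline_records(feature_group, output_location):
    """Run SELECT * against the offline store table and return a DataFrame.

    `feature_group` is a SageMaker SDK FeatureGroup with an offline store.
    `output_location` is an S3 URI where Athena writes its result files.
    The function name is a hypothetical helper, not part of the SDK.
    """
    query = feature_group.athena_query()
    # table_name is the Glue table created for the offline store.
    query_string = f'SELECT * FROM "{query.table_name}"'
    query.run(query_string=query_string, output_location=output_location)
    query.wait()  # block until Athena finishes the query
    return query.as_dataframe()
```

#### With the objects defined earlier, a call could look like `query_all_offline_records(location_feature_group_offline_online, f"s3://{s3_bucket_name}/{prefix}/query_results/")`, where the `query_results` prefix is an arbitrary choice.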

## Online Vs Offline Feature Store
![My Image](/Users/maukanmir/Documents/Machine-Learning/AI-ML-Textbooks/AI-ML-Learning/images/feature-store.jpg)


## Populating feature groups

#### After creating the feature groups, you will populate them with features. You can ingest features into a feature group using either batch ingestion or streaming ingestion, as shown in Figure 5.5:

#### To ingest features into the feature store, you create a feature pipeline that can populate the feature store. A feature pipeline can include any service or capability that accepts raw data and then transforms that raw data into engineered features and puts the features in a designated feature group. Features can be ingested either in bulk in batches or streamed individually. The PutRecord API call is the core SageMaker API for ingesting features. This is used for both online and offline feature stores as well as ingesting through batch or streaming methods.
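
#### To make the PutRecord format concrete, here is a minimal sketch that converts plain rows into the list of `FeatureName`/`ValueAsString` pairs the API expects and writes them one at a time. The helper names `row_to_record` and `ingest_rows` are hypothetical, and skipping `None` values is an assumption about how missing features should be handled.

```python
def row_to_record(row):
    """Convert a {feature_name: value} mapping into the PutRecord 'Record' format."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # assumption: features with no value are omitted
    ]


def ingest_rows(fs_runtime_client, feature_group_name, rows):
    """Write each row with PutRecord and return the number of rows written.

    `fs_runtime_client` is a boto3 'sagemaker-featurestore-runtime' client.
    """
    count = 0
    for row in rows:
        fs_runtime_client.put_record(
            FeatureGroupName=feature_group_name,
            Record=row_to_record(row),
        )
        count += 1
    return count
```

#### The same helper serves batch and streaming pipelines; only the caller changes (a processing job looping over a file, or an event handler passing a single row).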

#### For batch ingestion, you can author features (for example, using Amazon SageMaker Data Wrangler) and ingest features in batches using a SageMaker Processing job. This allows batch ingestion into the offline store and the online store. For streaming ingestion, records can be pushed synchronously using the PutRecord API call. When ingesting records to the online feature store, you maintain only the latest feature values for a given record identifier. Historical values are only maintained in the replicated offline store if the feature group is configured for both online and offline stores. Figure 5.6 outlines the methods to ingest features as they relate to the online and offline feature stores:

## Here are the high-level steps involved in the batch ingestion architecture:

- Bulk raw data is available in an S3 bucket.

- The Amazon SageMaker Processing job takes raw data as input and applies feature engineering techniques to the data. The processing job can be configured to run on a distributed cluster of instances to process data at scale.

- The processing job also ingests the engineered features into the online store of the feature group, using the PutRecord API. Features are then automatically replicated to the offline store of the feature group.

- Features from the offline store can then be used for training other models and by other data science teams to address a wide variety of other use cases. Features from the online store can be used for feature lookup during real-time predictions.

- Note that if the feature store used in this architecture is offline only, the processing job can directly write into the offline store using the PutRecord API.

## Here are the high-level steps involved in the streaming ingestion architecture:

- Raw data lands in an S3 bucket, which triggers an AWS Lambda function.
- The Lambda function processes data and inserts features into the online store of the feature group, using the PutRecord API.
- Features are then automatically replicated to the offline store of the feature group.
- Features from the offline store can then be used for training other models and by other data science teams to address a wide variety of other use cases. Features from the online store can be used for feature lookup during real-time predictions.
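
#### The streaming steps above can be sketched as a handler that reacts to an S3 event notification. This is an illustration under stated assumptions, not a reference implementation: the raw object is assumed to be a small UTF-8 CSV file whose column names match the feature names, the feature engineering step is omitted, and the function name `handle_s3_event` is hypothetical. The clients are passed in as arguments so that the logic can be exercised without AWS access.

```python
import csv
import io
import time
from urllib.parse import unquote_plus


def handle_s3_event(event, s3_client, fs_runtime_client, feature_group_name):
    """Read each newly landed CSV object and write its rows with PutRecord.

    Returns the number of records written.
    """
    written = 0
    for notification in event.get("Records", []):
        bucket = notification["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(notification["s3"]["object"]["key"])
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for row in csv.DictReader(io.StringIO(body)):
            # Assumption: rows without an event time are stamped at ingestion.
            row.setdefault("EventTime", str(int(time.time())))
            record = [
                {"FeatureName": name, "ValueAsString": str(value)}
                for name, value in row.items()
            ]
            fs_runtime_client.put_record(
                FeatureGroupName=feature_group_name, Record=record
            )
            written += 1
    return written
```

#### In an actual Lambda function, a thin `lambda_handler(event, context)` wrapper would create the two boto3 clients once at module load and delegate to this function.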

#### In addition to using the ingestion APIs to populate the offline store, you can populate the underlying S3 bucket directly. If you don't need real-time inference and have huge volumes of historical feature data (hundreds of gigabytes or even terabytes) that you want to migrate to an offline feature store for training models, you can upload them directly to the underlying S3 bucket. To do this effectively, it is important to understand the S3 folder structure of the offline bucket. Feature groups in the offline store are organized in the following S3 structure:

In [2]:
destination = (
    "s3://<bucket-name>/<customer-prefix>/<account-id>/sagemaker/<aws-region>"
    "/offline-store/<feature-group-name>-<feature-group-creation-time>"
    "/data/year=<event-time-year>/month=<event-time-month>/day=<event-time-day>/hour=<event-time-hour>"
    "/<timestamp_of_latest_event_time_in_file>_<16-random-alphanumeric-digits>.parquet"
)

#### Also note that, when you use ingestion APIs, the features is_deleted, api_invocation_time, and write_time are included automatically in the feature record, but when you write directly to the offline store, you are responsible for including them.
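
#### As a rough sketch of what a direct write has to prepare, the helpers below build the partition path from an event time and add the three bookkeeping features to a row. The helper names, the zero-padded partition values, and the ISO-8601 timestamp format are assumptions for illustration; check them against files that the service itself has written before relying on them.

```python
from datetime import datetime, timezone


def offline_partition_prefix(base_prefix, event_time):
    """Build the year/month/day/hour partition path for an event time.

    `base_prefix` is everything up to and including
    <feature-group-name>-<feature-group-creation-time>.
    Zero-padded month/day/hour values are an assumption.
    """
    return (
        f"{base_prefix}/data/year={event_time:%Y}/month={event_time:%m}"
        f"/day={event_time:%d}/hour={event_time:%H}"
    )


def with_offline_metadata(row, now=None):
    """Return a copy of `row` with the three bookkeeping features added."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # assumed timestamp format
    return {
        **row,
        "write_time": stamp,
        "api_invocation_time": stamp,
        "is_deleted": False,
    }
```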

## Retrieving features from feature groups
- Once feature groups are populated, to retrieve features from the feature store, there are two APIs available – get_record and batch_get_record.

## Use get_record from the online store

In [None]:
record_identifier_value = '250'
response = sagemaker_fs_runtime_client.get_record(
    FeatureGroupName=location_feature_group_name_offline_online,
    RecordIdentifierValueAsString=record_identifier_value,
)
response

## Use batch_get_record

In [None]:
#Use batch_get_record
record_identifier_values = ["200", "250", "300"]
response=sagemaker_fs_runtime_client.batch_get_record(
    Identifiers=[
        {"FeatureGroupName": location_feature_group_name_offline_online, "RecordIdentifiersValueAsString": record_identifier_values}
    ]
)
response
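
#### Both retrieval APIs return features as lists of `FeatureName`/`ValueAsString` pairs, which usually need to be flattened before they reach a model. A minimal sketch of that step follows; the helper names are hypothetical, and identifiers that do not appear under `Records` in the response are simply left out of the result.

```python
def record_to_dict(record):
    """Flatten a list of FeatureName/ValueAsString pairs into a dict."""
    return {feature["FeatureName"]: feature["ValueAsString"] for feature in record}


def parse_get_record(response):
    """Return the features of a get_record response, or None if nothing was found."""
    record = response.get("Record")
    return record_to_dict(record) if record else None


def parse_batch_get_record(response):
    """Map each returned record identifier value to its feature dict."""
    return {
        item["RecordIdentifierValueAsString"]: record_to_dict(item["Record"])
        for item in response.get("Records", [])
    }
```

#### Note that every value comes back as a string, so numeric features need to be cast before use.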

## Creating reusable features to reduce feature inconsistencies and inference latency

#### One of the challenges data scientists face is the long data processing time – hours and sometimes days – necessary for preparing features to be used for ML training. Additionally, the data processing steps applied in feature engineering need to be applied to the inference requests during prediction time, which increases the inference latency. Each data science team will need to spend this data processing time even when they use the same raw data for different models. In this section, we will discuss best practices to address these challenges by using Amazon SageMaker Feature Store.

#### For use cases that require low-latency features for inference, an online feature store should be configured, and it's generally recommended to enable both the online and offline feature store. A feature store enabled with both online and offline stores allows you to reuse the same feature values for the training and inference phases. This configuration reduces the inconsistencies between the two phases and minimizes training and inference skew. In this mode, to populate the store, ingest features into the online store using either batch or streaming ingestion.

#### As you ingest features into an online store, SageMaker automatically replicates feature values to an offline store, continuously appending the latest values. It's important to note that for the online feature store, only the most current feature record is maintained and the PutRecord API is always processed as insert/upsert. This is key because if you need to update a feature record, the process to do so is to re-insert or overlay the existing record. This is to allow the retrieval of features with the minimum possible latency for inference use cases.

#### Although the online feature store maintains only the latest record, the offline store will provide a full history of feature values over time. Records will stay in the offline store until they are explicitly removed. As a result, you should establish a process to prune unnecessary records in the offline feature store using the standard mechanisms provided for S3 archival.
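
#### One way to apply S3 archival to the offline store is a lifecycle rule on the offline store prefix. The sketch below builds such a rule and applies it with the boto3 S3 client. The helper names, the 365-day threshold, and the choice of the `GLACIER` storage class are assumptions for illustration. Lifecycle rules act on object age rather than on feature event time, and archived objects are generally no longer readable by Athena queries, so choose the threshold accordingly.

```python
def build_archive_lifecycle(prefix, days_until_archive=365, rule_id="archive-offline-features"):
    """Build a lifecycle configuration that archives objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_until_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


def apply_archive_lifecycle(s3_client, bucket, prefix, days_until_archive=365):
    """Apply the archive rule to `bucket` and return the configuration used.

    Caution: this call replaces any lifecycle configuration already set on
    the bucket, so merge with existing rules first if there are any.
    """
    config = build_archive_lifecycle(prefix, days_until_archive)
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
    return config
```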

#### Another best practice is to set up standards for versioning features. As features evolve, it is important to keep track of feature versions. Consider versioning at two levels – versions of the feature group itself and versions of features within a feature group. You need to create a new version of the feature group when the schema of the features changes, such as when feature definitions need to be added or deleted.

#### At the time of this book's publication, feature groups are immutable. To add or remove features, you will need to create a new feature group. To address the requirement of multiple versions of a feature group with different numbers of features, establish and stick to naming conventions. For example, you could create a weather-conditions-v1 feature group initially. When that feature group needs to be updated, you can create a new weather-conditions-v2 feature group. You can also consider adding descriptive labels on data readiness or usage, such as weather-conditions-latest-v2 or weather-conditions-stable-v2. You can also tag feature groups to provide metadata. Finally, establish standards for how many concurrent versions to support and when to deprecate old versions.
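
#### A naming convention like this is easier to keep consistent when names are generated rather than typed by hand. The following is a small sketch; the helper names and the exact `<base>-<label>-v<N>` pattern are illustrative assumptions modeled on the example names above.

```python
import re


def versioned_name(base, version, label=None):
    """Build a feature group name such as 'weather-conditions-stable-v2'."""
    parts = [base] + ([label] if label else []) + [f"v{version}"]
    return "-".join(parts)


def next_version_name(current_name):
    """Return the name for the next version of an existing feature group."""
    match = re.fullmatch(r"(.+)-v(\d+)", current_name)
    if not match:
        raise ValueError(f"Name does not end with a version suffix: {current_name}")
    return f"{match.group(1)}-v{int(match.group(2)) + 1}"
```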

#### For the versioning of the individual features, the offline store keeps a history of all values of the features in a feature group. Each feature record is required to have an event time, which supports the ability to access feature versions by date. To retrieve earlier values of features from the offline store, use an Athena query that filters on a specific timestamp.
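
#### A hedged sketch of such a point-in-time query is shown below. It builds the SQL text only; the function name and its parameters are assumptions, the event time is assumed to be stored as a numeric epoch value (as in the feature definitions used earlier in this notebook), and deleted records are not filtered out. For each record identifier it keeps the most recent row whose event time is at or before the requested timestamp.

```python
def build_point_in_time_query(table_name, record_identifier, event_time_feature, as_of):
    """Build an Athena query returning the latest row per record as of `as_of`.

    `as_of` is an epoch timestamp in the same unit as the event time feature.
    """
    return (
        "SELECT * FROM ("
        f'SELECT *, row_number() OVER (PARTITION BY "{record_identifier}" '
        f'ORDER BY "{event_time_feature}" DESC, write_time DESC) AS row_num '
        f'FROM "{table_name}" WHERE "{event_time_feature}" <= {float(as_of)}'
        ") AS ranked WHERE row_num = 1"
    )
```

#### The resulting string can be passed to Athena, for instance through the `run()` method of the object returned by the SDK's `FeatureGroup.athena_query()`, using that object's `table_name` as the table.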

## Designing solutions for near real-time ML predictions

#### Sometimes machine learning applications demand high-throughput updates to features and near real-time access to the updated features. Timely access to fast-changing features is critical for the accuracy of predictions made by these applications. As an example, consider a machine learning application in a call center that predicts how to route the incoming customer calls to available agents. This application needs to have knowledge of the customer's latest web session clicks to make accurate routing decisions. If you capture a customer's web-click behavior as features, the features need to be updated instantly and the application needs access to the updated features in near-real time. Similarly, for weather prediction problems, you may want to capture the weather measurement features frequently for accurate weather predictions and need the ability to look up features in real time.

#### Let's look at some best practices in designing a reliable solution that meets the requirement of high-throughput writes and low-latency reads. At a high level, this solution will couple streaming ingestion into a feature group with streaming predictions. We will discuss the best practices to apply to ingestion into and serving from a feature store.

#### For ingesting features, the decision to choose between batch and streaming ingestion should be based on how often feature values in the feature store need to be updated for use by downstream training or inference. While simple machine learning models may need features from a single feature group, if you are working with data from multiple sources, you will find yourself using features from multiple feature groups. Some of these features need to be updated on a periodic basis (hourly, daily, weekly) and others must be streamed in near-real time.

#### Feature update frequency and inference access patterns should also be used as a consideration for creating different feature groups and isolating features. By isolating features that need to be inserted on different schedules, the ingestion throughput for streaming features can be improved independently. However, retrieving values from multiple feature groups increases the number of API calls and can increase overall retrieval times.

#### Your solution needs to balance feature isolation and retrieval performance. If your models require features from a large number of different feature groups at inference, design the solution to utilize larger feature groups or to retrieve from the feature store in parallel to meet the near real-time SLAs for predictions. For example, if your model requires features from three feature groups for inference, you can issue three API calls to get the feature record data in parallel before merging that data for model inference. This can be done through a typical inference workflow executing through an AWS service such as AWS Step Functions. Optionally, if that same set of features is always used together for inference, you may want to consider combining those into a single feature group.
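
#### The parallel lookup can also be sketched in-process with a thread pool, which is a simpler stand-in for the Step Functions workflow mentioned above rather than a replacement for it. The function name is hypothetical, and the sketch assumes that all feature groups share the same record identifier value; if two groups contain a feature with the same name, the group listed later wins in the merged result.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_features_in_parallel(fs_runtime_client, feature_group_names, record_identifier_value):
    """Call get_record on several feature groups concurrently and merge the results."""

    def fetch(feature_group_name):
        response = fs_runtime_client.get_record(
            FeatureGroupName=feature_group_name,
            RecordIdentifierValueAsString=record_identifier_value,
        )
        # 'Record' is absent when no record exists for the identifier.
        return {
            feature["FeatureName"]: feature["ValueAsString"]
            for feature in response.get("Record", [])
        }

    merged = {}
    workers = max(1, len(feature_group_names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so merging is deterministic.
        for features in pool.map(fetch, feature_group_names):
            merged.update(features)
    return merged
```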

