This notebook is developed using the `Python 3 (Data Science)` kernel on an `ml.t3.medium` instance.
## Use case
The Auto MPG dataset contains records for each car by model year. With SageMaker Feature Store, we can manage how each car's features change over the years. There are 56 cars that have records in more than one year. We will create a feature group for the auto data, ingest the records from each car's first appearance, then update the feature group with new records year by year. After the data is ingested, we show how to access it for training and inference, and how to look back in time to retrieve features as of a given point in time.

In [None]:
import sagemaker
import sys

import boto3
import pandas as pd
import numpy as np
import io
import time
from time import gmtime, strftime, sleep
import datetime

role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = sess.boto_region_name
bucket = sess.default_bucket()
prefix = 'sagemaker-studio-book/chapter04'

Importing data from UCI

In [None]:
data_url='https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
col_names=['mpg','cylinders', 'displacement', 'horsepower', 'weight', 
           'acceleration', 'model_year', 'origin', 'car_name']

# Use a raw string for the regex separator so '\s' is not treated as a string escape
df=pd.read_csv(data_url, delimiter=r'\s+', header=None, names=col_names, na_values='?')

In [None]:
df.sort_values(by=['car_name', 'model_year'])
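
The use case depends on some cars appearing in more than one model year. As a rough sketch of how that count could be checked, the helper below groups `(car_name, model_year)` pairs and counts the names seen in more than one year. The function name `count_multi_year_cars` and the sample rows are illustrative only; with the notebook's data, one could pass `zip(df['car_name'], df['model_year'])` instead.

```python
from collections import defaultdict


def count_multi_year_cars(pairs):
    """Count car names that appear with more than one distinct model year."""
    years_by_car = defaultdict(set)
    for car_name, model_year in pairs:
        years_by_car[car_name].add(model_year)
    return sum(1 for years in years_by_car.values() if len(years) > 1)


# Illustrative rows only, not taken from the dataset.
sample_pairs = [
    ("car a", 70), ("car a", 71), ("car a", 71),
    ("car b", 70),
    ("car c", 75), ("car c", 78),
]
print(count_multi_year_cars(sample_pairs))
```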

In [None]:
df['car_name']=df['car_name'].astype('string')

In [None]:
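# Split the data into one DataFrame per model year and add an event_time column
d_df = {}
for yr in df['model_year'].unique():
    print(yr)
    # .copy() avoids pandas' SettingWithCopyWarning when adding the new column
    d_df[str(yr)]=df[df['model_year']==yr].copy()
    # event time: 08:00 on 1 January of the model year, as a Unix timestamp
    d_df[str(yr)]['event_time']=datetime.datetime(1900+yr, 1, 1, 8, 0, 0).timestamp()
#     print(d_df[str(yr)].shape)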

In [None]:
d_df['70'].head()
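
The `event_time` values above come from a naive `datetime`, which `timestamp()` interprets in the kernel's local time zone, so the numbers can differ between machines. A minimal sketch of a time-zone-explicit variant follows, assuming UTC is acceptable for event times; the helper name `event_time_for_model_year` is hypothetical.

```python
import datetime


def event_time_for_model_year(yr):
    """Return the Unix timestamp for 08:00 UTC on 1 January of year 1900 + yr."""
    moment = datetime.datetime(1900 + int(yr), 1, 1, 8, 0, 0,
                               tzinfo=datetime.timezone.utc)
    return moment.timestamp()


print(event_time_for_model_year(70))
```

If a variant like this is used for ingestion, any cutoff used in a later point-in-time query should be built the same way so that the comparison stays consistent.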

## Create a feature group
We start by creating a unique, timestamped feature group name for the auto-mpg data.

In [None]:
timestamp=strftime('%Y-%m-%d-%H-%M-%S', gmtime())

feature_group_name = 'auto-mpg-%s' % timestamp

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup

feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sess)

In [None]:
record_identifier_feature_name = 'car_name'
event_time_feature_name = 'event_time'

In [None]:
feature_group.load_feature_definitions(data_frame=d_df['70'])

In [None]:
description='This feature group tracks vehicle information such as mpg and horsepower between 1970 and 1982.'
len(description)
# description must be at most 128 characters

In [None]:
feature_group.create(
    s3_uri=f's3://{bucket}/{prefix}',
    enable_online_store=True,
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    description=description,
    role_arn=role)

In [None]:
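def check_feature_group_status(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group to be Created")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    # Report success only if creation actually succeeded
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}: status {status}")
    print(f"FeatureGroup {feature_group.name} successfully created.")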

check_feature_group_status(feature_group)
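
The status check above waits for as long as the status stays `Creating`. As a hedged sketch, the variant below adds a timeout and takes the describe call as a parameter so it can be exercised without AWS. The name `wait_until_created` and its parameters are hypothetical; with the notebook's objects, one would pass `feature_group.describe`.

```python
import time


def wait_until_created(describe, poll_seconds=5, timeout_seconds=600):
    """Poll describe() until FeatureGroupStatus leaves 'Creating'.

    Raises RuntimeError on any final status other than 'Created',
    and TimeoutError if the status is still 'Creating' at the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe().get("FeatureGroupStatus")
        if status == "Created":
            return status
        if status != "Creating":
            raise RuntimeError(f"Feature group ended in status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Timed out waiting for feature group creation")
        time.sleep(poll_seconds)


# Stand-in for feature_group.describe, used here only to exercise the loop.
_responses = iter([{"FeatureGroupStatus": "Creating"},
                   {"FeatureGroupStatus": "Created"}])
print(wait_until_created(lambda: next(_responses), poll_seconds=0))
```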

## Ingest data into a feature group

In [None]:
for yr, df_auto in d_df.items():
    print(yr)
    print(df_auto.shape)
    feature_group.ingest(data_frame=df_auto, max_workers=1, max_processes = 1, wait=True)

In [None]:
car_name = 'amc concord'
featurestore_runtime =  sess.boto_session.client(service_name='sagemaker-featurestore-runtime', 
                                                 region_name=region)
sample_record = featurestore_runtime.get_record(
            FeatureGroupName=feature_group_name, 
            RecordIdentifierValueAsString=car_name)

In [None]:
sample_record

To ingest features for a record in a streaming fashion, we can use the `put_record` API of the `sagemaker-featurestore-runtime` boto3 client to write a single record, as shown in the following snippet.
```python
record = [{'FeatureName': 'mpg', 
           'ValueAsString': str(mpg)},
          {'FeatureName':'cylinders', 
           'ValueAsString': str(cylinders)},
          {'FeatureName':'displacement', 
           'ValueAsString': str(displacement)}, 
          {'FeatureName': 'horsepower', 
           'ValueAsString': str(horsepower)},
          {'FeatureName': 'weight', 
           'ValueAsString': str(weight)},
          {'FeatureName': 'acceleration', 
           'ValueAsString': str(acceleration)},
          {'FeatureName': 'model_year', 
           'ValueAsString': str(model_year)},
          {'FeatureName': 'origin', 
           'ValueAsString': str(origin)},
          {'FeatureName': 'car_name', 
           'ValueAsString': str(car_name)},
          {'FeatureName': 'event_time', 
           'ValueAsString': str(int(round(time.time())))}]
featurestore_runtime.put_record(FeatureGroupName=feature_group_name, 
                                Record=record)
```                                
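
Writing each `FeatureName`/`ValueAsString` pair by hand is error-prone, so the record list could instead be built from a dictionary. The sketch below shows one way to do that; the helper name `to_feature_record` and the sample values are hypothetical, and skipping `None` values is an assumption that leaving a feature out is preferable to sending the string `'None'`.

```python
import time


def to_feature_record(features):
    """Convert a {feature_name: value} mapping into a list of
    {'FeatureName': ..., 'ValueAsString': ...} dictionaries."""
    return [{"FeatureName": name, "ValueAsString": str(value)}
            for name, value in features.items()
            if value is not None]  # assumption: omit missing values


# Illustrative values only.
row = {
    "mpg": 19.4, "cylinders": 6, "horsepower": None,
    "car_name": "example car", "event_time": int(round(time.time())),
}
record = to_feature_record(row)
print(record[:2])
```

The resulting list has the same shape as the `record` variable in the snippet above and could be passed as the `Record` argument of `put_record`.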

## Accessing an offline store – building a dataset for analysis and training
SageMaker automatically synchronizes features from the online store to the offline store. It can take up to 15 minutes for the offline store to be populated. If you run the query below right after ingestion, `dataset` may be empty; if so, wait a few minutes and run it again.

In [None]:
query = feature_group.athena_query()
table_name = query.table_name

query_string = ('SELECT * FROM "%s"' % table_name)
print('Running ' + query_string)

query.run(query_string=query_string,
          output_location=f's3://{bucket}/{prefix}/query_results/')
query.wait()
dataset = query.as_dataframe()

In [None]:
dataset.head()

In [None]:
dataset.shape

In [None]:
query_string_2 = '''
SELECT * FROM "%s" WHERE model_year < 79
''' % table_name
print('Running ' + query_string_2)

query.run(
        query_string=query_string_2,
        output_location=f's3://{bucket}/{prefix}/query_results/')
query.wait()
dataset_2 = query.as_dataframe()

In [None]:
dataset_2.shape

In [None]:
dataset_2.head()

In [None]:
query_string_3='''
SELECT *
FROM
    (SELECT *,
         row_number() OVER (
             PARTITION BY car_name
             ORDER BY event_time DESC, Api_Invocation_Time DESC, write_time DESC
         ) AS row_number
    FROM "%s"
    WHERE event_time < %.f)
WHERE row_number = 1
  AND NOT is_deleted
''' % (table_name, datetime.datetime(1979, 1, 1, 8, 0, 0).timestamp())

print('Running ' + query_string_3)

query.run(
        query_string=query_string_3,
        output_location=f's3://{bucket}/{prefix}/query_results/')
query.wait()
dataset_3 = query.as_dataframe()

In [None]:
dataset_3.shape

In [None]:
dataset_2[dataset_2['car_name']=='amc gremlin']

In [None]:
dataset_3[dataset_3['car_name']=='amc gremlin']
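
The third query keeps, for each car, only the most recent record whose `event_time` is earlier than the cutoff, and then drops cars whose most recent record is marked as deleted. The same logic can be sketched in plain Python on a few made-up rows. The function name `point_in_time_snapshot` is hypothetical, and as a simplification the sketch orders by `event_time` only, ignoring the `Api_Invocation_Time` and `write_time` tie-breakers.

```python
def point_in_time_snapshot(rows, cutoff):
    """Return the latest row per car_name with event_time < cutoff.

    Mirrors the query's order of operations: pick the newest row per car
    first, then drop cars whose newest row is marked as deleted.
    """
    latest = {}
    for row in rows:
        if row["event_time"] >= cutoff:
            continue
        current = latest.get(row["car_name"])
        if current is None or row["event_time"] > current["event_time"]:
            latest[row["car_name"]] = row
    return {name: row for name, row in latest.items() if not row["is_deleted"]}


# Made-up rows for illustration; event_time values are arbitrary numbers.
rows = [
    {"car_name": "car a", "mpg": 20.0, "event_time": 100, "is_deleted": False},
    {"car_name": "car a", "mpg": 22.0, "event_time": 200, "is_deleted": False},
    {"car_name": "car a", "mpg": 25.0, "event_time": 300, "is_deleted": False},
    {"car_name": "car b", "mpg": 30.0, "event_time": 150, "is_deleted": True},
]
snapshot = point_in_time_snapshot(rows, cutoff=250)
print(snapshot["car a"]["mpg"])
```

This is the difference between `dataset_2` and `dataset_3`: a plain filter returns every record before the cutoff, while the point-in-time query returns one record per car.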

## Accessing an online store – low-latency feature retrieval

In [None]:
car_name = 'amc gremlin'
featurestore_runtime =  sess.boto_session.client(service_name='sagemaker-featurestore-runtime', 
                                                 region_name=region)
amc_gremlin = featurestore_runtime.get_record(
    FeatureGroupName=feature_group_name, RecordIdentifierValueAsString=car_name)

amc_gremlin['Record']

In [None]:
car_names = ['amc gremlin', 'amc concord', 'dodge colt']
feature_names = ['cylinders', 'displacement', 'horsepower']
sample_batch_records=featurestore_runtime.batch_get_record(
   Identifiers=[
     {
       'FeatureGroupName': feature_group_name,
       'RecordIdentifiersValueAsString': car_names,
       'FeatureNames': feature_names
     },
   ]
)
sample_batch_records['Records'][0]['Record'] # indexing first record
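
Both `get_record` and `batch_get_record` return features as lists of `FeatureName`/`ValueAsString` pairs, which are awkward to pass to a model directly. The sketch below flattens them into dictionaries. The helper names are hypothetical, and `sample_response` is a hand-written stand-in that mimics the response shape with placeholder values.

```python
def record_to_dict(record):
    """Flatten [{'FeatureName': ..., 'ValueAsString': ...}, ...] into a dict."""
    return {item["FeatureName"]: item["ValueAsString"] for item in record}


def batch_to_dicts(response):
    """Map each record identifier in a batch response to its features."""
    return {
        entry["RecordIdentifierValueAsString"]: record_to_dict(entry["Record"])
        for entry in response.get("Records", [])
    }


# Hand-written stand-in for a response; the values are placeholders.
sample_response = {
    "Records": [
        {"FeatureGroupName": "example-group",
         "RecordIdentifierValueAsString": "car a",
         "Record": [{"FeatureName": "cylinders", "ValueAsString": "4"},
                    {"FeatureName": "horsepower", "ValueAsString": "90.0"}]},
    ],
    "Errors": [],
    "UnprocessedIdentifiers": [],
}
features = batch_to_dicts(sample_response)
print(features["car a"])
```

The values come back as strings, so they would need to be cast to numeric types before being used for inference.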

Uncomment the line in the last cell to delete the feature group if it is no longer needed.

In [None]:
# feature_group.delete()