### Creating features and training data with Amazon SageMaker Featurestore

This notebook demonstrates basic capabilities of Amazon Feature Store - creating a feature group, ingesting data into a feature group and retrieving features from the feature group to be used with a SageMaker training job and during inferences.  These basic capabilities are used to demonstrate batch ingestion of features into the feature store.

#### Feature Store Design

For the dataset used in this book, we will use two feature groups - location and weather data. 

The location feature group will have "location_id" as the record-identifier and   captures features related to the location such as city.  Weather data feature group will also have "location_id" as the record-identifier and captures the weather quality measurements such as pm25.  

Splitting up data into two different feature groups allows us to use the feature groups across multiple ML projects.  For example, features from both location and weather data feature groups are  used for a regression model to predict future weather measurements for a given location. On the other hand, features from the weather data feature group can also be used for a clustering model to find stations with similar measurements.

In this notebook we will show how to create the two feature groups, ingest features into the feature groups, retrieve data from feature groups to create training data, train and deploy the deploy and finally how to use features during inference.


### Overview
1. Set up
2. Create feature groups
3. Ingest data into feature groups
4. Retrive data from feature groups
5. Train and deploy the model
6. Use features from feature store during inference
7. Clean up

### 1. Set up

#### 1.1. Imports

In [2]:
import boto3
import pandas as pd
import numpy as np
import io
import sagemaker
import sys
import json
import time
from time import gmtime, strftime, sleep

from sagemaker.session import Session
from sagemaker import get_execution_role

from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_group import FeatureDefinition
from sagemaker.feature_store.feature_group import FeatureTypeEnum

from sagemaker.inputs import TrainingInput

from sagemaker.spark.processing import PySparkProcessor

#### 1.2 Install required version of sagemaker libraries

In [3]:
# SageMaker Python SDK version 2.x is required
original_version = sagemaker.__version__
%pip install 'sagemaker>=2.0.0'

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


#### 1.3 Setup variables

In [4]:
prefix = 'sagemaker-featurestore-weather'
role = get_execution_role()

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
s3_bucket_name = sagemaker_session.default_bucket()

#Set the s3_bucket to the correct bucket name created in your datascience environment
s3_bucket = 'datascience-environment-notebookinstance--06dc7a0224df'
s3_prefix = 'prepared'

#### 1.4 Setup service clients

In [5]:
#Create the service clients
sagemaker_fs_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime')
sagemaker_runtime = sagemaker_session.boto_session.client('sagemaker-runtime')
sagemaker_client = sagemaker_session.boto_session.client('sagemaker')
s3_client = boto3.client('s3', region_name=region)

### 2. Create feature groups

We will create two feature groups - location feature group and weather data feature group.  Location FG will capture location details and will have one row for each location.  Weather data feature group will capture weather measurements for each location and will have averaged measurements for each day.


In [6]:
#Feature group name
location_feature_group_name = 'location-feature-group-' + strftime('%d-%H-%M-%S', gmtime())

In [7]:
##Create FeatureDefinitions
fd_location=FeatureDefinition(feature_name='location', feature_type=FeatureTypeEnum('Fractional'))
fd_city=FeatureDefinition(feature_name='city', feature_type=FeatureTypeEnum('Fractional'))
fd_country=FeatureDefinition(feature_name='country', feature_type=FeatureTypeEnum('Fractional'))
fd_source_name=FeatureDefinition(feature_name='sourcename', feature_type=FeatureTypeEnum('Fractional'))
fd_source_type=FeatureDefinition(feature_name='sourcetype', feature_type=FeatureTypeEnum('Fractional'))
fd_event_time=FeatureDefinition(feature_name='EventTime', feature_type=FeatureTypeEnum('Fractional'))

location_feature_definitions = []
location_feature_definitions.append(fd_location)
location_feature_definitions.append(fd_city)
location_feature_definitions.append(fd_source_name)
location_feature_definitions.append(fd_country)
location_feature_definitions.append(fd_source_type)
location_feature_definitions.append(fd_event_time)

In [8]:
#Create offline + online feature group
location_feature_group = FeatureGroup(name=location_feature_group_name, 
                                     feature_definitions=location_feature_definitions,
                                     sagemaker_session=sagemaker_session)

location_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name='location',
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True,
    tags=[{'Key':'project','Value':'weather-prediction'}]
)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-west-2:802439482869:feature-group/location-feature-group-07-16-12-24',
 'ResponseMetadata': {'RequestId': 'ffe77a09-e5a9-424f-93b4-26a2a765e8fb',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ffe77a09-e5a9-424f-93b4-26a2a765e8fb',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '111',
   'date': 'Sat, 07 Aug 2021 16:12:26 GMT'},
  'RetryAttempts': 0}}

In [9]:
#Describe the feature group
location_feature_group.describe()

{'FeatureGroupArn': 'arn:aws:sagemaker:us-west-2:802439482869:feature-group/location-feature-group-07-16-12-24',
 'FeatureGroupName': 'location-feature-group-07-16-12-24',
 'RecordIdentifierFeatureName': 'location',
 'EventTimeFeatureName': 'EventTime',
 'FeatureDefinitions': [{'FeatureName': 'location',
   'FeatureType': 'Fractional'},
  {'FeatureName': 'city', 'FeatureType': 'Fractional'},
  {'FeatureName': 'sourcename', 'FeatureType': 'Fractional'},
  {'FeatureName': 'country', 'FeatureType': 'Fractional'},
  {'FeatureName': 'sourcetype', 'FeatureType': 'Fractional'},
  {'FeatureName': 'EventTime', 'FeatureType': 'Fractional'}],
 'CreationTime': datetime.datetime(2021, 8, 7, 16, 12, 26, 597000, tzinfo=tzlocal()),
 'OnlineStoreConfig': {'EnableOnlineStore': True},
 'OfflineStoreConfig': {'S3StorageConfig': {'S3Uri': 's3://sagemaker-us-west-2-802439482869/sagemaker-featurestore-weather',
   'ResolvedOutputS3Uri': 's3://sagemaker-us-west-2-802439482869/sagemaker-featurestore-weather/80

In [10]:
#Feature group name
weather_data_feature_group_name = 'weather-data-feature-group-' + strftime('%d-%H-%M-%S', gmtime())

In [11]:
##Create FeatureDefinitions
fd_value=FeatureDefinition(feature_name='value', feature_type=FeatureTypeEnum('Fractional'))
fd_is_bad_air=FeatureDefinition(feature_name='isbadair', feature_type=FeatureTypeEnum('Integral'))
fd_is_mobile=FeatureDefinition(feature_name='ismobile', feature_type=FeatureTypeEnum('Integral'))
fd_year=FeatureDefinition(feature_name='year', feature_type=FeatureTypeEnum('Integral'))
fd_month=FeatureDefinition(feature_name='month', feature_type=FeatureTypeEnum('Integral'))
fd_quarter=FeatureDefinition(feature_name='quarter', feature_type=FeatureTypeEnum('Integral'))
fd_day=FeatureDefinition(feature_name='day', feature_type=FeatureTypeEnum('Integral'))


fd_no2=FeatureDefinition(feature_name='no2', feature_type=FeatureTypeEnum('Fractional'))
fd_o3=FeatureDefinition(feature_name='o3', feature_type=FeatureTypeEnum('Fractional'))
fd_pm10=FeatureDefinition(feature_name='pm10', feature_type=FeatureTypeEnum('Fractional'))
fd_pm25=FeatureDefinition(feature_name='pm25', feature_type=FeatureTypeEnum('Fractional'))
fd_so2=FeatureDefinition(feature_name='so2', feature_type=FeatureTypeEnum('Fractional'))
fd_co=FeatureDefinition(feature_name='co', feature_type=FeatureTypeEnum('Fractional'))
fd_bc=FeatureDefinition(feature_name='bc', feature_type=FeatureTypeEnum('Fractional'))


fd_location=FeatureDefinition(feature_name='location', feature_type=FeatureTypeEnum('Fractional'))
fd_event_time=FeatureDefinition(feature_name='EventTime', feature_type=FeatureTypeEnum('Fractional'))

weather_data_feature_definitions = []
weather_data_feature_definitions.append(fd_value)
weather_data_feature_definitions.append(fd_is_bad_air)
weather_data_feature_definitions.append(fd_is_mobile)
weather_data_feature_definitions.append(fd_year)
weather_data_feature_definitions.append(fd_month)
weather_data_feature_definitions.append(fd_quarter)
weather_data_feature_definitions.append(fd_day)

weather_data_feature_definitions.append(fd_no2)
weather_data_feature_definitions.append(fd_o3)
weather_data_feature_definitions.append(fd_pm10)
weather_data_feature_definitions.append(fd_pm25)
weather_data_feature_definitions.append(fd_so2)
weather_data_feature_definitions.append(fd_co)
weather_data_feature_definitions.append(fd_bc)

weather_data_feature_definitions.append(fd_location)
weather_data_feature_definitions.append(fd_event_time)

In [12]:
#Create offline + online feature group
weather_data_feature_group = FeatureGroup(name=weather_data_feature_group_name, 
                                     feature_definitions=weather_data_feature_definitions,
                                     sagemaker_session=sagemaker_session)

weather_data_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}",
    record_identifier_name='location',
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True,
    tags=[{'Key':'project','Value':'weather-prediction'}]
)

{'FeatureGroupArn': 'arn:aws:sagemaker:us-west-2:802439482869:feature-group/weather-data-feature-group-07-16-12-28',
 'ResponseMetadata': {'RequestId': '10a15c92-f7f7-419d-8d91-2dcc0c7e8d3f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '10a15c92-f7f7-419d-8d91-2dcc0c7e8d3f',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '115',
   'date': 'Sat, 07 Aug 2021 16:12:29 GMT'},
  'RetryAttempts': 0}}

In [13]:
##Check status of the feature group
def check_feature_group_status(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group to be Created")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    print(f"FeatureGroup {feature_group.name} successfully created.")

In [14]:
##Wait till both feature groups are ready
check_feature_group_status(location_feature_group)
check_feature_group_status(weather_data_feature_group)

Waiting for Feature Group to be Created
Waiting for Feature Group to be Created
FeatureGroup location-feature-group-07-16-12-24 successfully created.
Waiting for Feature Group to be Created
FeatureGroup weather-data-feature-group-07-16-12-28 successfully created.


### 3. Ingest data into feature groups

#### 3.1 Examine data

In [15]:
##Get the first file name in the 'prefix' folder
def get_file_in_bucket(prefix,index):
    response = s3_client.list_objects(
        Bucket=s3_bucket,
        Prefix=s3_prefix+'/'+prefix
    )
    ## At '0' index you will find the SUCCESS/FAILURE of file uploades to S3. First data file is at index 1
    file_name = response['Contents'][index]['Key']
    print("Returing file name : " + file_name)
    return file_name

In [17]:
#Since we have large volumes of data and ingestion takes a long time, we are using just a single file to show 
#the feature store concepts to save time and cost.  If you would like to experiment with more data, you can extend the below 
#functionality to get multiple files from S3 bucket, loop through them and create a combined dataframe.
use_full_data = False
if use_full_data == False:
    #Read a sample csv from S3
    s3_path = "s3://{}/{}".format(s3_bucket, get_file_in_bucket('train',1))
    
print(s3_path)


Returing file name : prepared/train/part-00000-2554f113-947e-46bd-be31-9cd75cb4661c-c000.csv
s3://datascience-environment-notebookinstance--06dc7a0224df/prepared/train/part-00000-2554f113-947e-46bd-be31-9cd75cb4661c-c000.csv


In [18]:
prepared_data_df = pd.read_csv(s3_path)

#View the first few rows of the dataframe
prepared_data_df = pd.read_csv(s3_path,header=None)
prepared_data_df.head(100)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.0,0,2015,10,4,10,0,1210.0,731.0,10.0,9.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0,2015,10,4,10,0,1210.0,731.0,10.0,9.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0,2015,10,4,10,0,155.0,14.0,21.0,21.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0,2015,10,4,10,0,155.0,14.0,21.0,21.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0,2015,10,4,10,0,155.0,14.0,21.0,21.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.0,0,2015,10,4,10,0,349.0,823.0,8.0,8.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
96,0.0,0,2015,10,4,10,0,350.0,824.0,8.0,8.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
97,0.0,0,2015,10,4,10,0,350.0,824.0,8.0,8.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
98,0.0,0,2015,10,4,10,0,351.0,825.0,8.0,8.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


There are no headers in the prepared data.  But from the previous chapter we know the colummns represent 

value,ismobile,year,month,quarter,day,isBadAir,location,city,sourcename,country,sourcetype,o3,no2,so2,co,pm10,pm25,bc

So lets set columns on the dataframe

In [19]:
prepared_data_df.columns=['value','ismobile',
                          'year','month','quarter','day',
                          'isBadAir','location','city','country', 'sourcename','sourcetype',
                          'o3','no2','so2','pm10','pm25','co','bc']

In [20]:
prepared_data_df.shape

(772304, 19)

#### 3.2 Batch Ingest features into location and weather data FG using a processing job

##### Create a batch_ingestion.py file which will process the data and ingest into feature groups

In [21]:
%%writefile batch_ingestion.py
from pyspark.sql.types import StructField, StructType, StringType, DoubleType, TimestampType, LongType
from pyspark.sql.functions import desc, dense_rank
from pyspark.sql import SparkSession, DataFrame
from  argparse import Namespace, ArgumentParser
from pyspark.sql.window import Window
import argparse
import logging
import boto3
import time
import sys
import os

logger = logging.getLogger('sagemaker')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())


feature_store_client = boto3.client(service_name='sagemaker-featurestore-runtime')


def parse_args() -> Namespace:
    parser = ArgumentParser(description='Spark Job Input and Output Args')
    parser.add_argument('--s3_input_bucket', type=str, help='S3 Input Bucket')
    parser.add_argument('--s3_path', type=str, help='Path to files in S3')
    parser.add_argument('--weather_data_feature_group', type=str, help='Weather data feature group name')
    parser.add_argument('--location_feature_group', type=str, help='Location feature group name')
   
    args = parser.parse_args()
    return args

def write_to_weather_data_feature_store(args: Namespace, spark: SparkSession) -> None:
    weather_data_feature_group = args.weather_data_feature_group
        
    logger.info('[Writing to weather data feature store]')
    success, fail = 0, 0
    
    prepared_df = spark.read.csv(args.s3_path, \
                                header=True)
    
    logger.info('#### [got prepared_df ] #####')
    
    prepared_df_with_col_headers = prepared_df.toDF('value','ismobile',
                          'year','month','quarter','day',
                          'isBadAir','location','city','country', 'sourcename','sourcetype',
                          'o3','no2','so2','pm10','pm25','co','bc')    
    
    ##Prepare the dataframe for weather data : start from prepared data, avg out per day values for the measurements

    prev_year=""
    prev_month=""
    prev_quarter=""
    prev_day=""
    prev_location=""
    prev_ismobile=""

    total_value=0.0
    total_isBadAir=0
    total_o3=0.0
    total_no2=0.0
    total_pm10=0.0
    total_pm25=0.0
    total_so2=0.0
    total_co=0.0
    total_bc=0.0
    total_count = 0

    ##Use link to collect aggregated weather data
    weather_data=[]

    for row in prepared_df_with_col_headers.collect():
        
        current_year = row.year
        current_month = row.month
        current_quarter = row.quarter
        current_day = row.day
        current_location = row.location
        current_ismobile = row.ismobile

        if(current_location == prev_location and 
           current_year == prev_year and
           current_month == prev_month and 
           current_quarter == prev_quarter and
           current_day == prev_day and
           current_ismobile == prev_ismobile):
           #print("location - Same year, collecting for avg")

            total_value+=float(row['value'])
            total_isBadAir+=float(row['isBadAir'])
            ##Get the measurement value from series
            total_o3+=float(row['o3'])
            total_no2+=float(row['no2'])
            total_pm10+=float(row['pm10'])
            total_pm25+=float(row['pm25'])
            total_so2+=float(row['so2'])
            total_co+=float(row['co'])
            total_bc+=float(row['bc'])
            total_count +=1
        else:
            ##Add aggregated / averaged weather measurements to the list, before moving on to the next location
            if total_count != 0:
                #print("adding weather data to list")
                weather_data.append([prev_location,
                                     int(prev_year),
                                     int(prev_month),
                                     int(prev_quarter),
                                     int(prev_day),
                                     int(prev_ismobile),
                                     total_value/total_count,
                                     (int)(total_isBadAir/total_count),
                                     total_o3/total_count,
                                     total_no2/total_count,
                                     total_pm10/total_count,
                                     total_pm25/total_count,
                                     total_so2/total_count,
                                     total_co/total_count,
                                     total_bc/total_count
                                    ])

            total_value=0.0
            total_isBadAir=0
            total_o3=0.0
            total_no2=0.0
            total_pm10=0.0
            total_pm25=0.0
            total_so2=0.0
            total_co=0.0
            total_bc=0.0
            total_count = 0   

        prev_year=current_year
        prev_month=current_month
        prev_quarter=current_quarter
        prev_day=current_day
        prev_location=current_location
        prev_ismobile=current_ismobile

    ##Capture the last location
    if total_count != 0:
        weather_data.append([prev_location,
                                     int(prev_year),
                                     int(prev_month),
                                     int(prev_quarter),
                                     int(prev_day),
                                     int(prev_ismobile),
                                     total_value/total_count,
                                     (int)(total_isBadAir/total_count),
                                     total_o3/total_count,
                                     total_no2/total_count,
                                     total_pm10/total_count,
                                     total_pm25/total_count,
                                     total_so2/total_count,
                                     total_co/total_count,
                                     total_bc/total_count
                                    ])

    print(weather_data)

    #Create dataframe from weather data
    weather_columns=['location','year','month','quarter','day','ismobile','value','isbadair','o3','no2','pm10','pm25','so2','co','bc'] 
    
    weather_data_df = spark.createDataFrame(data=weather_data, schema = weather_columns)
        
    data_collect = weather_data_df.collect()
    
  
    #looping thorough each row of the dataframe
    for row in data_collect:
        #print(row)
        
        record = []
        event_time_feature = {'FeatureName': 'EventTime','ValueAsString': str(int(round(time.time())))}
        location_feature = {'FeatureName': 'location','ValueAsString': str(row['location'])}
        
        year_feature = {'FeatureName': 'year','ValueAsString': str(row['year'])}
        month_feature = {'FeatureName': 'month','ValueAsString': str(row['month'])}
        quarter_feature = {'FeatureName': 'quarter','ValueAsString': str(row['quarter'])}
        day_feature = {'FeatureName': 'day','ValueAsString': str(row['day'])}
        
        ismobile_feature = {'FeatureName': 'ismobile','ValueAsString': str(row['ismobile'])}
        value_feature = {'FeatureName': 'value','ValueAsString': str(row['value'])}
        isbadair_feature = {'FeatureName': 'isbadair','ValueAsString': str(row['isbadair'])}
        
        o3_feature = {'FeatureName': 'o3','ValueAsString': str(row['o3'])}
        no2_feature = {'FeatureName': 'no2','ValueAsString': str(row['no2'])}
        pm10_feature = {'FeatureName': 'pm10','ValueAsString': str(row['pm10'])}
        pm25_feature = {'FeatureName': 'pm25','ValueAsString': str(row['pm25'])}
        so2_feature = {'FeatureName': 'so2','ValueAsString': str(row['so2'])}
        co_feature = {'FeatureName': 'co','ValueAsString': str(row['co'])}
        bc_feature = {'FeatureName': 'bc','ValueAsString': str(row['bc'])}
       
    
        record.append(event_time_feature)
        record.append(location_feature)
        
        record.append(year_feature)
        record.append(month_feature)
        record.append(quarter_feature)
        record.append(day_feature)
        
        record.append(ismobile_feature)
        record.append(value_feature)
        record.append(isbadair_feature)
        
        record.append(o3_feature)
        record.append(no2_feature)
        record.append(pm10_feature)
        record.append(pm25_feature)
        record.append(so2_feature)
        record.append(co_feature)
        record.append(bc_feature)
        
        response = feature_store_client.put_record(FeatureGroupName=weather_data_feature_group, Record=record)
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:
            success += 1
        else:
            fail += 1
    
    logger.info('Success = {}'.format(success))
    logger.info('Fail = {}'.format(fail))

def write_to_location_feature_store(args: Namespace, spark: SparkSession) -> None:
    location_feature_group = 'location-feature-group-31-01-04-05' ##TODO : Need to make this a parameter
    location_feature_group = args.location_feature_group
    logger.info('[Writing to location feature store]')
    success, fail = 0, 0
    
    prepared_df = spark.read.csv(args.s3_path, \
                                header=True)
    
    logger.info('#### [got prepared_df ] #####')
    
    prepared_df_with_col_headers = prepared_df.toDF('value','ismobile',
                          'year','month','quarter','day',
                          'isBadAir','location','city','country', 'sourcename','sourcetype',
                          'o3','no2','so2','pm10','pm25','co','bc')
      
    #Extract unique locations from prepared_df
    unique_locations_df=prepared_df_with_col_headers.dropDuplicates(['location'])
    
    logger.info('#### [got unique locations ] #####')

    location_group_features = ['location','city','sourcename','country','sourcetype']
    unique_locations_df=unique_locations_df[location_group_features]

    logger.info("###### After reading the csv file")
    unique_locations_df.printSchema()
    logger.info("###### After priniting the schema")
    
    data_collect = unique_locations_df.collect()
    
    logger.info("###### After data_collect")    
    
    #looping thorough each row of the dataframe
    for row in data_collect:
        #print(row)
        
        record = []
        event_time_feature = {'FeatureName': 'EventTime','ValueAsString': str(int(round(time.time())))}
        location_feature = {'FeatureName': 'location','ValueAsString': str(row['location'])}
        city_feature = {'FeatureName': 'city','ValueAsString': str(row['city'])}
        country_feature = {'FeatureName': 'country','ValueAsString': str(row['country'])}
        source_type_feature = {'FeatureName': 'sourcetype','ValueAsString': str(row['sourcetype'])}
        source_name_feature = {'FeatureName': 'sourcename','ValueAsString': str(row['sourcename'])}
        
        record.append(event_time_feature)
        record.append(location_feature)
        record.append(city_feature)
        record.append(source_type_feature)
        record.append(country_feature)
        record.append(source_name_feature)
        
        response = feature_store_client.put_record(FeatureGroupName=location_feature_group, Record=record)
        if response['ResponseMetadata']['HTTPStatusCode'] == 200:
            success += 1
        else:
            fail += 1
    
    logger.info('Success = {}'.format(success))
    logger.info('Fail = {}'.format(fail))

def run_spark_job():
    spark = SparkSession.builder.appName('PySparkJob').getOrCreate()
    args = parse_args()
    write_to_location_feature_store(args,spark)
    write_to_weather_data_feature_store(args,spark)
    
    
    
if __name__ == '__main__':
    run_spark_job()

Writing batch_ingestion.py


In [22]:
s3_path

's3://datascience-environment-notebookinstance--06dc7a0224df/prepared/train/part-00000-2554f113-947e-46bd-be31-9cd75cb4661c-c000.csv'

#####  Define PySpark Processor

In [23]:
spark_processor = PySparkProcessor(base_job_name='sagemaker-processing', 
                                   framework_version='2.4', # spark version
                                   role=role, 
                                   instance_count=1, 
                                   instance_type='ml.r5.4xlarge', 
                                   env={'AWS_DEFAULT_REGION': boto3.Session().region_name},
                                   max_runtime_in_seconds=6000)

##### Execute the processing job

In [24]:
%%time
##With a single file to process, this step takes about 18 minutes on one 'ml.r5.4xlarge'
spark_processor.run(submit_app='batch_ingestion.py', 
                    arguments=['--s3_input_bucket', s3_bucket, 
                               '--s3_path', s3_path,
                               '--location_feature_group', location_feature_group_name,
                               '--weather_data_feature_group', weather_data_feature_group_name],
                    spark_event_logs_s3_uri='s3://{}/logs'.format(s3_bucket),
                    logs=True)


Job Name:  sagemaker-processing-2021-08-07-16-14-27-312
Inputs:  [{'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-802439482869/sagemaker-processing-2021-08-07-16-14-27-312/input/code/batch_ingestion.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://datascience-environment-notebookinstance--06dc7a0224df/logs', 'LocalPath': '/opt/ml/processing/spark-events/', 'S3UploadMode': 'Continuous'}}]
............................[34m08-07 16:18 smspark.cli  INFO     Parsing arguments. argv: ['/usr/local/bin/smspark-submit', '--local-spark-event-logs-dir', '/opt/ml/processing/spark-events/', '/opt/ml/processing/input/code/batch_ingestion.py', '--s3_input_bucket', 'datascience-environment-notebookinstance--06dc7a0224df', '--s3_path', 's3://datascie

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



[34m21/08/07 16:20:41 INFO spark.SparkContext: Starting job: collect at /opt/ml/processing/input/code/batch_ingestion.py:163[0m
[34m21/08/07 16:20:41 INFO scheduler.DAGScheduler: Got job 5 (collect at /opt/ml/processing/input/code/batch_ingestion.py:163) with 32 output partitions[0m
[34m21/08/07 16:20:41 INFO scheduler.DAGScheduler: Final stage: ResultStage 6 (collect at /opt/ml/processing/input/code/batch_ingestion.py:163)[0m
[34m21/08/07 16:20:41 INFO scheduler.DAGScheduler: Parents of final stage: List()[0m
[34m21/08/07 16:20:41 INFO scheduler.DAGScheduler: Missing parents: List()[0m
[34m21/08/07 16:20:41 INFO scheduler.DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[38] at collect at /opt/ml/processing/input/code/batch_ingestion.py:163), which has no missing parents[0m
[34m21/08/07 16:20:41 INFO memory.MemoryStore: Block broadcast_11 stored as values in memory (estimated size 9.9 KB, free 1007.2 MB)[0m
[34m21/08/07 16:20:41 INFO memory.MemoryStore: Block bro

### 4. Retrive data from feature groups

Retrive data from location and weather data feature groups and create training data.

#### Build Training Dataset

SageMaker FeatureStore automatically builds the Glue Data Catalog for FeatureGroups (you can optionally turn it on/off while creating the FeatureGroup). In this example, we want to create a training dataset with FeatureValues from both weather FeatureGroups. This is done by utilizing the auto-built Catalog. 

Run an Athena query that queries for the features stored in the offline store in S3 from the FeatureGroup.

When querying  from the offline store, note that the replication between online and offline store will take upto 15 minutes.


In [25]:
location_query = location_feature_group.athena_query()
location_table = location_query.table_name

#Query string 
query_string = 'SELECT * FROM "'+ location_table + '"'
print('Running ' + query_string)

Running SELECT * FROM "location-feature-group-07-16-12-24-1628352746"


In [26]:
# Run Athena query. The output is loaded to a Pandas dataframe.
location_query.run(query_string=query_string, output_location='s3://'+s3_bucket_name+'/'+prefix+'/query_results/')
location_query.wait()
location_df = location_query.as_dataframe()

location_df

Unnamed: 0,location,city,sourcename,country,sourcetype,eventtime,write_time,api_invocation_time,is_deleted
0,6228.0,672.0,5.0,5.0,0.0,1.628353e+09,2021-08-07 16:24:20.717,2021-08-07 16:19:26.000,False
1,6565.0,984.0,27.0,26.0,0.0,1.628353e+09,2021-08-07 16:24:20.717,2021-08-07 16:19:26.000,False
2,754.0,480.0,7.0,7.0,0.0,1.628353e+09,2021-08-07 16:24:20.717,2021-08-07 16:19:26.000,False
3,8020.0,315.0,6.0,6.0,0.0,1.628353e+09,2021-08-07 16:24:20.717,2021-08-07 16:19:27.000,False
4,8570.0,237.0,5.0,5.0,0.0,1.628353e+09,2021-08-07 16:24:20.717,2021-08-07 16:19:27.000,False
...,...,...,...,...,...,...,...,...,...
5239,7136.0,1141.0,28.0,27.0,0.0,1.628353e+09,2021-08-07 16:24:20.631,2021-08-07 16:20:11.000,False
5240,315.0,25.0,2.0,2.0,0.0,1.628353e+09,2021-08-07 16:24:20.631,2021-08-07 16:20:12.000,False
5241,7715.0,984.0,27.0,26.0,0.0,1.628353e+09,2021-08-07 16:24:20.631,2021-08-07 16:20:13.000,False
5242,8099.0,122.0,6.0,6.0,0.0,1.628353e+09,2021-08-07 16:24:20.631,2021-08-07 16:20:13.000,False


In [27]:
weather_data_query = weather_data_feature_group.athena_query()
weather_data_table = weather_data_query.table_name

#Query string 
query_string = 'SELECT * FROM "'+ weather_data_table + '"'
print('Running ' + query_string)

Running SELECT * FROM "weather-data-feature-group-07-16-12-28-1628352750"


In [28]:
# run Athena query. The output is loaded to a Pandas dataframe.
weather_data_query.run(query_string=query_string, output_location='s3://'+s3_bucket_name+'/'+prefix+'/query_results/')
weather_data_query.wait()
weather_df = weather_data_query.as_dataframe()

weather_df

Unnamed: 0,value,isbadair,ismobile,year,month,quarter,day,no2,o3,pm10,pm25,so2,co,bc,location,eventtime,write_time,api_invocation_time,is_deleted
0,0.0,0,0,2018,1,1,5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,250.0,1.628354e+09,2021-08-07 16:35:41.043,2021-08-07 16:30:41.000,False
1,0.0,0,0,2018,1,1,5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,55.0,1.628354e+09,2021-08-07 16:35:41.043,2021-08-07 16:30:41.000,False
2,0.0,0,0,2018,1,1,6,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1088.0,1.628354e+09,2021-08-07 16:35:41.043,2021-08-07 16:30:42.000,False
3,0.0,0,0,2018,1,1,6,1.0,0.0,0.0,0.0,0.0,0.0,0.0,250.0,1.628354e+09,2021-08-07 16:35:41.043,2021-08-07 16:30:42.000,False
4,0.0,0,0,2018,1,1,6,1.0,0.0,0.0,0.0,0.0,0.0,0.0,4538.0,1.628354e+09,2021-08-07 16:35:41.043,2021-08-07 16:30:44.000,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68308,0.0,0,0,2017,11,4,11,1.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,1.628354e+09,2021-08-07 16:26:21.023,2021-08-07 16:25:37.000,False
68309,0.0,0,0,2017,11,4,11,1.0,0.0,0.0,0.0,0.0,0.0,0.0,150.0,1.628354e+09,2021-08-07 16:26:21.023,2021-08-07 16:25:37.000,False
68310,0.0,0,0,2017,11,4,12,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1270.0,1.628354e+09,2021-08-07 16:26:21.023,2021-08-07 16:25:39.000,False
68311,0.0,0,0,2017,11,4,12,1.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,1.628354e+09,2021-08-07 16:26:21.023,2021-08-07 16:25:39.000,False


In [29]:
weather_df.shape

(68313, 19)

In [30]:
#Combine the location and weather dataframes for final training data
dfinal = weather_df.merge(location_df, on="location", how = 'inner')
dfinal

Unnamed: 0,value,isbadair,ismobile,year,month,quarter,day,no2,o3,pm10,...,api_invocation_time_x,is_deleted_x,city,sourcename,country,sourcetype,eventtime_y,write_time_y,api_invocation_time_y,is_deleted_y
0,0.0,0,0,2018,1,1,5,1.0,0.0,0.0,...,2021-08-07 16:30:41.000,False,745.0,9.0,10.0,2.0,1.628353e+09,2021-08-07 16:24:39.200,2021-08-07 16:19:53.000,False
1,0.0,0,0,2018,1,1,6,1.0,0.0,0.0,...,2021-08-07 16:30:42.000,False,745.0,9.0,10.0,2.0,1.628353e+09,2021-08-07 16:24:39.200,2021-08-07 16:19:53.000,False
2,0.0,0,0,2018,2,1,14,1.0,0.0,0.0,...,2021-08-07 16:30:55.000,False,745.0,9.0,10.0,2.0,1.628353e+09,2021-08-07 16:24:39.200,2021-08-07 16:19:53.000,False
3,0.0,0,0,2018,2,1,8,1.0,0.0,0.0,...,2021-08-07 16:31:11.000,False,745.0,9.0,10.0,2.0,1.628353e+09,2021-08-07 16:24:39.200,2021-08-07 16:19:53.000,False
4,0.0,0,0,2018,3,1,12,1.0,0.0,0.0,...,2021-08-07 16:31:21.000,False,745.0,9.0,10.0,2.0,1.628353e+09,2021-08-07 16:24:39.200,2021-08-07 16:19:53.000,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68308,0.0,0,0,2016,8,3,12,1.0,0.0,0.0,...,2021-08-07 16:23:25.000,False,318.0,20.0,22.0,0.0,1.628353e+09,2021-08-07 16:24:38.497,2021-08-07 16:19:41.000,False
68309,0.0,0,0,2016,8,3,15,1.0,0.0,0.0,...,2021-08-07 16:23:27.000,False,318.0,20.0,22.0,0.0,1.628353e+09,2021-08-07 16:24:38.497,2021-08-07 16:19:41.000,False
68310,0.0,0,0,2016,8,3,3,1.0,0.0,0.0,...,2021-08-07 16:23:39.000,False,318.0,20.0,22.0,0.0,1.628353e+09,2021-08-07 16:24:38.497,2021-08-07 16:19:41.000,False
68311,0.0,0,0,2016,8,3,9,1.0,0.0,0.0,...,2021-08-07 16:23:43.000,False,318.0,20.0,22.0,0.0,1.628353e+09,2021-08-07 16:24:38.497,2021-08-07 16:19:41.000,False


In [31]:
#Get the subset of training features
##For the regression problem, we will use "value" as the label.  
training_features = ["value", "ismobile", "year", "month", "day",
                   "location", "city", "sourcename", "sourcetype", 
                   "no2", "o3", "pm10", "pm25", "so2", "co"]
dfinal[training_features]

Unnamed: 0,value,ismobile,year,month,day,location,city,sourcename,sourcetype,no2,o3,pm10,pm25,so2,co
0,0.0,0,2018,1,5,250.0,745.0,9.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0,2018,1,6,250.0,745.0,9.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0,2018,2,14,250.0,745.0,9.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0,2018,2,8,250.0,745.0,9.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0,2018,3,12,250.0,745.0,9.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68308,0.0,0,2016,8,12,9941.0,318.0,20.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
68309,0.0,0,2016,8,15,9941.0,318.0,20.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
68310,0.0,0,2016,8,3,9941.0,318.0,20.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
68311,0.0,0,2016,8,9,9941.0,318.0,20.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


#### Split training data into training, and validation datasets

In [32]:
from sklearn.model_selection import train_test_split

training_data, validation_data = train_test_split(dfinal[training_features], test_size=0.2, random_state=25)

print("Number of training samples : " , len(training_data))
print("Number of validation samples : " , len(validation_data))

#training_data.head(10)
#validation_data.head(10)

Number of training samples :  54650
Number of validation samples :  13663


In [33]:
## Save training and validation data back into S3 
## Write to csv in S3 without headers and index column.
prefix = 'sagemaker-featurestore-weather'
training_data.to_csv('training_dataset.csv', header=False, index=False)
s3_client.upload_file('training_dataset.csv', s3_bucket_name, prefix+'/training_input/training_dataset.csv')

validation_data.to_csv('validation_dataset.csv', header=False, index=False)
s3_client.upload_file('validation_dataset.csv', s3_bucket_name, prefix+'/validation_input/validation_dataset.csv')

In [34]:
training_data.shape

(54650, 15)

In [35]:
training_data

Unnamed: 0,value,ismobile,year,month,day,location,city,sourcename,sourcetype,no2,o3,pm10,pm25,so2,co
28762,0.0,0,2016,1,28,160.0,14.0,21.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
62410,0.0,0,2017,10,27,1275.0,1381.0,52.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
16062,0.0,0,2018,1,17,6627.0,2287.0,12.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
26461,0.0,0,2017,9,25,28.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10887,0.0,0,2017,11,27,969.0,473.0,12.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59314,0.0,0,2015,11,17,431.0,110.0,9.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
33943,0.0,0,2017,1,10,662.0,1087.0,8.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0
35702,0.0,0,2017,10,3,81.0,5.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
6618,0.0,0,2018,1,4,6093.0,174.0,6.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [36]:
validation_data

Unnamed: 0,value,ismobile,year,month,day,location,city,sourcename,sourcetype,no2,o3,pm10,pm25,so2,co
11088,0.0,0,2016,11,7,4010.0,1485.0,8.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
47919,0.0,0,2016,8,7,497.0,991.0,7.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
58541,0.0,0,2016,9,16,4337.0,103.0,17.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
30070,0.0,0,2016,9,25,3259.0,474.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
47704,0.0,0,2017,11,24,7652.0,453.0,6.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15489,0.0,0,2017,11,26,50.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
41331,0.0,0,2017,6,20,3390.0,181.0,15.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
13194,0.0,0,2016,10,25,168.0,346.0,12.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
68238,0.0,0,2017,7,11,8108.0,1398.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### 5. Train and deploy the model

In [37]:
# define the data type and paths to the training and validation datasets
content_type = "csv"
train_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket_name, prefix, 'training_input'), content_type=content_type, distribution='ShardedByS3Key')
validation_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket_name, prefix, 'validation_input'), content_type=content_type, distribution='ShardedByS3Key')

In [38]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"5"}

# set an output path where the trained model will be saved
output_path = 's3://{}/{}/output'.format(s3_bucket_name, 'xgboost')

print(output_path)

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.12xlarge', 
                                          volume_size=200, # 5 GB 
                                          output_path=output_path)



#execute the XGBoost training job
estimator.fit({'train': train_input})

s3://sagemaker-us-west-2-802439482869/xgboost/output
2021-08-07 16:38:18 Starting - Starting the training job...
2021-08-07 16:38:41 Starting - Launching requested ML instancesProfilerReport-1628354298: InProgress
......
2021-08-07 16:39:42 Starting - Preparing the instances for training......
2021-08-07 16:40:46 Downloading - Downloading input data...
2021-08-07 16:41:10 Training - Downloading the training image..[34m[2021-08-07 16:41:26.512 ip-10-0-101-215.us-west-2.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None[0m
[34mINFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training[0m
[34mINFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:squarederror to Json.[0m
[34mReturning the value itself[0m
[34mINFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)[0m
[34mINFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode[0m
[34mINFO:root:Determined delimiter o

In [39]:
#Deploy the model
predictor = estimator.deploy(initial_instance_count = 1, instance_type = 'ml.t2.medium')

-----------------!

### 6. Use features from feature store during inference

Goal is to find the future value of a given particulate for a given location.  The inference request from the client will therefore contain the location id, future date and name of the particulate.  This inference request will be enhanced with location and weather specific information from the online feature groups before making a prediction.


In [40]:
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer

predictor.serializer=CSVSerializer(),
predictor.deserializer=JSONDeserializer()


In [41]:
#Fields in original inference request
location = '166.0'
date_yy = '2021'
date_mm = '03'
date_dd = '26'
particulate_to_predict='no2'

Start with the fields in the original inference request and enhance the inference request with location and weather related information

In [42]:
#Since we are predicting value for no2, that field gets a 1.0 and all other values are set to 0.0 to match the encoding done during data prep.
no2_value=1.0
o3_value=0.0
pm10_value=0.0
pm25_value=0.0
so2_value=0.0
co_value=0.0

##From location FG, get city, sourcename, sourcetype
location_fg_response = sagemaker_fs_runtime_client.get_record(FeatureGroupName=location_feature_group_name, 
                                       RecordIdentifierValueAsString=location)
print(location_fg_response)

location_record=location_fg_response['Record']

cityValue = ""
sourcenameValue = ""
sourcetypeValue = ""
countryValue = ""

for i in location_record:
    featureName = i['FeatureName']
    
    if(featureName == 'city'):
        cityValue = i['ValueAsString']
    elif(featureName == 'country'):
        countryValue = i['ValueAsString']
    elif(featureName == 'sourcename'):
        sourcenameValue = i['ValueAsString']
    elif(featureName == 'sourcetype'):
        sourcetypeValue = i['ValueAsString']


##From weather_data FG, get value, year, month, day, no2, 03, pm10, pm25, so2, co
weather_fg_response = sagemaker_fs_runtime_client.get_record(FeatureGroupName=weather_data_feature_group_name, 
                                       RecordIdentifierValueAsString=location)
print(weather_fg_response)

weather_record=weather_fg_response['Record']

isbadairValue = ""
valueValue = ""
ismobileValue = ""
yearValue = date_yy
monthValue = date_mm
dayValue = date_dd

for i in weather_record:
    featureName = i['FeatureName']
    if(featureName == 'isbadair'):
        isbadairValue = i['ValueAsString']
    elif(featureName == 'value'):
        valueValue = i['ValueAsString']
    elif(featureName == 'ismobile'):
        ismobileValue = i['ValueAsString']
        
enhanced_inference_request =  yearValue + "," + monthValue + "," + dayValue + "," + ","
enhanced_inference_request += location + "," 
enhanced_inference_request += cityValue + "," + cityValue + "," + sourcenameValue + "," + sourcetypeValue + ","
enhanced_inference_request += str(no2_value) + "," + str(o3_value) + "," + str(pm10_value) + "," + str(pm25_value) + "," + str(so2_value) + "," + str(co_value)


print("Enhanced inference request")
print(enhanced_inference_request)


{'ResponseMetadata': {'RequestId': 'b9af1299-5cb2-4897-aa96-2197d336e054', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'b9af1299-5cb2-4897-aa96-2197d336e054', 'content-type': 'application/json', 'content-length': '318', 'date': 'Sat, 07 Aug 2021 17:05:12 GMT'}, 'RetryAttempts': 0}, 'Record': [{'FeatureName': 'location', 'ValueAsString': '166.0'}, {'FeatureName': 'city', 'ValueAsString': '14.0'}, {'FeatureName': 'sourcename', 'ValueAsString': '21.0'}, {'FeatureName': 'country', 'ValueAsString': '21.0'}, {'FeatureName': 'sourcetype', 'ValueAsString': '0.0'}, {'FeatureName': 'EventTime', 'ValueAsString': '1628353212'}]}
{'ResponseMetadata': {'RequestId': 'c8cdca5f-59b5-4d9b-aca1-54cf1c624130', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'c8cdca5f-59b5-4d9b-aca1-54cf1c624130', 'content-type': 'application/json', 'content-length': '746', 'date': 'Sat, 07 Aug 2021 17:05:12 GMT'}, 'RetryAttempts': 0}, 'Record': [{'FeatureName': 'value', 'ValueAsString': '0.0'}, {

##### Predict

In [43]:
ENDPOINT_NAME=predictor.endpoint
response = sagemaker_runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='text/csv',
                                        Body=enhanced_inference_request)
                               

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [44]:
result = json.loads(response['Body'].read().decode())
print(result)

0.16387253999710083


### 7. Clean up

Uncomment the below to delete the endpoint and feature groups created

In [45]:
sagemaker_client.delete_endpoint(EndpointName=ENDPOINT_NAME)

{'ResponseMetadata': {'RequestId': 'ae8f8775-1224-4021-85f7-1f124b05dd7a',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ae8f8775-1224-4021-85f7-1f124b05dd7a',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Sat, 07 Aug 2021 17:05:29 GMT'},
  'RetryAttempts': 0}}

In [46]:
sagemaker_client.delete_feature_group(FeatureGroupName=weather_data_feature_group_name)
sagemaker_client.delete_feature_group(FeatureGroupName=location_feature_group_name)

{'ResponseMetadata': {'RequestId': 'a1208ffe-717f-4860-81ae-03ed7ce5d59e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'a1208ffe-717f-4860-81ae-03ed7ce5d59e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Sat, 07 Aug 2021 17:05:31 GMT'},
  'RetryAttempts': 1}}