## Using SageMaker Feature Store for Training and Inference

In this practical example, we will learn how to use SageMaker Feature Store to ingest and process data, and then use that data at training and inference time. For this, we will use the `IMDB movie reviews` dataset. We will tokenize the movie reviews and store the tokenized data in Feature Store, so that we don't have to re-tokenize the dataset the next time we want to use it. After that, we will train a model to classify reviews as positive or negative using the data saved in Feature Store. Then we will deploy the trained model and use data from Feature Store for model inference.

### Prerequisites
We use the SageMaker Feature Store SDK to interact with the Feature Store APIs. We use the HuggingFace [Datasets](https://huggingface.co/docs/datasets/) and [Transformers](https://huggingface.co/docs/transformers/index) libraries to tokenize the text and run training and inference. Please make sure that these libraries are installed. You can also install the required dependencies by running the following command:

In [None]:
! pip install -r requirements.txt

### Preparing Dataset 

Before we begin, let's import the SageMaker libraries and initiate a SageMaker session.

In [None]:

import boto3
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()  # or replace with a role ARN when running in a local environment
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()

Our first step is to acquire the initial dataset with IMDB reviews. For this, we use the HuggingFace `datasets` library.

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")

We then convert the dataset to a Pandas `DataFrame` instance, which is supported by Feature Store.

In [None]:
import pandas as pd
import time

dataset_df = dataset['train'].to_pandas()

Below we cast the data types to ones supported by Feature Store. Note that we also add the metadata fields `EventTime` and `ID`. Both are required by Feature Store to support fast retrieval and feature versioning.

In [None]:
current_time_sec = int(round(time.time()))
dataset_df["EventTime"] = pd.Series([current_time_sec]*len(dataset_df), dtype="float64")
dataset_df["ID"] = dataset_df.index
dataset_df["text"] = dataset_df["text"].astype('string')
dataset_df["text"] = dataset_df["text"].str.encode("utf8")
dataset_df["text"] = dataset_df["text"].astype('string')
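
Feature Store accepts the event time either as a fractional Unix timestamp in seconds (the form used above) or as an ISO-8601 string in UTC. As a minimal sketch, the snippet below renders the same instant in both forms using only the standard library. The `event_time_formats` helper is a name chosen for this illustration and is not part of any SDK.

```python
from datetime import datetime, timezone


def event_time_formats(epoch_seconds: float) -> dict:
    """Return one instant as a Unix timestamp and as an ISO-8601 UTC string."""
    as_datetime = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {
        # Suitable for an event time feature of type Fractional
        "fractional": float(epoch_seconds),
        # Suitable for an event time feature of type String
        "iso8601": as_datetime.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


formats = event_time_formats(1_700_000_000)
print(formats["fractional"], formats["iso8601"])
```

Whichever form is chosen, the type of the event time column has to match the feature definition of the feature group.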

Now, let's download the pre-trained tokenizer for the DistilBERT model and add a new feature, `tokenized-text`, to our dataset. Note that we cast the tokenized text to a string, as SageMaker Feature Store doesn't support collection data types such as arrays or maps.

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased',proxies = {})

dataset_df["tokenized-text"] = tokenizer(dataset_df["text"].tolist(), truncation=True, padding=True)["input_ids"]
dataset_df["tokenized-text"] = dataset_df["tokenized-text"].astype('string')
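
Because the token ids are stored as a string, they have to be parsed back into a list of integers before a model can consume them. The sketch below shows the round trip, assuming the stored value is the plain `str()` form of a Python list such as `[101, 2023, 102]`. The helper names and the sample ids are illustrative.

```python
import json


def encode_token_ids(token_ids):
    """Serialize a list of token ids into its string form for storage."""
    return str(list(token_ids))


def decode_token_ids(serialized):
    """Parse the stored string back into a list of integers."""
    return [int(token_id) for token_id in json.loads(serialized)]


original = [101, 2023, 3185, 2001, 2307, 102, 0, 0]  # sample ids with padding
stored = encode_token_ids(original)
restored = decode_token_ids(stored)

print(stored)
print(restored == original)
```

Parsing with `json.loads` raises an error on malformed input, which is usually preferable to silently producing wrong ids.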

As a result, we have a Pandas DataFrame object with the features we want to ingest into Feature Store. Execute the cell below to preview the resulting dataset with tokenized inputs.

In [None]:
dataset_df

## Ingesting Data

The next step is to provision Feature Store resources and prepare them for ingestion. We start by configuring a `feature group` and preparing `feature definitions`. Note that since we stored our dataset in a Pandas DataFrame, Feature Store can infer the feature definitions automatically from the dataframe data types.

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup

imdb_feature_group_name = "imdb-reviews-tokenized"

imdb_feature_group = FeatureGroup(
    name=imdb_feature_group_name, sagemaker_session=sagemaker_session
)

imdb_feature_group.load_feature_definitions(data_frame=dataset_df)
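
As a rough illustration of what this automatic inference amounts to, the sketch below maps column dtype names to the scalar Feature Store feature types (`Integral`, `Fractional`, `String`). It is a simplified stand-in written for this example, not the SDK's implementation. The `infer_feature_types` helper and the hand-written schema are hypothetical.

```python
# Simplified mapping from dtype names to Feature Store feature types
FEATURE_TYPE_BY_DTYPE = {
    "int64": "Integral",
    "float64": "Fractional",
    "string": "String",
    "object": "String",
}


def infer_feature_types(column_dtypes):
    """Map each column name to a feature type, failing on unknown dtypes."""
    feature_types = {}
    for column, dtype in column_dtypes.items():
        if dtype not in FEATURE_TYPE_BY_DTYPE:
            raise ValueError(f"Unsupported dtype {dtype!r} for column {column!r}")
        feature_types[column] = FEATURE_TYPE_BY_DTYPE[dtype]
    return feature_types


# Hand-written schema mirroring the columns prepared earlier
schema = {
    "text": "string",
    "label": "int64",
    "EventTime": "float64",
    "ID": "int64",
    "tokenized-text": "string",
}
feature_types = infer_feature_types(schema)
print(feature_types)
```

This is also why the columns were cast explicitly beforehand: a column left with an unexpected dtype would not map to the intended feature type.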

We are now ready to create our `feature group`, a logical unit that includes multiple data records and their associated features. Creation may take several minutes, so we add a waiter function. Since we are planning to use both online and offline storage, we set the flag `enable_online_store=True`:

In [None]:
imdb_feature_group.create(
    s3_uri=f"s3://{s3_bucket_name}/{imdb_feature_group_name}",
    record_identifier_name="ID",
    event_time_feature_name="EventTime",
    role_arn=role,
    enable_online_store=True
)

# A function to wait for FeatureGroup creation
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get('FeatureGroupStatus')
    print(f'Initial status: {status}')
    while status == 'Creating':
        print(f'Waiting for feature group: {feature_group.name} to be created ...')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    if status != 'Created':
        raise SystemExit(f'Failed to create feature group {feature_group.name}: {status}')
    print(f'FeatureGroup {feature_group.name} was successfully created.')

wait_for_feature_group_creation_complete(imdb_feature_group)
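
The waiter above keeps polling for as long as the status stays `Creating`. A hedged variant with an upper bound on the waiting time is sketched below. `poll_until` and its parameters are hypothetical names, and the status source is a stub standing in for `feature_group.describe().get('FeatureGroupStatus')`.

```python
import time


def poll_until(get_status, pending="Creating", success="Created",
               timeout_sec=600.0, interval_sec=5.0):
    """Poll get_status() until it leaves the pending state or the timeout expires."""
    deadline = time.monotonic() + timeout_sec
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {pending!r} after {timeout_sec} seconds")
        time.sleep(interval_sec)
        status = get_status()
    if status != success:
        raise RuntimeError(f"Unexpected final status: {status!r}")
    return status


# Stub that reports 'Creating' twice and then 'Created'
_stub_statuses = iter(["Creating", "Creating", "Created"])
final_status = poll_until(lambda: next(_stub_statuses), interval_sec=0.0)
print(final_status)
```

Raising an exception instead of calling `SystemExit` lets the caller decide how to handle a failed or slow creation.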

Once the feature group is created, we can ingest our dataset using the `.ingest()` method. You can set the maximum number of ingestion processes that run in parallel via `max_processes`. Ingestion may take several minutes to complete.

In [None]:
import os

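# Set TOKENIZERS_PARALLELISM explicitly to silence the HuggingFace tokenizers
# warning that is otherwise emitted when ingestion forks worker processes
os.environ["TOKENIZERS_PARALLELISM"] = "true"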

imdb_feature_group.ingest(data_frame=dataset_df, max_processes=16, wait=True)
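
To give an intuition for `max_processes`, the sketch below splits a range of row indices into contiguous, near-equal chunks, one per worker. It illustrates the general idea of parallel ingestion only and is an assumption about the approach, not a description of how the SDK partitions data internally. `split_rows` is a hypothetical helper, and 25,000 is the size of the IMDB training split.

```python
def split_rows(num_rows, num_workers):
    """Split range(num_rows) into contiguous (start, end) chunks, one per worker."""
    base, extra = divmod(num_rows, num_workers)
    bounds = []
    start = 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional row each
        size = base + (1 if worker < extra else 0)
        if size == 0:
            continue  # more workers than rows: skip empty chunks
        bounds.append((start, start + size))
        start += size
    return bounds


chunks = split_rows(25_000, 16)
print(len(chunks), chunks[0], chunks[-1])
```

With 16 workers, each chunk holds roughly 1,560 rows, which is why raising `max_processes` shortens ingestion until API throttling or the available CPU cores become the limit.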

Once the data is ingested, we can explore our dataset using SQL queries. Feature Store supports querying data with the Amazon Athena SQL engine. See more details on Athena querying capabilities [here](https://docs.aws.amazon.com/athena/latest/ug/ddl-sql-reference.html). For instance, we can run a query to check whether we are working with a balanced dataset (in other words, a dataset where the number of records per class of the target label is approximately equal). The query below takes a moment to run, but in the end you should get a count of each label in our dataset. Feel free to experiment with other queries.

In [None]:
athena_query = imdb_feature_group.athena_query()
imdb_table_name = athena_query.table_name
label_count_query = (
    f'SELECT "label", COUNT("label") AS "Count" '
    f'FROM "sagemaker_featurestore"."{imdb_table_name}" GROUP BY "label";'
)
result = athena_query.run(label_count_query, output_location=f"s3://{s3_bucket_name}/athena_output")
athena_query.wait()
print(f"Counting labels in dataset: \n {athena_query.as_dataframe()}")
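
Once the counts are available, balance can be expressed as the ratio between the smallest and the largest class. A minimal sketch follows. The `balance_ratio` helper is illustrative, and the counts are made-up placeholders rather than actual query output.

```python
def balance_ratio(label_counts):
    """Return smallest/largest class count: 1.0 means perfectly balanced."""
    counts = list(label_counts.values())
    if not counts or max(counts) == 0:
        raise ValueError("label_counts must contain at least one non-empty class")
    return min(counts) / max(counts)


# Placeholder counts for illustration only, not real results
example_counts = {0: 480, 1: 520}
ratio = balance_ratio(example_counts)
print(f"Balance ratio: {ratio:.3f}")
```

A ratio close to 1.0 indicates that plain accuracy is a reasonable metric; a much lower ratio would call for class weighting or a different metric.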


Finally, let's record the S3 location where the data is stored in Feature Store Offline Storage. We will use it as an input to our training job later.

In [None]:
train_dataset_uri = imdb_feature_group.describe()['OfflineStoreConfig']["S3StorageConfig"]["ResolvedOutputS3Uri"]

## Using Feature Store for Training 

Now that we have data available in Feature Store, let's train our binary classification model on it. Note that the training job uses `Feature Store Offline Storage`.

### Preparing Training Script
We already prepared a training script for our job. Run the cell below to review it. There are several key blocks in our training script:
- As the data is stored as parquet files in Offline Storage, we read it using the Pandas `read_parquet()` function (line #49). We then use the resulting dataframe to instantiate a `dataset` instance (line #51), which will be used during model training.
- We instantiate the DistilBERT model for binary classification (lines #67-#73). HuggingFace handles downloading the model weights from its public Model Hub.
- We use the HuggingFace `Trainer` class to configure and execute model training (lines #75-#96).

In [None]:
! pygmentize 1_sources/train.py
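
The script itself is not reproduced in this section. As a hedged sketch of the data-loading step, note that the offline store writes parquet files under nested, time-partitioned prefixes, so a training script generally has to search its input directory recursively before reading them. The `find_parquet_files` helper and the directory and file names below are assumptions made for illustration, not excerpts from `train.py`.

```python
import tempfile
from pathlib import Path


def find_parquet_files(data_dir):
    """Recursively collect all parquet files below data_dir, sorted by path."""
    return sorted(Path(data_dir).rglob("*.parquet"))


# Build a throwaway directory tree that mimics time-partitioned storage
with tempfile.TemporaryDirectory() as tmp:
    partition = Path(tmp) / "year=2023" / "month=01" / "day=01" / "hour=00"
    partition.mkdir(parents=True)
    (partition / "part-0.parquet").touch()
    (partition / "part-1.parquet").touch()
    (Path(tmp) / "notes.txt").touch()  # non-parquet files are ignored
    found = [path.name for path in find_parquet_files(tmp)]

print(found)
```

The collected paths can then be read one by one and concatenated into a single dataframe.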

### Running Training Job


Once the training script is ready, we can configure and execute the training job. We start by pinning the versions of the frameworks we use.

In [6]:
PYTHON_VERSION = "py38"
PYTORCH_VERSION = "1.10.2"
TRANSFORMER_VERSION = "4.17.0"

Run the cell below to start the training job. Feel free to experiment with the hyperparameters of the training job.

In [None]:
from sagemaker.huggingface.estimator import HuggingFace

estimator = HuggingFace(
    py_version=PYTHON_VERSION,
    entry_point="train.py",
    source_dir="1_sources",
    pytorch_version=PYTORCH_VERSION,
    transformers_version=TRANSFORMER_VERSION,
    hyperparameters={
        "model_name": "distilbert-base-uncased",
        "train_batch_size": 16,
        "epochs": 3,
        # "max_steps": 100,  # uncomment to shorten the training cycle; remove in a real scenario
    },
    instance_type="ml.p2.xlarge",
    debugger_hook_config=False,
    disable_profiler=True,
    instance_count=1,
    role=role
)


estimator.fit(train_dataset_uri)

Depending on the hyperparameters selected (specifically, the number of training steps or epochs), training may take some time. Feel free to set these parameters to lower values to expedite training.
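
To estimate how much work a given configuration implies, the number of optimizer steps can be computed from the dataset size, batch size, and epoch count. The sketch below assumes a single device, no gradient accumulation, a kept final partial batch, and that all 25,000 reviews of the IMDB training split are used for training. The actual script may hold out part of the data, so treat the figure as an upper-bound estimate.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_training_steps(num_examples=25_000, batch_size=16, epochs=3)
print(steps)  # 1563 steps per epoch * 3 epochs = 4689
```

Compared with this, capping training at 100 steps processes only 1,600 reviews, which is enough for a smoke test but not for a well-trained model.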

## Using Feature Store for Inference

Now let's see how we can use data stored in Feature Store at inference time. We will use `Feature Store Online Storage` to fetch records of our dataset and send them as input to the inference endpoint. We start by deploying the trained model to a SageMaker Real-Time endpoint.

In [None]:
model = estimator.create_model(role=role, 
                               entry_point="inference.py", 
                               source_dir="1_sources",
                              )

In [None]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

Once the model is deployed, let's fetch records from Online Storage and use them as inputs for model inference. For this, AWS provides the Feature Store Runtime client as part of the `boto3` library.

In [None]:
import boto3

client = boto3.client('sagemaker-featurestore-runtime')

We can now fetch records using their record IDs. For test purposes, we retrieve the first 3 records from Feature Store using their respective IDs.

In [None]:
response = client.batch_get_record(
    Identifiers=[
        {
            'FeatureGroupName':imdb_feature_group.name,
            'RecordIdentifiersValueAsString': ["0", "1", "2"], # picking several records to run inference.
            'FeatureNames': [
                'tokenized-text', "label", 'text'
            ]
        },
    ]
)
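
`batch_get_record` caps how many record identifiers a single request may carry, so larger lookups need to be split into batches. The sketch below builds the `Identifiers` payloads in chunks. The default of 100 ids per request is an assumption that should be verified against the current API reference, and `build_identifier_batches` is a hypothetical helper.

```python
def build_identifier_batches(feature_group_name, record_ids, feature_names,
                             max_ids_per_request=100):
    """Build one `Identifiers` payload per chunk of record ids."""
    batches = []
    for start in range(0, len(record_ids), max_ids_per_request):
        chunk = record_ids[start:start + max_ids_per_request]
        batches.append([
            {
                "FeatureGroupName": feature_group_name,
                # Record identifiers are always passed as strings
                "RecordIdentifiersValueAsString": [str(rid) for rid in chunk],
                "FeatureNames": list(feature_names),
            }
        ])
    return batches


batches = build_identifier_batches(
    "imdb-reviews-tokenized", list(range(250)), ["tokenized-text", "label", "text"]
)
batch_sizes = [len(b[0]["RecordIdentifiersValueAsString"]) for b in batches]
print(batch_sizes)
```

Each element of `batches` can then be passed as the `Identifiers` argument of a separate `batch_get_record` call.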

Next, we process the fetched records according to the model's requirements.

In [None]:
# preparing the inference payload
labels = []
input_ids = []
texts = []

for record in response["Records"]:
    for feature in record["Record"]:
        if feature["FeatureName"] == "label":
            labels.append(feature["ValueAsString"])
        elif feature["FeatureName"] == "tokenized-text":
            # token ids were stored as the string form of a list, e.g. "[101, 2023, ...]"
            list_of_str = feature["ValueAsString"].strip("][").split(", ")
            input_ids.append([int(el) for el in list_of_str])
        elif feature["FeatureName"] == "text":
            texts.append(feature["ValueAsString"])

print(f"Sample label value: {labels[0]}")
print(f"Sample list of token ids:\n{input_ids[0]}")
print(f"Sample review text:\n{texts[0]}")
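
An alternative to three parallel lists is to turn each returned record into a dictionary keyed by feature name, which keeps the values belonging to one review together. The sketch below runs on a hand-written response that mimics the shape of the `batch_get_record` output. The values in it are made up for illustration, and `records_to_dicts` is a hypothetical helper.

```python
import json


def records_to_dicts(response):
    """Convert each returned record into a {feature name: value} dictionary."""
    rows = []
    for record in response["Records"]:
        row = {f["FeatureName"]: f["ValueAsString"] for f in record["Record"]}
        if "tokenized-text" in row:
            # Parse the stored string form of the list back into integers
            row["tokenized-text"] = json.loads(row["tokenized-text"])
        rows.append(row)
    return rows


# Hand-written stand-in for a real response; values are placeholders
mock_response = {
    "Records": [
        {
            "FeatureGroupName": "imdb-reviews-tokenized",
            "RecordIdentifierValueAsString": "0",
            "Record": [
                {"FeatureName": "label", "ValueAsString": "0"},
                {"FeatureName": "tokenized-text", "ValueAsString": "[101, 2023, 102]"},
                {"FeatureName": "text", "ValueAsString": "a placeholder review"},
            ],
        }
    ]
}
rows = records_to_dicts(mock_response)
print(rows[0])
```

All values arrive as strings, so numeric features such as `label` still need an explicit cast before being compared with model output.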


We then send inference requests to the SageMaker endpoint. Note that, depending on how long you trained your model, inference results and model accuracy may vary.

In [None]:
for i in range(len(labels)):
    prediction = predictor.predict([texts[i]])
    print(f"Sample index: {i}; predicted label: {prediction[0]['label']}; confidence score: {prediction[0]['score']}")

## Resource Cleanup

Execute the cell below to clean up all SageMaker resources and avoid unnecessary costs.

In [None]:
# Delete SageMaker model and endpoint
predictor.delete_endpoint(delete_endpoint_config=True)
model.delete_model()

# Delete Feature Store resources
imdb_feature_group.delete()