## Assignment background:

Machine Learning Engineers are often in charge of the entire ML Platform, which Data Scientists and Data Analysts will use to build models. This includes the Feature Store. Feature Stores are powerful tools that allow Data Scientists and Data Analysts to share and discover features across Machine Learning Systems. As you have learned during this module, feature engineering can be a daunting task that comprises the lion’s share of ML system development time and can really make or break an ML system. Leveraging a Feature Store allows us to share feature engineering work across different models so we do not have to start from scratch on each new model. Feature Stores also have some additional benefits in providing the latest data for real-time ML systems. Machine Learning Engineers should be familiar with setting up a Feature Store and interacting with a Feature Store.

In [None]:
# Matt Thompson
# Assignment 3.1
# AAI 540

# The first few cells here are about getting the current working directory, getting the data from the two data files, and reading the data.

In [None]:
%pwd

'/home/sagemaker-user'

In [None]:
%ls

Untitled.ipynb  Untitled1.ipynb  aai-540-labs/  user-default-efs@


In [None]:
# Let's try to read the data in from the two data files

import pandas as pd
import time
import os
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker import get_execution_role

# Define the subdirectory path
DATA_PATH = './aai-540-labs/lab-3-1-sagemaker-feature-store/'

# This setup is important for creating the Feature Group later.
sagemaker_session = Session()
region = sagemaker_session.boto_region_name
default_bucket = sagemaker_session.default_bucket()
prefix = 'featurestore-neighborhood-data'
s3_uri = f's3://{default_bucket}/{prefix}'
print("AWS and SageMaker session established.")

# Load the data
print(f"\nLoading raw data files from: {DATA_PATH}...")
try:
    housing_filepath = os.path.join(DATA_PATH, 'housing.csv')
    maps_filepath = os.path.join(DATA_PATH, 'maps.csv')

    housing_df = pd.read_csv(housing_filepath)
    maps_df = pd.read_csv(maps_filepath)
    print("Data files loaded successfully.")
except FileNotFoundError as e:
    print("Error: One or both files not found. Please ensure the path is correct and the files exist:")
    print(f"  Housing file expected at: {housing_filepath}")
    print(f"  Maps file expected at: {maps_filepath}")
    print(f"Details: {e}")
    # Empty dataframes if error
    housing_df = pd.DataFrame()
    maps_df = pd.DataFrame()

# Print DataFrames Head
print("\n--- housing.csv Head ---")
if not housing_df.empty:
    print(housing_df.head())
else:
    print("Dataframe is empty due to loading error.")

print("\n--- maps.csv Head ---")
if not maps_df.empty:
    print(maps_df.head())
else:
    print("Dataframe is empty due to loading error.")

AWS and SageMaker session established.

Loading raw data files from: ./aai-540-labs/lab-3-1-sagemaker-feature-store/...
Data files loaded successfully.

--- housing.csv Head ---
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341

# Now that we have the data, print out the column names as we will need these later

In [None]:
# Let's print out the column names

print("--- housing.csv Columns ---")
# Check if DataFrame is not empty before printing columns
if not housing_df.empty:
    print(housing_df.columns.tolist())
else:
    print("housing_df is empty. Please re-check the file path and loading process.")

print("\n--- maps.csv Columns ---")
if not maps_df.empty:
    print(maps_df.columns.tolist())
else:
    print("maps_df is empty. Please re-check the file path and loading process.")

--- housing.csv Columns ---
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value', 'ocean_proximity']

--- maps.csv Columns ---
['street_number', 'route', 'locality-political', 'administrative_area_level_2-political', 'administrative_area_level_1-political', 'country-political', 'postal_code', 'address', 'longitude', 'latitude', 'neighborhood-political', 'postal_code_suffix', 'establishment-point_of_interest-transit_station', 'establishment-park-point_of_interest', 'premise', 'establishment-point_of_interest-subway_station-transit_station', 'airport-establishment-finance-moving_company-point_of_interest-storage', 'subpremise', 'bus_station-establishment-point_of_interest-transit_station', 'establishment-park-point_of_interest-tourist_attraction', 'establishment-natural_feature', 'airport-establishment-point_of_interest', 'political-sublocality-sublocality_level_1', 'administrative_area_level_3-

In [None]:
# Let's try some feature engineering and data preparation
#  Begin step 2

import pandas as pd
import time
import numpy as np

# These are the required feature names for the Feature Group
RECORD_IDENTIFIER_FEATURE_NAME = 'primary_key'
EVENT_TIME_FEATURE_NAME = 'event_time'

# Try to Create Common Geographical Key for Joining
# Round coordinates (to 3 decimal places) to create common key
maps_df['Geo_Join_Key'] = maps_df['latitude'].round(3).astype(str) + '_' + maps_df['longitude'].round(3).astype(str)
housing_df['Geo_Join_Key'] = housing_df['latitude'].round(3).astype(str) + '_' + housing_df['longitude'].round(3).astype(str)

# Select map keys and drop duplicates for unique neighborhood name
map_keys = maps_df[['Geo_Join_Key', 'neighborhood-political']].dropna(subset=['neighborhood-political']).drop_duplicates(subset=['Geo_Join_Key'])


# Join DataFrames to associate houses with a neighborhood
merged_df = pd.merge(
    housing_df,
    map_keys,
    on='Geo_Join_Key',
    how='left'
).dropna(subset=['neighborhood-political'])

# Limit 'median_house_value' at $500,000
merged_df['capped_median_house_value'] = merged_df['median_house_value'].clip(upper=500000)

# Need to create discrete 'housing_median_age' (0-9, 10-19, etc.)
merged_df['discretized_median_house_age'] = (merged_df['housing_median_age'] // 10) * 10

# Calculate 'bedrooms per household' ratio
merged_df['bedrooms_per_household'] = merged_df['total_bedrooms'] / merged_df['households']

# One-Hot Encode 'ocean_proximity'
# Make sure all 5 required categories exist, filling missing ones with 0 to prevent errors
required_ocean_categories = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
ohe_df = pd.get_dummies(merged_df['ocean_proximity'], prefix='', prefix_sep='').astype(int)
ohe_df = ohe_df.reindex(columns=required_ocean_categories, fill_value=0)

merged_df = pd.concat([merged_df, ohe_df], axis=1)

# Aggregate up to the Neighborhood Level
# Group by the primary key: 'neighborhood-political'
neighborhood_df = merged_df.groupby('neighborhood-political').agg(

    # Median house value (capped & averaged)
    median_house_value=('capped_median_house_value', 'mean'),

    # Median house age (discretized & averaged)
    median_house_age=('discretized_median_house_age', 'mean'),

    # Total households (averaged)
    total_households=('households', 'mean'),

    # Bedrooms per household (averaged)
    bedrooms_per_household=('bedrooms_per_household', 'mean'),

    # One-Hot Encoded proportions (averaged OHE columns)
    ocean_lt_1h_prop=('<1H OCEAN', 'mean'),
    ocean_inland_prop=('INLAND', 'mean'),
    ocean_island_prop=('ISLAND', 'mean'),
    ocean_near_bay_prop=('NEAR BAY', 'mean'),
    ocean_near_ocean_prop=('NEAR OCEAN', 'mean')
).reset_index()

# Final Data Cleaning and Feature Store Setup

# Rename Primary Key column
neighborhood_df.rename(columns={'neighborhood-political': RECORD_IDENTIFIER_FEATURE_NAME}, inplace=True)

# Round 'total_households' up to the nearest whole number
neighborhood_df['total_households'] = np.ceil(neighborhood_df['total_households']).astype(int)

# Rename One-Hot Encoded columns to final names
# Needed to replace spaces and special characters with underscores/standard characters (this caused errors initially)
neighborhood_df.rename(columns={
    'ocean_lt_1h_prop': 'ocean_lt_1h',
    'ocean_inland_prop': 'inland',
    'ocean_island_prop': 'island',
    'ocean_near_bay_prop': 'ocean_near_bay',
    'ocean_near_ocean_prop': 'ocean_near_ocean'
}, inplace=True)

# Add Event Time
current_time_sec = int(round(time.time()))
neighborhood_df[EVENT_TIME_FEATURE_NAME] = pd.Series([current_time_sec] * len(neighborhood_df), dtype="float64")

# Print to verify
print("✅ Step 2 (Feature Engineering) Complete. Final DataFrame Head:")
print(neighborhood_df.head())
print("\nFinal Feature Group Schema (Names and Types):")
print(neighborhood_df.info())

✅ Step 2 (Feature Engineering) Complete. Final DataFrame Head:
                      primary_key  median_house_value  median_house_age  \
0                        28 Palms       222200.000000              20.0   
1                Acorn Industrial        81300.000000              50.0   
2                      Adams Hill       250733.333333              35.0   
3  Agua Mansa Industrial Corridor       112300.000000              10.0   
4                        Al Tahoe       109180.000000              20.0   

   total_households  bedrooms_per_household  ocean_lt_1h  inland  island  \
0               923                1.017335          1.0     0.0     0.0   
1               147                1.659864          0.0     0.0     0.0   
2               494                1.034649          1.0     0.0     0.0   
3               516                1.102713          0.0     1.0     0.0   
4               249                1.641739          0.0     1.0     0.0   

   ocean_near_bay  ocean_near
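
The capping, one-hot encoding, and neighborhood-level averaging used in the cell above can be tried on a tiny frame before running them on the full data. The sketch below is only an illustration: the `toy` DataFrame, its values, and the output column names are made up for this example and are not taken from the assignment files.

```python
import pandas as pd

# Toy stand-in for merged_df: three blocks across two neighborhoods (made-up values)
toy = pd.DataFrame({
    "neighborhood": ["A", "A", "B"],
    "median_house_value": [600000.0, 300000.0, 100000.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND"],
})

# Cap house values at 500,000, as in the assignment step
toy["capped_value"] = toy["median_house_value"].clip(upper=500000)

# One-hot encode, then reindex so all five categories exist even if absent from the data
categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]
ohe = pd.get_dummies(toy["ocean_proximity"]).astype(int)
ohe = ohe.reindex(columns=categories, fill_value=0)
toy = pd.concat([toy, ohe], axis=1)

# Averaging a 0/1 column per group yields the share of blocks in that category
summary = toy.groupby("neighborhood").agg(
    mean_capped_value=("capped_value", "mean"),
    inland_share=("INLAND", "mean"),
    near_bay_share=("NEAR BAY", "mean"),
    island_share=("ISLAND", "mean"),
).reset_index()

print(summary)
```

Because the one-hot columns are averaged, a neighborhood whose blocks fall into more than one category ends up with fractional values, which is why these features are typed as fractional rather than integral.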

In [None]:
# Instantiate and create feature group
# Step 3 for this

from sagemaker.feature_store.feature_group import FeatureGroup
import time

# Define variables
# Duplicate to enable debugging of this cell
RECORD_IDENTIFIER_FEATURE_NAME = 'primary_key'
EVENT_TIME_FEATURE_NAME = 'event_time'

# Define unique, compliant name for the new Feature Group
FEATURE_GROUP_NAME = 'compliant-neighborhood-fg-' + time.strftime('%Y%m%d%H%M%S', time.gmtime())

# Instantiate feature group
neighborhood_feature_group = FeatureGroup(
    name=FEATURE_GROUP_NAME,
    sagemaker_session=sagemaker_session
)

# Load Feature Definitions (Schema Inference)
neighborhood_feature_group.load_feature_definitions(data_frame=neighborhood_df)

print(f"\nFeature Definitions inferred for {FEATURE_GROUP_NAME}:")
for feature in neighborhood_feature_group.feature_definitions:
    print(f"  - {feature.feature_name}: {feature.feature_type.value}")

# Create the Feature Group in AWS
print(f"\nCreating Feature Group: {FEATURE_GROUP_NAME}...")

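# Resolve the IAM execution role for this notebook
# (get_execution_role is imported in the setup cell above)
role = get_execution_role()

neighborhood_feature_group.create(
    s3_uri=s3_uri,
    record_identifier_name=RECORD_IDENTIFIER_FEATURE_NAME, # 'primary_key'
    event_time_feature_name=EVENT_TIME_FEATURE_NAME,       # 'event_time'
    role_arn=role,
    enable_online_store=True
)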

# Wait for creation to complete
print("Waiting for Feature Group to be Created (up to 10 minutes)...")
try:
    # Use the SDK's built-in wait capability
    neighborhood_feature_group.wait_for_feature_group_creation_complete(timeout=600)
    status = neighborhood_feature_group.describe().get('FeatureGroupStatus')
    print(f"✅ Feature Group Status: {status}")
except Exception as e:
    # If the wait times out, provide status check instruction
    current_status = neighborhood_feature_group.describe().get('FeatureGroupStatus')
    print(f"❌ Feature Group creation failed or timed out. Status: {current_status}")
    print("Please run the manual polling code from our previous attempts to confirm status and proceed.")


Feature Definitions inferred for compliant-neighborhood-fg-20250926085159:
  - primary_key: String
  - median_house_value: Fractional
  - median_house_age: Fractional
  - total_households: Integral
  - bedrooms_per_household: Fractional
  - ocean_lt_1h: Fractional
  - inland: Fractional
  - island: Fractional
  - ocean_near_bay: Fractional
  - ocean_near_ocean: Fractional
  - event_time: Fractional

Creating Feature Group: compliant-neighborhood-fg-20250926085159...
Waiting for Feature Group to be Created (up to 10 minutes)...
❌ Feature Group creation failed or timed out. Status: Creating
Please run the manual polling code from our previous attempts to confirm status and proceed.
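
The feature names listed in the output above contain only lowercase letters, digits, and underscores, whereas raw labels such as `<1H OCEAN` contain a symbol and a space. One way to derive conservative names automatically is sketched below. `to_feature_name` is a hypothetical helper written for this example, and it assumes (based on the errors mentioned in the earlier cell's comments) that spaces and special characters are not accepted in feature names; the exact naming rules should be checked against the Feature Store documentation.

```python
import re

def to_feature_name(raw: str) -> str:
    """Hypothetical helper: map an arbitrary column label to a conservative feature name."""
    name = raw.strip().lower()
    name = name.replace("<", "lt_")            # spell out the comparison sign
    name = re.sub(r"[^a-z0-9]+", "_", name)    # collapse runs of spaces/symbols into one underscore
    return name.strip("_")                     # no leading or trailing underscores

for label in ["<1H OCEAN", "NEAR BAY", "neighborhood-political"]:
    print(label, "->", to_feature_name(label))
```

The names produced here differ slightly from the hand-picked ones in the notebook (for example `lt_1h_ocean` instead of `ocean_lt_1h`); the point is only that the mapping is mechanical and repeatable.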


# Wanted to create a polling feature for the feature group creation process

In [None]:
# Poll feature
# Used Gemini for help here

import time

def wait_for_feature_group_creation(feature_group, timeout_minutes=15):
    """Polls the Feature Group status until it's Created or CreateFailed."""
    start_time = time.time()
    timeout_seconds = timeout_minutes * 60

    print(f"Polling status for Feature Group: {feature_group.name}...")

    while True:
        try:
            # Use the describe() method to get the current status from AWS
            description = feature_group.describe()
            status = description.get('FeatureGroupStatus')
        except Exception as e:
            # Handle transient connection issues or other AWS errors - definitely needed help with this
            print(f"Error describing Feature Group: {e}. Retrying...")
            status = 'Creating'

        if status == 'Created':
            print(f"✅ Feature Group Status: {status} (Ready to ingest data!)")
            break
        elif status in ('CreateFailed', 'Deleting', 'DeleteFailed'):
            print(f"❌ Feature Group Status: {status}.")
            print(f"Failure reason: {description.get('FailureReason')}")
            raise Exception(f"Feature Group creation failed with status: {status}")

        elapsed_time = time.time() - start_time
        if elapsed_time > timeout_seconds:
            print(f"❌ Timed out after {timeout_minutes} minutes. Current Status: {status}.")
            print("The resource might still be provisioning; please check the SageMaker Feature Store console.")
            break

        print(f"Status: {status}... Waiting (Elapsed time: {int(elapsed_time)}s)")
        time.sleep(30) # Wait 30 seconds before checking again

# Execute polling
try:
    wait_for_feature_group_creation(neighborhood_feature_group)
except Exception as e:
    print(e)


Polling status for Feature Group: compliant-neighborhood-fg-20250926085159...
✅ Feature Group Status: Created (Ready to ingest data!)
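
Polling logic like the function above is awkward to test because every check calls AWS and every retry sleeps for 30 seconds. A variant that takes the status source, the sleep function, and the clock as parameters can be exercised offline. This is a sketch under that assumption: `poll_until_created` and its parameters are hypothetical names for this example, and the terminal statuses are the same ones the notebook's function checks.

```python
import time
from typing import Callable

def poll_until_created(get_status: Callable[[], str],
                       timeout_s: float = 900.0,
                       interval_s: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> str:
    """Hypothetical helper: poll get_status() until 'Created', a failure status, or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "Created":
            return status
        if status in ("CreateFailed", "Deleting", "DeleteFailed"):
            raise RuntimeError(f"Feature Group ended in status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_s} seconds")
        sleep(interval_s)

# Dry run: a scripted status sequence replaces AWS, and sleeps are recorded instead of executed
statuses = iter(["Creating", "Creating", "Created"])
waits = []
result = poll_until_created(lambda: next(statuses), sleep=waits.append)
print(result, waits)
```

In the notebook, the status source could be passed as `lambda: neighborhood_feature_group.describe().get('FeatureGroupStatus')`, leaving the other parameters at their defaults.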


# Once the polling function says the feature group has been successfully created, we can proceed to ingest the data into the feature group

In [None]:
# Let's ingest the data into the feature group
# Step 4

# Duplicative to enable debugging of the cell independently
RECORD_IDENTIFIER_FEATURE_NAME = 'primary_key'
EVENT_TIME_FEATURE_NAME = 'event_time'


print("\nIngesting data from neighborhood_df into the Feature Group...")

# Ingest the DataFrame into the Feature Group.
ingestion_manager = neighborhood_feature_group.ingest(
    data_frame=neighborhood_df,
    max_workers=3,
    wait=True
)

# Check ingestion status
try:
    # Attempt this check first (it seems to vary between SDK versions)
    failed_rows = ingestion_manager.get_failed_rows()
    if failed_rows:
        print(f"\n❌ Data Ingestion FAILED. Failed Rows Count: {len(failed_rows)}. First failed row:")
        print(failed_rows[:1])
    else:
        print("\n✅ Data Ingestion Status: Completed")
        print("All records ingested successfully into the Online and Offline Stores!")

except AttributeError:
    # Try older SDK versions where get_failed_rows() is missing
    print("\nAttempting alternative status check...")
    try:
        if ingestion_manager._async_result.successful():
            print("✅ Data Ingestion Status: Completed (No exception raised during wait=True).")
            print("Assuming all records were ingested successfully.")
        else:
            print("❌ Data Ingestion FAILED (Internal error detected). Check CloudWatch logs.")
    except AttributeError:
        # Final try, rely on the 'wait=True' completion
        print("✅ Data Ingestion Status: Completed (Wait finished without error).")
        print("Assuming all records were ingested successfully.")


# Verification step
print("\nVerifying a record using the Online Store (GetRecord API):")
try:
    sagemaker_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)

    # Get the value of the first record identifier from the DataFrame
    record_id_value = str(neighborhood_df.iloc[0][RECORD_IDENTIFIER_FEATURE_NAME])

    response = sagemaker_runtime_client.get_record(
        FeatureGroupName=neighborhood_feature_group.name,
        RecordIdentifierValueAsString=record_id_value
    )

    if response.get('Record'):
        print(f"✅ Record for Key '{record_id_value}' retrieved successfully.")
        print("Data ingestion confirmed!")
    else:
         print("❌ Verification FAILED. Record not found in the Online Store.")

except Exception as e:
    print(f"Verification FAILED. Could not retrieve record. Details: {e}")


Ingesting data from neighborhood_df into the Feature Group...

Attempting alternative status check...
✅ Data Ingestion Status: Completed (No exception raised during wait=True).
Assuming all records were ingested successfully.

Verifying a record using the Online Store (GetRecord API):
✅ Record for Key '28 Palms' retrieved successfully.
Data ingestion confirmed!
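
The `Record` field used in the verification above is a list of `FeatureName` / `ValueAsString` pairs, so every value comes back as text. A small helper can flatten it into a dictionary and convert numeric values. `record_to_dict` and its `text_features` parameter are hypothetical names for this sketch; the sample list reuses values shown for `28 Palms` in the Step 2 output rather than a live response.

```python
def record_to_dict(record, text_features=("primary_key",)):
    """Hypothetical helper: flatten a GetRecord 'Record' list into {name: value}."""
    parsed = {}
    for item in record:
        name, text = item["FeatureName"], item["ValueAsString"]
        if name in text_features:
            parsed[name] = text              # identifiers stay as strings
            continue
        try:
            parsed[name] = float(text)       # numeric features arrive as strings
        except ValueError:
            parsed[name] = text              # fall back to the raw text
    return parsed

# Sample shaped like the 'Record' list; values taken from the DataFrame head shown earlier
sample = [
    {"FeatureName": "primary_key", "ValueAsString": "28 Palms"},
    {"FeatureName": "total_households", "ValueAsString": "923"},
    {"FeatureName": "ocean_lt_1h", "ValueAsString": "1.0"},
]
features = record_to_dict(sample)
print(features)
```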


# With the feature store built and now loaded with data, we should be able to run the queries called out in the assignment.



In [1]:
# Query 1
# Query Brooktree

# These are each duplicative to allow for debugging separately

import time

sagemaker_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)
FEATURE_GROUP_NAME = neighborhood_feature_group.name
RECORD_IDENTIFIER_FEATURE_NAME = 'primary_key'
query_key = "Brooktree"

print(f"--- Query 1: Get Real-Time Record for primary_key = '{query_key}' ---")

response = sagemaker_runtime_client.get_record(
    FeatureGroupName=FEATURE_GROUP_NAME,
    RecordIdentifierValueAsString=query_key
)

record = response.get('Record')
if record:
    print("\n✅ Query Successful! Retrieved Features:")
    for feature in record:
        print(f"  - {feature['FeatureName']}: {feature['ValueAsString']}")
else:
    print(f"❌ Query Failed: Record for '{query_key}' not found. (Check if the key was present in the aggregated data.)")

NameError: name 'sagemaker_session' is not defined

In [None]:
# Query 2
# Query Fisherman's Wharf

import time

sagemaker_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)
FEATURE_GROUP_NAME = neighborhood_feature_group.name
RECORD_IDENTIFIER_FEATURE_NAME = 'primary_key'
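# Note: this key uses a typographic apostrophe (’). The lookup is an exact match,
# so it will miss if the stored key uses a straight apostrophe (') instead.
query_key = "Fisherman’s Wharf"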

print(f"--- Query 2: Get Real-Time Record for primary_key = '{query_key}' ---")

# Using neighborhood_feature_group.name
response = sagemaker_runtime_client.get_record(
    FeatureGroupName=neighborhood_feature_group.name,
    RecordIdentifierValueAsString=query_key
)

record = response.get('Record')
if record:
    print("\n✅ Query Successful! Retrieved Features:")
    for feature in record:
        print(f"  - {feature['FeatureName']}: {feature['ValueAsString']}")
else:
    print(f"❌ Query Failed: Record for '{query_key}' not found. (Check if the key was present in the aggregated data.)")

--- Query 2: Get Real-Time Record for primary_key = 'Fisherman’s Wharf' ---
❌ Query Failed: Record for 'Fisherman’s Wharf' not found. (Check if the key was present in the aggregated data.)
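
One possible cause of a miss like the one above is a character mismatch in the key: the query string uses a typographic apostrophe, and the stored neighborhood name may use a straight one. This has not been verified against the data, so treat it as an assumption. The sketch below shows a hypothetical `normalize_key` helper that maps typographic quotes to their ASCII equivalents before a lookup.

```python
def normalize_key(key: str) -> str:
    """Hypothetical helper: replace typographic quotes with ASCII ones and trim whitespace."""
    replacements = {
        "\u2019": "'",   # right single quotation mark
        "\u2018": "'",   # left single quotation mark
        "\u201c": '"',   # left double quotation mark
        "\u201d": '"',   # right double quotation mark
    }
    for fancy, plain in replacements.items():
        key = key.replace(fancy, plain)
    return key.strip()

print(normalize_key("Fisherman\u2019s Wharf"))
```

If the normalized key still returns no record, the neighborhood may simply be absent from the aggregated data, for example because no housing block matched its rounded coordinates during the join.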


In [None]:
# Query 3
# Query Los Osos

import time

sagemaker_runtime_client = sagemaker_session.boto_session.client('sagemaker-featurestore-runtime', region_name=region)
FEATURE_GROUP_NAME = neighborhood_feature_group.name
RECORD_IDENTIFIER_FEATURE_NAME = 'primary_key'
query_key = "Los Osos"

print(f"--- Query 3: Get Real-Time Record for primary_key = '{query_key}' ---")

response = sagemaker_runtime_client.get_record(
    FeatureGroupName=neighborhood_feature_group.name,
    RecordIdentifierValueAsString=query_key
)

record = response.get('Record')
if record:
    print("\n✅ Query Successful! Retrieved Features:")
    for feature in record:
        print(f"  - {feature['FeatureName']}: {feature['ValueAsString']}")
else:
    print(f"❌ Query Failed: Record for '{query_key}' not found. (Check if the key was present in the aggregated data.)")


--- Query 3: Get Real-Time Record for primary_key = 'Los Osos' ---

✅ Query Successful! Retrieved Features:
  - primary_key: Los Osos
  - median_house_value: 221612.5
  - median_house_age: 11.25
  - total_households: 612
  - bedrooms_per_household: 1.0478845404823531
  - ocean_lt_1h: 0.0
  - inland: 0.0
  - island: 0.0
  - ocean_near_bay: 0.0
  - ocean_near_ocean: 1.0
  - event_time: 1758876561.0
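
The `event_time` value in the record above is a Unix timestamp in seconds, which is hard to read at a glance. A minimal sketch for converting it to a UTC timestamp string for inspection follows; `epoch_to_iso` is a hypothetical helper name, and only the Python standard library is used.

```python
from datetime import datetime, timezone

def epoch_to_iso(epoch_seconds: float) -> str:
    """Hypothetical helper: format Unix epoch seconds as an ISO-8601 UTC timestamp."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")

# The event_time returned for 'Los Osos'
print(epoch_to_iso(1758876561.0))  # 2025-09-26T08:49:21Z
```

The result falls a few minutes before the timestamp embedded in the feature group name (`20250926085159`), which is consistent with the event time being set in Step 2 and the group being created afterwards.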
