# Run SKLearnProcessor: S3 CSV → Feature Engineering → Feature Store (Offline) + Train/Val/Test splits

This notebook runs the script `preprocess-scikit-retail-feature-store_NEW.py`, which does the following:

1. Reads the raw CSV from S3 (ProcessingInput → `/opt/ml/processing/input/data/`)
2. Performs feature engineering (calendar + promo features) and creates `high_demand`
3. Splits the data into `train / validation / test` and exports each split as CSV
4. Ingests everything into **SageMaker Feature Store (Offline store)**
5. Waits for the Offline store to become **Active** and for the **Glue Data Catalog table** to appear (so the data can be queried in Athena)

> ✅ Note: this version of the script does *not* run Athena DDL itself (this avoids SQL dialect problems with the Athena engine); instead it waits for Feature Store to create the Glue table automatically
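
Step 5 amounts to a polling loop. The script's own implementation is not shown in this notebook, so the following is only a minimal sketch of what such a wait might look like, assuming boto3 `sagemaker` and `glue` clients are passed in. The function name `wait_for_offline_store` and its timeout and polling parameters are hypothetical.

```python
import time


def wait_for_offline_store(sm_client, glue_client, feature_group_name,
                           timeout_s=900, poll_s=15, sleep=time.sleep):
    """Block until the offline store is Active and its Glue table exists.

    Hypothetical helper: returns (database, table) read from the feature
    group's DataCatalogConfig, or raises on timeout / blocked store.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        desc = sm_client.describe_feature_group(FeatureGroupName=feature_group_name)
        status = desc.get("OfflineStoreStatus", {}).get("Status")
        catalog = desc.get("OfflineStoreConfig", {}).get("DataCatalogConfig")

        if status == "Blocked":
            raise RuntimeError(f"Offline store for {feature_group_name} is blocked")

        if status == "Active" and catalog:
            try:
                glue_client.get_table(
                    DatabaseName=catalog["Database"],
                    Name=catalog["TableName"],
                )
                return catalog["Database"], catalog["TableName"]
            except glue_client.exceptions.EntityNotFoundException:
                pass  # the table has not been registered in Glue yet

        if time.monotonic() >= deadline:
            raise TimeoutError(f"Offline store for {feature_group_name} not ready in time")
        sleep(poll_s)
```

From a notebook this could be called as `wait_for_offline_store(boto3.client('sagemaker'), boto3.client('glue'), feature_group_name)`; the returned pair is the database and table to select in Athena.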

In [1]:
import os
import boto3
import sagemaker
from time import gmtime, strftime

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

print('Region:', region)
print('Bucket:', bucket)
print('Role:', role)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Region: us-east-1
Bucket: sagemaker-us-east-1-423623839320
Role: arn:aws:iam::423623839320:role/service-role/SageMaker-ExecutionRole-20250705T232334


In [8]:
# -------------------------------
# 1) Raw data source (S3)
# -------------------------------
# 👉 Must be an S3 prefix/folder that contains .csv files (one file or several)
# Example:
# raw_input_data_s3_uri = f"s3://{bucket}/retail-demand/raw/2025-12-07/"
raw_input_data_s3_uri = f"s3://{bucket}/retail-demand-forecasting/csv/"

# -------------------------------
# 2) Processing output prefix (timestamped so earlier runs are not overwritten)
# -------------------------------
timestamp = strftime('%Y-%m-%d-%H-%M-%S', gmtime())
preprocess_output_s3_uri = f"s3://{bucket}/retail-demand/preprocess-{timestamp}/"

# -------------------------------
# 3) Split parameters
# -------------------------------
train_split_percentage = 0.90
validation_split_percentage = 0.05
test_split_percentage = 0.05

# -------------------------------
# 4) Feature Store config
# -------------------------------
# ✅ Recommended: keep the name/prefix FIXED (so daily data is appended/upserted into the same group)
feature_group_name = 'retail_product_demand_feature_group'
feature_store_offline_prefix = f"s3://{bucket}/feature-store/retail-demand/offline-store"

print('raw_input_data_s3_uri        =', raw_input_data_s3_uri)
print('preprocess_output_s3_uri     =', preprocess_output_s3_uri)
print('feature_group_name           =', feature_group_name)
print('feature_store_offline_prefix =', feature_store_offline_prefix)

raw_input_data_s3_uri        = s3://sagemaker-us-east-1-423623839320/retail-demand-forecasting/csv/
preprocess_output_s3_uri     = s3://sagemaker-us-east-1-423623839320/retail-demand/preprocess-2025-12-12-14-42-43/
feature_group_name           = retail_product_demand_feature_group
feature_store_offline_prefix = s3://sagemaker-us-east-1-423623839320/feature-store/retail-demand/offline-store


In [9]:
athena_database = "retail_demand"        # <-- change to your DB name
athena_table = "demand_product"        # <-- change to your table name

# Simple default: train on all rows
athena_query = f"SELECT * FROM {athena_database}.{athena_table}"

print("Athena database:", athena_database)
print("Athena table   :", athena_table)
print("Athena query   :", athena_query)


Athena database: retail_demand
Athena table   : demand_product
Athena query   : SELECT * FROM retail_demand.demand_product


## SKLearnProcessor

We run the preprocess script in a SageMaker Processing job, with:
- input: `raw_input_data_s3_uri` → mounted at `/opt/ml/processing/input/data/`
- output: train/validation/test → uploaded to `preprocess_output_s3_uri`
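
How the script divides rows into the three outputs is not shown in this notebook. Below is one plausible reading of the three split percentages, sketched with the standard library only. The shuffled split, the fixed seed, and the name `split_rows` are assumptions; a forecasting script may well split chronologically instead.

```python
import random


def split_rows(rows, train_pct, validation_pct, test_pct, seed=42):
    """Shuffle rows and cut them into train/validation/test by percentage."""
    total = train_pct + validation_pct + test_pct
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"Split percentages must sum to 1.0, got {total}")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n = len(shuffled)
    n_train = round(n * train_pct)
    n_val = round(n * validation_pct)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        # the test split takes whatever remains, so no row is dropped
        "test": shuffled[n_train + n_val:],
    }
```

Validating that the percentages sum to 1.0 before the job starts is cheap and avoids discovering a silent data loss only after the Processing job has finished.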

In [10]:
processing_instance_type = 'ml.c5.2xlarge'
processing_instance_count = 1

processor = SKLearnProcessor(
    framework_version='0.23-1',   # same version used in the earlier processing step (Py3.7)
    role=role,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    env={'AWS_DEFAULT_REGION': region},
    max_runtime_in_seconds=7200,
)

print('Processor image:', processor.image_uri)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


Processor image: 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3


In [11]:
# Check that the script file is in the notebook's working directory
script_path = 'preprocess-scikit-retail-feature-store.py'
print('Script exists?', os.path.exists(script_path), '->', script_path)
if not os.path.exists(script_path):
    raise FileNotFoundError(
        'Cannot find preprocess-scikit-retail-feature-store.py in the current directory. '
        'Please place the script next to this notebook (same folder) before running.'
    )

Script exists? True -> preprocess-scikit-retail-feature-store.py


In [12]:
# job_name = f"retail-feature-store-{timestamp}".lower()
# job_name = job_name.replace('_', '-').replace(':', '-')
# # SageMaker job name length limit ~63 chars
# job_name = job_name[:63]

# print('Processing job name:', job_name)
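
The commented-out cell above builds a custom job name by hand. A hedged sketch of the same idea as a reusable helper follows, assuming the SageMaker constraint that a job name has at most 63 characters drawn from letters, digits, and hyphens, and starts and ends with a letter or digit. `make_job_name` is a hypothetical name, not part of the SageMaker SDK.

```python
import re

MAX_JOB_NAME_LEN = 63  # SageMaker job name length limit


def make_job_name(prefix, timestamp):
    """Build a lowercase, hyphen-only job name no longer than 63 characters."""
    raw = f"{prefix}-{timestamp}".lower()
    name = re.sub(r"[^a-z0-9-]+", "-", raw)   # replace '_', ':' and other symbols
    name = re.sub(r"-{2,}", "-", name)        # collapse runs of hyphens
    name = name[:MAX_JOB_NAME_LEN].strip("-")  # trim, then drop edge hyphens
    if not name:
        raise ValueError("Job name is empty after sanitizing")
    return name
```

For example, `make_job_name('retail_feature_store', timestamp)` could be passed as `job_name=` to `processor.run` instead of relying on the auto-generated name.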

In [14]:


processor.run(
    code='preprocess-scikit-retail-feature-store.py',
    inputs=[
        # ProcessingInput(
        #     source=raw_input_data_s3_uri,
        #     destination='/opt/ml/processing/input/data',
        # )
    ],
    outputs=[
        ProcessingOutput(
            output_name='train',
            source='/opt/ml/processing/output/retail_product/train',
            destination=preprocess_output_s3_uri + 'train/',
        ),
        ProcessingOutput(
            output_name='validation',
            source='/opt/ml/processing/output/retail_product/validation',
            destination=preprocess_output_s3_uri + 'validation/',
        ),
        ProcessingOutput(
            output_name='test',
            source='/opt/ml/processing/output/retail_product/test',
            destination=preprocess_output_s3_uri + 'test/',
        ),
    ],
    arguments=[
        '--input-data', '/opt/ml/processing/input/data/',
        '--output-data', '/opt/ml/processing/output/retail_product',
        '--train-split-percentage', str(train_split_percentage),
        '--validation-split-percentage', str(validation_split_percentage),
        '--test-split-percentage', str(test_split_percentage),
        '--feature-group-name', feature_group_name,
        '--feature-store-offline-prefix', feature_store_offline_prefix,
        # 👇 New: pass the Athena parameters through to the script
        "--athena-database", athena_database,
        "--athena-table", athena_table,
        "--athena-query", athena_query,

    ],
    # job_name=job_name,
    logs=True,
    wait=True
)

INFO:sagemaker:Creating processing-job with name sagemaker-scikit-learn-2025-12-12-14-43-13-900


.........Collecting sagemaker==2.24.1
  Downloading sagemaker-2.24.1.tar.gz (397 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 397.4/397.4 kB 30.2 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting attrs
  Downloading attrs-24.2.0-py3-none-any.whl (63 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.0/63.0 kB 10.7 MB/s eta 0:00:00
Collecting google-pasta
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.5/57.5 kB 9.3 MB/s eta 0:00:00
Collecting protobuf3-to-dict>=0.1.5
  Downloading protobuf3-to-dict-0.1.5.tar.gz (3.5 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status
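
On the script side, the `arguments` list passed to `processor.run` arrives as ordinary command-line flags. The script's actual parser is not shown here, so the following is only a sketch of how those flags could be read with `argparse`. The flag names come from the call above; the defaults, the `required` settings, and the function name `parse_args` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the notebook passes to the processing script."""
    parser = argparse.ArgumentParser(description="Retail demand preprocessing")
    # Container paths (defaults mirror the values used in the notebook)
    parser.add_argument("--input-data", default="/opt/ml/processing/input/data/")
    parser.add_argument("--output-data", default="/opt/ml/processing/output/retail_product")
    # Split fractions arrive as strings and are converted to float
    parser.add_argument("--train-split-percentage", type=float, default=0.90)
    parser.add_argument("--validation-split-percentage", type=float, default=0.05)
    parser.add_argument("--test-split-percentage", type=float, default=0.05)
    # Feature Store settings
    parser.add_argument("--feature-group-name", required=True)
    parser.add_argument("--feature-store-offline-prefix", required=True)
    # Athena settings (optional in this sketch)
    parser.add_argument("--athena-database")
    parser.add_argument("--athena-table")
    parser.add_argument("--athena-query")
    return parser.parse_args(argv)
```

`argparse` maps each flag to an attribute with underscores, so `--train-split-percentage` is read as `args.train_split_percentage`.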

In [None]:
# Get the Processing job name
processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']
print('Processing job name:', processing_job_name)

running_job = sagemaker.processing.ProcessingJob.from_processing_name(
    processing_job_name=processing_job_name,
    sagemaker_session=sess,
)

desc = running_job.describe()
outputs = {o['OutputName']: o['S3Output']['S3Uri'] for o in desc['ProcessingOutputConfig']['Outputs']}

processed_train_data_s3_uri = outputs.get('train')
processed_validation_data_s3_uri = outputs.get('validation')
processed_test_data_s3_uri = outputs.get('test')

print('Processed train S3:      ', processed_train_data_s3_uri)
print('Processed validation S3: ', processed_validation_data_s3_uri)
print('Processed test S3:       ', processed_test_data_s3_uri)

# Store these for later use in the train/evaluate/pipeline notebooks
%store processed_train_data_s3_uri
%store processed_validation_data_s3_uri
%store processed_test_data_s3_uri
%store feature_group_name
%store feature_store_offline_prefix

In [None]:
# Download the train split to inspect it
!rm -rf data
!mkdir -p data/train
!aws s3 cp $processed_train_data_s3_uri data/train/ --recursive
!ls -al data/train | head

import glob
import pandas as pd

train_files = sorted(glob.glob('data/train/*.csv'))
print('Train files:', train_files)

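# Fail with a clear message if the download produced no CSV files
if not train_files:
    raise FileNotFoundError('No CSV files were found in data/train/')

df = pd.read_csv(train_files[0])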
df.head()

## Next

- You now have train/validation/test on S3 (`processed_*_data_s3_uri`), ready to use in a TrainingStep
- The Feature Store ingest is complete, and the script waits until the Glue table is ready
  - Go to Athena → select the database/table reported in the Processing job log (`DataCatalogConfig: <db> <table>`)

Example query:

```sql
SELECT *
FROM "<database>"."<table>"
LIMIT 10;
```
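
The same query can also be issued from Python. This is a minimal sketch using a boto3 Athena client that is passed in. `run_athena_query`, `output_s3_uri` (an S3 location you choose for Athena results), and the polling parameters are hypothetical names for illustration.

```python
import time


def run_athena_query(athena_client, query, database, output_s3_uri,
                     poll_s=2, timeout_s=300, sleep=time.sleep):
    """Run a query in Athena and return the first page of rows as lists of strings."""
    query_id = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3_uri},
    )["QueryExecutionId"]

    deadline = time.monotonic() + timeout_s
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Athena query {query_id} did not finish in time")
        sleep(poll_s)

    # Only the first page of results is read, which is enough for a LIMIT 10 query.
    result = athena_client.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column names; NULLs become None.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

A call such as `run_athena_query(boto3.client('athena'), 'SELECT * FROM "<database>"."<table>" LIMIT 10', '<database>', output_s3_uri)` would return the header row followed by up to ten data rows.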