# Data Quality Check with Deequ (Spark + SageMaker Processing)

This section follows your template: it uses a Spark Processing Job with Deequ to check data quality, for example:

- completeness (are any values missing?)

- uniqueness of (date, store_id)

- non-negative constraints (units_sold, base_price)

- discount within the range 0–1, and so on

*Steps:*

- Create the script preprocess-deequ-pyspark.py

- Run a Spark Processing Job with PySparkProcessor, together with deequ-1.0.3-rc2.jar

**The file deequ-1.0.3-rc2.jar must be in the same folder as the notebook (or at the path you specify in submit_jars).**
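To make the four checks concrete, here is a plain-Python sketch of what each one asserts. It is not Deequ code and not the contents of preprocess-deequ-pyspark.py. The sample rows and helper names are made up for illustration. Uniqueness follows Deequ's definition: the share of rows whose key occurs exactly once.

```python
from collections import Counter

# Hypothetical sample rows; column names match the checks listed above.
rows = [
    {"date": "2025-01-01", "store_id": 1, "units_sold": 10, "base_price": 9.99, "discount": 0.10},
    {"date": "2025-01-01", "store_id": 2, "units_sold": 0, "base_price": 4.50, "discount": 0.00},
    {"date": "2025-01-01", "store_id": 2, "units_sold": 3, "base_price": 4.50, "discount": 1.20},
    {"date": "2025-01-02", "store_id": 1, "units_sold": None, "base_price": 9.99, "discount": 0.25},
]


def completeness(rows, column):
    """Fraction of rows in which `column` is not null."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, columns):
    """Fraction of rows whose key (tuple of `columns`) occurs exactly once."""
    counts = Counter(tuple(r[c] for c in columns) for r in rows)
    return sum(1 for n in counts.values() if n == 1) / len(rows)


def all_in_range(rows, column, low=None, high=None):
    """True if every non-null value of `column` lies within [low, high]."""
    values = [r[column] for r in rows if r[column] is not None]
    return all(
        (low is None or v >= low) and (high is None or v <= high) for v in values
    )


report = {
    "units_sold is complete": completeness(rows, "units_sold") == 1.0,
    "(date, store_id) is unique": uniqueness(rows, ["date", "store_id"]) == 1.0,
    "units_sold >= 0": all_in_range(rows, "units_sold", low=0),
    "base_price >= 0": all_in_range(rows, "base_price", low=0),
    "0 <= discount <= 1": all_in_range(rows, "discount", low=0, high=1),
}

for name, passed in report.items():
    print(f"{name}: {'pass' if passed else 'FAIL'}")
```

On this sample, the null units_sold, the duplicated (2025-01-01, 2) key, and the 1.20 discount each cause one check to fail.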

In [1]:
# 3.x Run Deequ data quality checks with Spark Processing Job

import boto3
import sagemaker
from sagemaker.spark.processing import PySparkProcessor
from time import gmtime, strftime

# --- Session / basic setup ---
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
bucket = sess.default_bucket()

print("Region:", region)
print("SageMaker default bucket:", bucket)



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Region: us-east-1
SageMaker default bucket: sagemaker-us-east-1-423623839320


In [3]:
# --- Input data path on S3 (the CSV we wrote during Ingestion) ---

# Try to load the variable from the previous step, if it was saved with %store.
# If it is missing, fall back to the default path: s3://<bucket>/retail-demand-forecasting/csv/
%store -r s3_private_path_csv

if "s3_private_path_csv" in globals():
    s3_input_data = s3_private_path_csv
    print("Using s3_private_path_csv from previous step:", s3_input_data)
else:
    s3_input_data = f"s3://{bucket}/retail-demand-forecasting/csv/"
    print("s3_private_path_csv not found, fallback to:", s3_input_data)

# --- Output prefix for Deequ results ---
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_prefix = f"retail-demand-deequ-{timestamp_prefix}"

# Root path under which the script creates its subfolders:
#   dataset-metrics, constraint-checks, success-metrics, constraint-suggestions
s3_output_analyze_data = f"s3://{bucket}/{output_prefix}/output"

print("Deequ output root (S3):", s3_output_analyze_data)


Using s3_private_path_csv from previous step: s3://sagemaker-us-east-1-423623839320/retail-demand-forecasting/csv/
Deequ output root (S3): s3://sagemaker-us-east-1-423623839320/retail-demand-deequ-2025-12-01-09-17-02/output
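The comment in the cell above names four subfolders the script is expected to create under the output root. A small helper like the following (hypothetical, not part of the original notebook) builds those paths in one place, so later cells do not repeat the string formatting. Whether all four folders are actually written depends on the script.

```python
# Subfolder names taken from the comment in the cell above.
DEEQU_SUBFOLDERS = (
    "dataset-metrics",
    "constraint-checks",
    "success-metrics",
    "constraint-suggestions",
)


def deequ_output_paths(output_root):
    """Map each Deequ result subfolder to its full S3 prefix under `output_root`."""
    root = output_root.rstrip("/")
    return {name: f"{root}/{name}/" for name in DEEQU_SUBFOLDERS}


# Example with a placeholder bucket name; in the notebook, pass s3_output_analyze_data.
paths = deequ_output_paths("s3://my-bucket/retail-demand-deequ-2025-12-01-09-17-02/output")
print(paths["dataset-metrics"])
```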


In [4]:

# --- Define PySparkProcessor for Deequ job ---
processor = PySparkProcessor(
    base_job_name="spark-retail-demand-analyzer",  # prefix for the ProcessingJob name
    role=role,
    framework_version="2.4",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    max_runtime_in_seconds=900,  # 15 minutes
    sagemaker_session=sess,
)

# --- Run Spark + Deequ processing job ---
# Requires:
#   - preprocess-deequ-pyspark.py   (the script that calls pydeequ)
#   - deequ-1.0.3-rc2.jar          (the Deequ JAR)
processor.run(
    submit_app="preprocess-deequ-pyspark.py",
    submit_jars=["deequ-1.0.3-rc2.jar"],
    arguments=[
        "s3_input_data", s3_input_data,
        "s3_output_analyze_data", s3_output_analyze_data,
    ],
    logs=True,   # show Spark/Deequ logs in the notebook
    wait=True    # wait until the job finishes
)



INFO:sagemaker:Creating processing-job with name spark-retail-demand-analyzer-2025-12-01-09-18-16-429


...........12-01 09:19 smspark.cli  INFO     Parsing arguments. argv: ['/usr/local/bin/smspark-submit', '--jars', '/opt/ml/processing/input/jars', '/opt/ml/processing/input/code/preprocess-deequ-pyspark.py', 's3_input_data', 's3://sagemaker-us-east-1-423623839320/retail-demand-forecasting/csv/', 's3_output_analyze_data', 's3://sagemaker-us-east-1-423623839320/retail-demand-deequ-2025-12-01-09-17-02/output']
12-01 09:19 smspark.cli  INFO     Raw spark options before processing: {'jars': '/opt/ml/processing/input/jars', 'class_': None, 'py_files': None, 'files': None, 'verbose': False}
12-01 09:19 smspark.cli  INFO     App and app arguments: ['/opt/ml/processing/input/code/preprocess-deequ-pyspark.py', 's3_input_data', 's3://sagemaker-us-east-1-423623839320/retail-demand-forecasting/csv/', 's3_output_analyze_data', 's3://sagemaker-us-east-1-423623839320/retail-demand-deequ-2025-12-01-09-17-02/output']
12-01 09:19 smspark.cli  INFO     Rendered spark options: {'jars': '/opt/ml/processing/
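The log above shows that the job receives its arguments as alternating name/value pairs without `--` prefixes. The script itself is not shown here, so the following is only a sketch of how preprocess-deequ-pyspark.py could read them. The function name `parse_key_value_args` and the sample values are assumptions.

```python
def parse_key_value_args(argv):
    """Turn ['name1', 'value1', 'name2', 'value2'] into a dict."""
    if len(argv) % 2 != 0:
        raise ValueError(f"expected name/value pairs, got {len(argv)} arguments")
    it = iter(argv)
    # zip() pulls from the same iterator twice, pairing each name with the item after it.
    return dict(zip(it, it))


# Placeholder values; inside the script this would be parse_key_value_args(sys.argv[1:]).
sample_argv = [
    "s3_input_data", "s3://my-bucket/retail-demand-forecasting/csv/",
    "s3_output_analyze_data", "s3://my-bucket/retail-demand-deequ/output",
]
args = parse_key_value_args(sample_argv)
print(args["s3_input_data"])
print(args["s3_output_analyze_data"])
```

Rejecting an odd number of arguments up front gives a clearer error than a missing key later in the job.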



In [5]:
!mkdir -p generated_deequ_report
!aws s3 cp --recursive $s3_output_analyze_data ./generated_deequ_report/


download: s3://sagemaker-us-east-1-423623839320/retail-demand-deequ-2025-12-01-09-17-02/output/dataset-metrics/_SUCCESS to generated_deequ_report/dataset-metrics/_SUCCESS
download: s3://sagemaker-us-east-1-423623839320/retail-demand-deequ-2025-12-01-09-17-02/output/dataset-metrics/part-00000-6b7c4ef0-8678-4254-828f-361a8535d4c8-c000.csv to generated_deequ_report/dataset-metrics/part-00000-6b7c4ef0-8678-4254-828f-361a8535d4c8-c000.csv
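Spark writes each result as one or more `part-*.csv` files next to a `_SUCCESS` marker, so reading a report locally means globbing for the part files. The sketch below uses only the standard library. The delimiter and the presence of a header row depend on how the script wrote the files, so both are assumptions to adjust.

```python
import csv
import glob
import os


def read_spark_csv_dir(directory, delimiter=","):
    """Read every part-*.csv file in `directory` into a list of dict rows.

    Assumes each part file starts with a header row; adjust `delimiter`
    to match how the job wrote its output.
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*.csv"))):
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f, delimiter=delimiter))
    return rows


# Returns an empty list if the folder has not been downloaded yet.
metrics = read_spark_csv_dir("generated_deequ_report/dataset-metrics")
print(f"{len(metrics)} metric rows loaded")
```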


In [10]:
!aws s3 ls $s3_output_analyze_data

                           PRE output/
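Because the prefix was passed without a trailing slash, `aws s3 ls` reports only the `output/` prefix itself rather than its contents. As an alternative from Python, the sketch below lists the objects under the prefix with boto3's `list_objects_v2` paginator. The helper names are hypothetical, and the listing call needs credentials with read access to the bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_s3_keys(uri):
    """Return every object key under `uri`, treating it as a folder-like prefix."""
    import boto3  # imported here so split_s3_uri works without boto3 installed

    bucket, prefix = split_s3_uri(uri)
    if prefix:
        # Trailing slash: list what is inside the prefix, not the prefix itself.
        prefix = prefix.rstrip("/") + "/"
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Placeholder URI; in the notebook, call list_s3_keys(s3_output_analyze_data).
print(split_s3_uri("s3://my-bucket/retail-demand-deequ/output"))
```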


In [11]:
print("\n✅ Deequ Spark Processing job completed.")
print("🔎 Deequ result root S3 prefix:", s3_output_analyze_data)

# Store this path so the results can be loaded later or from another notebook
%store s3_output_analyze_data



✅ Deequ Spark Processing job completed.
🔎 Deequ result root S3 prefix: s3://sagemaker-us-east-1-423623839320/retail-demand-deequ-2025-12-01-09-17-02/output
Stored 's3_output_analyze_data' (str)
