# Data Validation with GX

At this stage, we start using *Great Expectations (GX)* to validate the data. The goal is to ensure that the dataset `P2M3_Rajib_Kurniawan_data_clean.csv` meets the expected data quality criteria.  
The steps are:
- Create an *Expectation Suite*
- Create a *Validator*
- Run a series of *Expectations* to validate the dataset

# Import Libraries

In [None]:
# Install the library (uncomment the line below to run it)

# !pip install -q "great-expectations==0.18.19"

In [2]:
import pandas as pd
import great_expectations as gx
from great_expectations.data_context import FileDataContext

# Data Loading

In [3]:
df = pd.read_csv("P2M3_Rajib_Kurniawan_data_clean.csv")
df.head()

Unnamed: 0,unnamed_0,customer_id,age,gender,item_purchased,category,purchase_amount,location,size,color,season,review_rating,subsciption_status,shipping_type,discount_applied,promo_code_used,previous_purchases,payment_method,frequency_of_purchases,column1
0,0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly,0
1,1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly,1
2,2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly,2
3,3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly,3
4,4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually,4


# Initialize the GX Context

In [4]:
context = gx.get_context()

# Datasource for the CSV

In [5]:
# Datasource for the CSV
datasource_name = "csv-data-rajib"
datasource = context.sources.add_pandas(datasource_name)

In [6]:
# Data asset for the dataset
asset_name = "p2m3_rajib_kurniawan_data_clean"
asset = datasource.add_dataframe_asset(asset_name, dataframe=df)

In [7]:
# Build the batch request
batch_request = asset.build_batch_request()


print(f"Datasource name: {datasource_name}")
print(f"Asset name: {asset_name}")


Datasource name: csv-data-rajib
Asset name: p2m3_rajib_kurniawan_data_clean


# Great Expectations

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   unnamed_0               3900 non-null   int64  
 1   customer_id             3900 non-null   int64  
 2   age                     3900 non-null   int64  
 3   gender                  3900 non-null   object 
 4   item_purchased          3900 non-null   object 
 5   category                3900 non-null   object 
 6   purchase_amount         3900 non-null   int64  
 7   location                3900 non-null   object 
 8   size                    3900 non-null   object 
 9   color                   3900 non-null   object 
 10  season                  3900 non-null   object 
 11  review_rating           3900 non-null   float64
 12  subsciption_status      3900 non-null   object 
 13  shipping_type           3900 non-null   object 
 14  discount_applied        3900 non-null   

In [9]:
# Create the Expectation Suite and the Validator
suite_name = "rajib_data_suite"
suite = context.add_expectation_suite(expectation_suite_name=suite_name)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=suite_name)

# Check the validator
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,unnamed_0,customer_id,age,gender,item_purchased,category,purchase_amount,location,size,color,season,review_rating,subsciption_status,shipping_type,discount_applied,promo_code_used,previous_purchases,payment_method,frequency_of_purchases,column1
0,0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly,0
1,1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly,1
2,2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly,2
3,3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly,3
4,4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually,4


In [10]:
# 1. Expectation: Unique Values (`expect_column_values_to_be_unique`)
# Ensure the `customer_id` column contains no duplicate values.


validator.expect_column_values_to_be_unique("customer_id")


Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This column was chosen because it is the unique identifier of each customer; a duplicate would indicate an input error or a double recording.
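As a cross-check outside GX, the same uniqueness rule can be sketched in plain Python. The helper name `find_duplicates` and the sample IDs are illustrative assumptions; they are not part of the notebook or of the GX API.

```python
from collections import Counter


def find_duplicates(values):
    """Return each value that occurs more than once, mapped to its count."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical sample: customer 3 was recorded twice.
customer_ids = [1, 2, 3, 3, 4]
duplicates = find_duplicates(customer_ids)

print(duplicates)      # {3: 2}
print(not duplicates)  # False, so this sample would fail a uniqueness rule
```

An empty dictionary means every value is unique, which corresponds to the `"unexpected_count": 0` reported above.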

In [11]:
# 2. Expectation: Values Within a Range (`expect_column_values_to_be_between`)
# Ensure the `age` column falls within a realistic range (10–100 years).

validator.expect_column_values_to_be_between("age", min_value=10, max_value=100)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This validation guards against extreme values or implausible age entries (for example, 0 or 200 years).
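To make the fields in the result above easier to read, here is a minimal sketch that computes similarly named figures for a range rule in plain Python. The function `summarize_range_check` and the sample ages are assumptions made for illustration; this is not how GX computes its metrics internally.

```python
def summarize_range_check(values, min_value, max_value):
    """Count values outside the inclusive range [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    total = len(values)
    return {
        "element_count": total,
        "unexpected_count": len(unexpected),
        # Guard against division by zero on an empty column.
        "unexpected_percent": 100.0 * len(unexpected) / total if total else 0.0,
        "partial_unexpected_list": unexpected[:20],
    }


# Hypothetical ages: 0 and 200 are the kind of entry the rule is meant to catch.
ages = [55, 19, 0, 200, 45]
report = summarize_range_check(ages, min_value=10, max_value=100)

print(report["unexpected_count"])         # 2
print(report["unexpected_percent"])       # 40.0
print(report["partial_unexpected_list"])  # [0, 200]
```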


In [12]:
# 3. Expectation: Values in a Given Set (`expect_column_values_to_be_in_set`)
# Ensure the `gender` column contains only `Male` or `Female`.

validator.expect_column_values_to_be_in_set("gender", ["Male", "Female"])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The goal is to keep the data consistent, with no variant spellings such as "male ", "femle", or "M".
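The variant spellings mentioned above fail because set membership is an exact match: case and trailing whitespace both count. A small sketch of that idea, using a hypothetical helper `unexpected_values` and made-up sample data:

```python
ALLOWED_GENDERS = {"Male", "Female"}


def unexpected_values(values, allowed):
    """Return the distinct values that are not in the allowed set, sorted."""
    return sorted(set(values) - set(allowed))


# Hypothetical column containing the variant spellings named in the text.
genders = ["Male", "Female", "male ", "femle", "M", "Male"]

print(unexpected_values(genders, ALLOWED_GENDERS))  # ['M', 'femle', 'male ']
```

Listing the distinct offenders first is usually enough to decide whether a cleaning step (trimming whitespace, fixing case, mapping abbreviations) is needed before validation.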

In [13]:
# 4. Expectation: Data Type Validation (`expect_column_values_to_be_in_type_list`)
# Ensure `purchase_amount` is numeric (`int` or `float`).

validator.expect_column_values_to_be_in_type_list("purchase_amount", ["int", "float", "int64", "float64"])

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This ensures the data can be used in arithmetic such as the average or total purchase amount.

In [14]:
# 5. Expectation: Column Must Exist (`expect_column_to_exist`)
# Ensure the `payment_method` column is present in the dataset.

validator.expect_column_to_exist("payment_method")

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

{
  "success": true,
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

If this column were missing, analysis by payment method would not be possible.

In [15]:
# 6. Expectation: Mean Within a Reasonable Range (`expect_column_mean_to_be_between`)
# Ensure the mean of `purchase_amount` falls within a reasonable range (20–500 USD).

validator.expect_column_mean_to_be_between("purchase_amount", min_value=20, max_value=500)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 59.76435897435898
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

A mean that is too low or too high can indicate invalid data.
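The sketch below illustrates why a bound on the mean can surface invalid data: a single corrupted value is enough to push the mean out of range. The five amounts are taken from the dataset preview shown earlier; the value `99999` and the helper `mean_in_range` are hypothetical additions for illustration.

```python
from statistics import fmean


def mean_in_range(values, min_value, max_value):
    """Return (is_within_bounds, mean) using inclusive bounds."""
    mean = fmean(values)
    return min_value <= mean <= max_value, mean


# purchase_amount values from the first five rows of the preview.
amounts = [53, 64, 73, 90, 49]
ok_clean, mean_clean = mean_in_range(amounts, 20, 500)
print(ok_clean, round(mean_clean, 2))          # True 65.8

# Hypothetical data-entry error appended to the same values.
corrupted = amounts + [99999]
ok_corrupted, mean_corrupted = mean_in_range(corrupted, 20, 500)
print(ok_corrupted, round(mean_corrupted, 2))  # False 16721.33
```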

In [16]:
# 7. Expectation: Reasonable Minimum Value (`expect_column_min_to_be_between`)
# Ensure the lowest purchase amount (`purchase_amount`) is not below 1 USD.

validator.expect_column_min_to_be_between("purchase_amount", min_value=1)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 20
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This confirms that no transaction has a zero or negative value, either of which could indicate an input error or a system bug.
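The call above passes only `min_value`, which leaves the upper side unbounded. A minimal sketch of that one-sided check in plain Python, with a hypothetical helper named `within`:

```python
def within(value, min_value=None, max_value=None):
    """Inclusive bounds check; a bound of None means no limit on that side."""
    if min_value is None and max_value is None:
        raise ValueError("at least one of min_value or max_value is required")
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


# purchase_amount values from the first five rows of the preview.
observed_min = min([53, 64, 73, 90, 49])

print(within(observed_min, min_value=1))  # True
print(within(0, min_value=1))             # False: a zero-value transaction
print(within(-5, min_value=1))            # False: a negative transaction
```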


In [17]:
# Save the Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=False)
print("Expectation suite saved successfully")

Expectation suite saved successfully


# Checkpoint

In [18]:
# Create a checkpoint

checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

The goal is to run all the expectations at once and store the results in a report.

In [19]:
# Run a checkpoint

checkpoint_result = checkpoint_1.run()

Calculating Metrics:   0%|          | 0/26 [00:00<?, ?it/s]
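The roll-up that a single run makes possible can be sketched on plain dictionaries. The `success` values below are copied by hand from the seven individual results shown earlier, and the labels are shorthand chosen for this example. The sketch does not read the GX `checkpoint_result` object.

```python
def summarize(results):
    """Tally pass/fail outcomes from a mapping of label -> result dict."""
    failed = [label for label, result in results.items() if not result["success"]]
    return {
        "evaluated": len(results),
        "passed": len(results) - len(failed),
        "failed": failed,
    }


# Hand-copied outcomes of the seven expectations above (labels are shorthand).
results = {
    "unique(customer_id)": {"success": True},
    "between(age, 10, 100)": {"success": True},
    "in_set(gender)": {"success": True},
    "in_type_list(purchase_amount)": {"success": True},
    "exists(payment_method)": {"success": True},
    "mean_between(purchase_amount, 20, 500)": {"success": True},
    "min_between(purchase_amount, min=1)": {"success": True},
}

print(summarize(results))  # {'evaluated': 7, 'passed': 7, 'failed': []}
```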

# Data Docs

In [None]:
# Build data docs

context.build_data_docs() 

{'local_site': 'file://C:\\Users\\kurni\\AppData\\Local\\Temp\\tmpxs5id5pl\\index.html'}

### Validation Conclusion
Based on the validation results from *Great Expectations*, all tested columns meet the defined data quality standards. No significant anomalies were found in the tested columns, so the dataset is considered **fit for use** in the subsequent analysis and visualization stage in the Kibana Dashboard.
