# **FINAL PROJECT**
---

## Import Libraries

In [11]:
import great_expectations as ge
from great_expectations.data_context import get_context

## Initialize Data Context

In [12]:
# create data context
context = get_context()

## Connect to Datasource

In [13]:
# Name the datasource
datasource_name = "skincare_product"
datasource = context.sources.add_pandas(datasource_name)

# Name the data asset
asset_name = "product_data"
asset = datasource.add_csv_asset(
    name=asset_name,
    filepath_or_buffer="skincare_clean.csv" # type: ignore
)

# Build batch request
batch_request = asset.build_batch_request()

## Create Expectation Suite

In [14]:
# Create an expectation suite
expectation_suite_name = "skincare_data_suite"
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator that uses the expectation suite above
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Preview the first rows of the batch to confirm the validator loaded the data
validator.head()

| Unnamed: 0 | product_type | brand | product | rating | review_count | url | image_url | price | description | unique_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cleanser | Cetaphil | Gentle Skin Cleanser | 4.2 | 15298 | https://reviews.femaledaily.com/products/clean... | https://image.femaledaily.com/dyn/210/images/p... | 112000.0 | Cetaphil Gentle Skin Cleanser mengandung formu... | Cetaphil - Gentle Skin Cleanser |
| 1 | cleanser | Senka | Perfect Whip Facial Foam | 4.2 | 6374 | https://reviews.femaledaily.com/products/clean... | https://image.femaledaily.com/dyn/210/images/p... | 200000.0 | Shiseido Perfect Whip adalah pembersih wajah u... | Senka - Perfect Whip Facial Foam |
| 2 | cleanser | Acnes | Creamy Wash | 3.7 | 5510 | https://reviews.femaledaily.com/products/clean... | https://image.femaledaily.com/dyn/210/images/p... | 29000.0 | Acnes Creamy Wash adalah sabun pembersih wajah... | Acnes - Creamy Wash |
| 3 | cleanser | Hada Labo | Tamagohada Mild Peeling Face Wash | 4.1 | 5147 | https://reviews.femaledaily.com/products/clean... | https://image.femaledaily.com/dyn/210/images/p... | 35000.0 | Hada Labo Tamagohada Mild Peeling Face Wash ad... | Hada Labo - Tamagohada Mild Peeling Face Wash |
| 4 | cleanser | Hada Labo | Gokujyun Ultimate Moisturizing Face Wash | 4.3 | 4468 | https://reviews.femaledaily.com/products/clean... | https://image.femaledaily.com/dyn/210/images/p... | 30000.0 | Hada Labo Gokujyun Ultimate Moisturizing Face ... | Hada Labo - Gokujyun Ultimate Moisturizing Fac... |


## Expectations

### **Expectation 1 - Column Values to not be Null**

In [15]:
validator.expect_column_values_to_not_be_null('description')

{
  "success": true,
  "result": {
    "element_count": 788,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

*Information:*
- The `description` column must not contain missing values, because it is the main input to the text-similarity step. A missing value would make the similarity computation (cosine similarity or an embedding model) invalid, or raise an error during vectorization.
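
The effect of a missing description can be illustrated with a minimal sketch. It assumes a plain bag-of-words vectorizer written with the standard library; the notebook does not show the actual similarity pipeline, so `bag_of_words`, `cosine_similarity`, and the sample descriptions are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a description string into a token-count vector."""
    # A missing description (None) has no .lower() and raises AttributeError here.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptions, not taken from the dataset; the last one is missing.
descriptions = ["gentle facial cleanser", "gentle foam cleanser", None]

similarity = cosine_similarity(
    bag_of_words(descriptions[0]), bag_of_words(descriptions[1])
)

try:
    bag_of_words(descriptions[2])
    missing_failed = False
except AttributeError:
    missing_failed = True
```

Two complete descriptions yield a similarity score, while the missing one fails before any score can be computed. Validating for nulls up front catches that case early.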

### **Expectation 2 - Column Values to be Unique**

In [16]:
validator.expect_column_values_to_be_unique('unique_id')

{
  "success": true,
  "result": {
    "element_count": 788,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

*Information:*
The original dataset had no unique identifier column to distinguish one product row from another.
A unique column was therefore added during data cleaning by combining the `brand` and `product` columns (with an index appended where the combination was still duplicated).

This column serves to:
- Avoid duplicate records (some products share similar names across different brands).
- Simplify row identification during merge, join, or lookup operations in the analysis stage.
- Act as the primary reference for data-quality checks (data validation), such as confirming that there are no duplicate values.
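
The construction of such an identifier can be sketched as follows. The `brand - product` format matches the `unique_id` values in the preview above. The suffix format for repeated combinations, the helper name `build_unique_ids`, and the sample rows are assumptions, since the cleaning code is not part of this notebook.

```python
from collections import Counter


def build_unique_ids(rows):
    """Return one ID per (brand, product) row, adding a counter to repeats."""
    seen = Counter()
    ids = []
    for brand, product in rows:
        base = f"{brand} - {product}"
        seen[base] += 1
        # The first occurrence keeps the plain key; later ones get a numeric
        # suffix. The " (n)" suffix format is an assumption for illustration.
        ids.append(base if seen[base] == 1 else f"{base} ({seen[base]})")
    return ids


# Hypothetical rows; the repeated product is invented to show the suffix.
rows = [
    ("Cetaphil", "Gentle Skin Cleanser"),
    ("Acnes", "Creamy Wash"),
    ("Acnes", "Creamy Wash"),
]
ids = build_unique_ids(rows)
```

Every ID produced this way is distinct as long as no product name already ends in the same suffix, which is the property the uniqueness expectation verifies.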

### **Expectation 3 - Column Values to be Between (Min - Max Value)**

In [17]:
validator.expect_column_values_to_be_between(
    column='rating', min_value=0, max_value=5
)

{
  "success": true,
  "result": {
    "element_count": 788,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

*Information:*
- Ratings on the website range from 0 to 5, where 0 is the lowest user rating and 5 is the highest. This validation therefore confirms that every value in the `rating` column falls within that range, with no values below 0 or above 5.

### **Expectation 4 - Column Values to be Between (Min Value)**

In [18]:
validator.expect_column_values_to_be_between(
    "price",
    min_value=0
)

{
  "success": true,
  "result": {
    "element_count": 788,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

*Information:*
- The `price` column must not contain negative values, because a product's price cannot be less than zero. This expectation therefore sets only `min_value=0` and leaves the upper bound open.
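
The logic behind the two range checks (Expectations 3 and 4) can be sketched in plain Python. This is a simplified imitation, not the library's implementation. It assumes the default behavior of inclusive bounds and of skipping missing values. The helper `check_between` and the sample values are hypothetical, and the result keys only mirror a subset of the fields printed above.

```python
def check_between(values, min_value=None, max_value=None):
    """Range check with inclusive bounds; None values count as missing."""
    non_missing = [v for v in values if v is not None]
    unexpected = [
        v
        for v in non_missing
        if (min_value is not None and v < min_value)
        or (max_value is not None and v > max_value)
    ]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "missing_count": len(values) - len(non_missing),
        "unexpected_count": len(unexpected),
        "partial_unexpected_list": unexpected[:20],
    }


# Hypothetical values chosen to trigger failures; the real columns pass.
ratings = [4.2, 3.7, 0, 5, 5.5]
prices = [112000.0, 29000.0, None, -1.0]

rating_result = check_between(ratings, min_value=0, max_value=5)
price_result = check_between(prices, min_value=0)
```

Omitting `max_value` leaves the upper side unbounded, which is how the `price` check accepts any non-negative amount.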

## Save into Expectation Suite

In [19]:
# Save the expectations to the suite, keeping any that failed

validator.save_expectation_suite(discard_failed_expectations=False)

## Checkpoint

In [20]:
# Create a checkpoint

checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

In [21]:
# Run a checkpoint

checkpoint_result = checkpoint_1.run()
