# Milestone 3

---

- **Objective**: This program validates the clean data using the Great Expectations library.

---

# Import library

In [21]:
# Import library
from great_expectations.data_context import FileDataContext

# Initialize Validation

In [22]:
# create a data context
context = FileDataContext.create(project_root_dir='./')

# Name the datasource. The name must be unique among datasources.
datasource_name = 'csv-dataclean-m3'

# Delete any existing datasource with this name so the cell can be rerun
# without a duplicate-name error.
if datasource_name in context.datasources:
    context.delete_datasource(datasource_name)
datasource = context.sources.add_pandas(datasource_name)

# Give a name to data asset
asset_name = 'asset-dataclean-m3'
path_to_data = 'P2M3_Syihabuddin_Ahmad_data_clean.csv'
asset = datasource.add_csv_asset(asset_name,filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

# Create an expectation suite
expectation_suite_name = 'expectation-suite-m3'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name
)

# Check the validator
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,player_id,age,gender,location,game_genre,play_time_hours,in_game_purchases,game_difficulty,sessions_per_week,avg_session_duration_minutes,player_level,achievements_unlocked,engagement_level
0,9000,43,Male,Other,Strategy,16.27,No,Medium,6,108,79,25,Medium
1,9001,29,Female,USA,Strategy,5.53,No,Medium,5,144,11,10,Medium
2,9002,22,Female,USA,Sports,8.22,No,Easy,16,142,35,41,High
3,9003,35,Male,USA,Action,5.27,Yes,Easy,9,85,57,47,Medium
4,9004,33,Male,Europe,Action,15.53,No,Medium,2,131,95,37,Medium


# Validating

## 1. Column player_id must be unique

In [23]:
# validate column player_id
validator.expect_column_values_to_be_unique('player_id')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 40034,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This check ensures that no player appears more than once in the data. The validation returns `"success": true`, with an `unexpected_count` of 0 across 40,034 rows.
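To make the fields in the result above concrete, here is a minimal plain-Python sketch of a uniqueness check. It is not how Great Expectations implements the expectation; the function name `uniqueness_report` and the sample IDs are hypothetical, and it assumes every occurrence of a repeated value counts as unexpected.

```python
from collections import Counter


def uniqueness_report(values):
    """Summarise duplicates in a list of values (illustrative only)."""
    counts = Counter(values)
    # Assumption: every occurrence of a repeated value counts as unexpected.
    unexpected = [v for v in values if counts[v] > 1]
    total = len(values)
    return {
        "success": not unexpected,
        "element_count": total,
        "unexpected_count": len(unexpected),
        "unexpected_percent": 100.0 * len(unexpected) / total if total else 0.0,
    }


# Hypothetical sample IDs, not the full dataset.
print(uniqueness_report([9000, 9001, 9002]))
print(uniqueness_report([9000, 9001, 9001, 9002]))
```

The second call reports a failure because `9001` appears twice.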

## 2. Column play_time_hours must be between the min and max values

In [24]:
# Validate column play_time_hours
validator.expect_column_values_to_be_between('play_time_hours',0,24)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 40034,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Values in this column must lie between 0 and 24, because a day has only 24 hours.
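The range rule can be sketched in plain Python as follows. This is only an illustration, assuming both bounds are inclusive (so exactly 0 and exactly 24 pass); the helper `out_of_range` is hypothetical, and apart from the first two values, which come from the preview above, the sample values are made up.

```python
def out_of_range(values, min_value, max_value):
    """Return the values outside the closed interval [min_value, max_value]."""
    return [v for v in values if not (min_value <= v <= max_value)]


# 16.27 and 5.53 appear in the data preview; the rest are hypothetical edge cases.
sample_hours = [16.27, 5.53, 0, 24, 24.5, -1]
print(out_of_range(sample_hours, 0, 24))
```

Only `24.5` and `-1` are flagged, because the boundary values themselves are accepted.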

## 3. Column in_game_purchases must be in set

In [25]:
# Validate
validator.expect_column_values_to_be_in_set('in_game_purchases',['Yes','No'])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 40034,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This column records whether a player makes in-game purchases, so the only expected values are `Yes` and `No`.
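A set-membership rule like this one is an exact match, so differences in letter case or stray placeholders would count as unexpected. The sketch below illustrates that in plain Python; `not_in_set` is a hypothetical helper, and the values `yes` and `N/A` are invented for the example rather than found in the data.

```python
def not_in_set(values, allowed):
    """Return the values that are not members of the allowed set."""
    allowed = set(allowed)
    return [v for v in values if v not in allowed]


# "yes" and "N/A" are hypothetical bad values used only for illustration.
print(not_in_set(["Yes", "No", "yes", "N/A"], ["Yes", "No"]))
```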

## 4. Column age must be of integer type

In [26]:
# Validate
validator.expect_column_values_to_be_in_type_list('age',['int64'])

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Age is normally written as a whole number, not as text or a decimal: a player is recorded as 22 years old, not 22.3.

## 5. Column avg_session_duration_minutes should not be null

In [27]:
# Validate
validator.expect_column_values_to_not_be_null('avg_session_duration_minutes')

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 40034,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Because this column is one of the targets of the analysis, null values are not allowed in the clean data.
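As a rough illustration of what counts as null, the sketch below treats both `None` and floating-point NaN as missing, which is how pandas usually represents missing numeric values. The helper `count_nulls` is hypothetical, and the missing entries in the sample are invented.

```python
import math


def count_nulls(values):
    """Count entries that are None or a floating-point NaN."""
    return sum(
        1
        for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )


# 108 and 144 appear in the data preview; the missing entries are hypothetical.
print(count_nulls([108, None, float("nan"), 144]))
```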

## 6. Column count is as expected

In [28]:
# Validate
validator.expect_table_column_count_to_equal(13)

Calculating Metrics:   0%|          | 0/3 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 13
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This check ensures that no additional column appears unexpectedly.

## 7. Row count is as expected

In [29]:
# Validate
validator.expect_table_row_count_to_equal(40034)

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 40034
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The row count matches the total in the clean data, which means no additional rows have appeared unexpectedly.
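The two table-level checks (column count and row count) can be sketched with the standard `csv` module. The sample below uses the 13 named columns and the first two rows from the data preview, assuming the leading unnamed column in that preview is the DataFrame index and not a CSV column; `table_shape` is a hypothetical helper.

```python
import csv
import io

# Header and first two rows as shown in the preview, without the index column.
SAMPLE_CSV = """player_id,age,gender,location,game_genre,play_time_hours,in_game_purchases,game_difficulty,sessions_per_week,avg_session_duration_minutes,player_level,achievements_unlocked,engagement_level
9000,43,Male,Other,Strategy,16.27,No,Medium,6,108,79,25,Medium
9001,29,Female,USA,Strategy,5.53,No,Medium,5,144,11,10,Medium
"""


def table_shape(text):
    """Return (row_count, column_count) for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return len(data), len(header)


row_count, column_count = table_shape(SAMPLE_CSV)
print(row_count, column_count)
```

On the full file, the same idea corresponds to the expected values of 40034 rows and 13 columns used in the validations.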

# Save Expectation Suite

In [30]:
# Save into expectation suite
validator.save_expectation_suite(discard_failed_expectations=False)

By default, Great Expectations discards failed expectations when saving a suite, so `discard_failed_expectations` is set to `False` to keep the failed ones as well.

# Checkpoint

In [31]:
# Create checkpoint:
checkpoint_m3 = context.add_or_update_checkpoint(
    name='checkpoint_m3',
    validator=validator,
)

# Run checkpoint
checkpoint_result = checkpoint_m3.run()

Calculating Metrics:   0%|          | 0/30 [00:00<?, ?it/s]

A checkpoint bundles the validation of a data batch with an expectation suite.

# Data Docs

In [32]:
# Build data docs
context.build_data_docs()

{'local_site': 'file://c:\\Users\\Syihabuddin Ahmad\\Desktop\\git-clone\\p2-ftds017-hck-m3-Ayslove\\dags\\gx\\uncommitted/data_docs/local_site/index.html'}

Data Docs translate the work done in the previous steps into a human-readable report.