# **Great Expectations**

---

## **Introduction**


---

Name: Darly Guntur Darris Purba

Batch: RMT-031

---

## **Objective**


---

Validate the *cleaned* data and confirm its accuracy with *Great Expectations*, using 7 expectations.

---

## ***Install & Import Library***


---

In [1]:
# Install the library

!pip install -q great-expectations

In [2]:
import pandas as pd

---

## ***Data Context***


---

In [3]:
from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


***Insight:***

The file data context stores all configuration and validation results in the specified directory.

---


## ***Data Loading***


---

In [5]:
df = pd.read_csv('/content/drive/MyDrive/M3/P2M3_Darly_Purba_df_clean.csv')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129487 entries, 0 to 129486
Data columns (total 24 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   satisfaction                       129487 non-null  object 
 1   gender                             129487 non-null  object 
 2   customer_type                      129487 non-null  object 
 3   age                                129487 non-null  int64  
 4   type_of_travel                     129487 non-null  object 
 5   class                              129487 non-null  object 
 6   flight_distance                    129487 non-null  int64  
 7   seat_comfort                       129487 non-null  int64  
 8   departure_arrival_time_convenient  129487 non-null  int64  
 9   food_and_drink                     129487 non-null  int64  
 10  gate_location                      129487 non-null  int64  
 11  inflight_wifi_service              1294

In [7]:
# Get the unique values of the 'age' column and convert them to a list
unique_ages = df['age'].unique().tolist()

# Compute the mean of the list of unique values
mean_age = sum(unique_ages) / len(unique_ages)
mean_age

44.053333333333335

***Insight:***

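This calculation serves as a *trial and error* reference for *Great Expectations*. Note that it is the mean of the distinct `age` values (44.05), not the mean over all rows.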
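
The mean of the distinct values and the mean of the whole column can differ, because repeated values carry more weight in the latter. Below is a minimal sketch in plain Python. The `ages` list is made up for illustration and is not taken from the dataset.

```python
from statistics import mean

# Hypothetical sample: the value 20 is repeated, as common ages are in real data
ages = [20, 20, 20, 60]

# Mean over all rows: every row counts once
mean_all = mean(ages)

# Mean over distinct values: every distinct age counts once
mean_unique = mean(set(ages))

print(mean_all, mean_unique)
```

With pandas, the column-wide figure would normally come from `df['age'].mean()`, which is the quantity a column-mean expectation evaluates.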

---


## ***Connect to Datasource***


---

In [8]:
# Name the data source
datasource_name = 'Data Cleaned'
datasource = context.sources.add_pandas(datasource_name)

# Name the data asset
asset_name = 'Project M3'
path_to_data = '/content/drive/MyDrive/M3/P2M3_Darly_Purba_df_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build the batch request
batch_request = asset.build_batch_request()

***Insight:***

This cell adds a data source and a data asset through the Great Expectations library, then builds a batch request for that data.

---

## ***Expectation Suite***


---

In [9]:
# Create the expectation suite
expectation_suite_name = 'Expectation Project M3'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator that uses the expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Preview the first rows through the validator
validator.head()

| Unnamed: 0 | satisfaction | gender | customer_type | age | type_of_travel | class | flight_distance | seat_comfort | departure_arrival_time_convenient | food_and_drink | ... | ease_of_online_booking | on_board_service | leg_room_service | baggage_handling | checkin_service | cleanliness | online_boarding | departure_delay_in_minutes | arrival_delay_in_minutes | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | satisfied | Female | Loyal Customer | 65 | Personal Travel | Eco | 265 | 0 | 0 | 0 | ... | 3 | 3 | 0 | 3 | 5 | 3 | 2 | 0 | 0.0 | Eco2650 |
| 1 | satisfied | Male | Loyal Customer | 47 | Personal Travel | Business | 2464 | 0 | 0 | 0 | ... | 3 | 4 | 4 | 4 | 2 | 3 | 2 | 310 | 305.0 | Business24641 |
| 2 | satisfied | Female | Loyal Customer | 15 | Personal Travel | Eco | 2138 | 0 | 0 | 0 | ... | 2 | 3 | 3 | 4 | 4 | 4 | 2 | 0 | 0.0 | Eco21382 |
| 3 | satisfied | Female | Loyal Customer | 60 | Personal Travel | Eco | 623 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 1 | 4 | 1 | 3 | 0 | 0.0 | Eco6233 |
| 4 | satisfied | Female | Loyal Customer | 70 | Personal Travel | Eco | 354 | 0 | 0 | 0 | ... | 2 | 2 | 0 | 2 | 4 | 2 | 5 | 0 | 0.0 | Eco3544 |


***Insight:***

The expectations to be checked are:

- to be unique
- to be between min value and max value
- to be in set
- to be in type list
- column max to be between
- column mean to be between
- column values to not be Null

---


### ***to be unique***

---

In [10]:
# Expectation to be Unique
validator.expect_column_values_to_be_unique('id')

{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "id",
      "batch_id": "Data Cleaned-Project M3"
    },
    "meta": {}
  },
  "result": {
    "element_count": 129487,
    "unexpected_count": 2,
    "unexpected_percent": 0.0015445565964150843,
    "partial_unexpected_list": [
      "Eco11397628",
      "Eco11397628"
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0015445565964150843,
    "unexpected_percent_nonmissing": 0.0015445565964150843
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

***Insight:***

For `to be unique`, an `id` column was created in the DAG by combining the `class` and `flight_distance` columns. The result shows that the column still contains a duplicated value: `Eco11397628` appears twice, so the expectation fails.
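
One possible explanation for the duplicate, offered as an assumption rather than a confirmed diagnosis: the sample ids (`Eco2650` for a distance of 265 in row 0, `Business24641` for 2464 in row 1) suggest that the row index is appended to `class` and `flight_distance` without a separator. If so, two different rows can produce the same string. The sketch below uses a hypothetical helper `make_id` and hypothetical (distance, index) pairs chosen to reproduce the reported value. The actual rows involved are not known from the output.

```python
def make_id(travel_class, flight_distance, row_index, sep=""):
    """Join the three parts into one string; `sep` is a hypothetical separator."""
    return sep.join([travel_class, str(flight_distance), str(row_index)])

# Without a separator, two different (distance, index) pairs collide
a = make_id("Eco", 1139, 7628)
b = make_id("Eco", 113, 97628)
print(a, b, a == b)

# With a separator, the same pairs stay distinct
a_sep = make_id("Eco", 1139, 7628, sep="-")
b_sep = make_id("Eco", 113, 97628, sep="-")
print(a_sep, b_sep, a_sep == b_sep)
```

If this is indeed the cause, adding a delimiter when the id is built in the DAG would be one way to make the uniqueness expectation pass.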

---


### ***to be between min_value and max_value***

---

In [11]:
# Expectation to be between Min Value and Max Value
validator.expect_column_values_to_be_between(
    column='age',
    min_value=1,
    max_value=100
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "age",
      "min_value": 1,
      "max_value": 100,
      "batch_id": "Data Cleaned-Project M3"
    },
    "meta": {}
  },
  "result": {
    "element_count": 129487,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

***Insight:***

The `age` column is validated against a minimum of 1 and a maximum of 100 years. The result shows that every value falls within that range.

---


### ***to be in set***

---

In [12]:
# Expectation to be In Set
validator.expect_column_values_to_be_in_set(
    column='satisfaction',
    value_set={'satisfied', 'dissatisfied'}
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "column": "satisfaction",
      "value_set": [
        "dissatisfied",
        "satisfied"
      ],
      "batch_id": "Data Cleaned-Project M3"
    },
    "meta": {}
  },
  "result": {
    "element_count": 129487,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

***Insight:***

The `satisfaction` column is validated against the set of allowed values, **satisfied** and **dissatisfied**. The result shows that every value in `satisfaction` is one of these two.
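
Set membership is an exact string comparison, so capitalization matters. The following plain-Python sketch mirrors the idea of the check. The `values` list is hypothetical and only illustrates how a differently cased label would be flagged.

```python
# Allowed labels, written exactly as in the expectation above
allowed = {"satisfied", "dissatisfied"}

# Hypothetical column values; "Satisfied" differs only by case
values = ["satisfied", "Satisfied", "dissatisfied"]

# Collect every value that is not an exact member of the allowed set
unexpected = [v for v in values if v not in allowed]
print(unexpected)
```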

---


### ***to be in type list***

---

In [13]:
# Expectation to be in Type List
validator.expect_column_values_to_be_of_type(
    column='id',
    type_='object'
)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "id",
      "type_": "object",
      "batch_id": "Data Cleaned-Project M3"
    },
    "meta": {}
  },
  "result": {
    "observed_value": "object_"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

***Insight:***

The `id` column, created in the DAG by combining the `class` and `flight_distance` columns, is checked for its data type with `expect_column_values_to_be_of_type`. The result shows that the column has the *object* data type.

---


### ***Column Max to be Between***

---

In [14]:
# Expectation Column Max to be Between
validator.expect_column_max_to_be_between('cleanliness', 0, 5)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_max_to_be_between",
    "kwargs": {
      "column": "cleanliness",
      "min_value": 0,
      "max_value": 5,
      "batch_id": "Data Cleaned-Project M3"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 5
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

***Insight:***

`Column Max to be Between` expects the maximum of a column to lie between a minimum and a maximum value.
The `cleanliness` column is a cleanliness rating on a scale of 0-5. The result shows an observed maximum of 5, which is within that range. This expectation tests only the largest value, not every value in the column.
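
A check on the column maximum is weaker than a check on every value, such as the one `expect_column_values_to_be_between` performs. The plain-Python sketch below shows the difference on a hypothetical `ratings` list that contains one out-of-scale value.

```python
# Hypothetical ratings: one value lies below the 0-5 scale
ratings = [-1, 3, 5]

# Analogue of a max-between check: only the largest value is tested
max_ok = 0 <= max(ratings) <= 5

# Analogue of a values-between check: every value is tested
all_ok = all(0 <= r <= 5 for r in ratings)

print(max_ok, all_ok)
```

Here the maximum check passes while the per-value check fails, so the two expectations complement each other.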

---


### ***Column Mean to be Between***

---

In [15]:
# Expectation Column Mean to be Between
expected_mean = 44
tolerance = 5
validator.expect_column_mean_to_be_between(column='age', min_value=expected_mean-tolerance, max_value=expected_mean+tolerance)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "age",
      "min_value": 39,
      "max_value": 49,
      "batch_id": "Data Cleaned-Project M3"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 39.42876118838185
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

***Insight:***

`Column Mean to be Between` expects the mean of a column to fall within a given range.
The `age` column holds the age of the service users, for which a mean was calculated earlier. The observed mean, 39.43, falls inside the expected window of 44 ± 5 (39 to 49), so the expectation passes. It does not match the manual figure of 44.05, because that figure averaged only the distinct ages rather than all rows.
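
The tolerance window from the cell above can be reproduced in plain Python. The numbers are the ones shown in the notebook; the variable names `lower`, `upper`, `passes`, and `passes_tighter` are introduced here for illustration only.

```python
expected_mean = 44
tolerance = 5
observed_mean = 39.42876118838185  # observed_value reported by the validator

# Bounds passed to the expectation as min_value and max_value
lower = expected_mean - tolerance
upper = expected_mean + tolerance

# The expectation succeeds when the observed mean lies inside the bounds
passes = lower <= observed_mean <= upper

# The same check with a tolerance of 4 instead of 5
passes_tighter = (expected_mean - 4) <= observed_mean <= (expected_mean + 4)

print(lower, upper, passes, passes_tighter)
```

The observed mean sits less than half a year above the lower bound, so the result depends heavily on the chosen tolerance.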

---


### ***Column Values to not be Null***

---

In [16]:
# Expectation Column Values to not be Null
validator.expect_column_values_to_not_be_null(
    column="arrival_delay_in_minutes",
    mostly=1
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "arrival_delay_in_minutes",
      "mostly": 1,
      "batch_id": "Data Cleaned-Project M3"
    },
    "meta": {}
  },
  "result": {
    "element_count": 129487,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

***Insight:***

`Column Values to not be Null` expects a column to contain no *missing values*. Missing values were already *handled* in the DAG, so this check confirms that the data is clean. Setting `mostly` to 1 requires 100% of the values to be non-null.
The `arrival_delay_in_minutes` column was chosen as an example for this check. The result shows that the column has no *missing values*.
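
The effect of `mostly` can be pictured as a threshold on the share of rows that meet the expectation. The sketch below is a simplified plain-Python model, not the library's implementation. The helper `passes_not_null` and the `delays` list are hypothetical, and `None` stands in for a missing value.

```python
def passes_not_null(values, mostly=1.0):
    """Return True when the share of non-null values is at least `mostly`."""
    non_null = sum(v is not None for v in values)
    return non_null / len(values) >= mostly

# Hypothetical column with one missing value out of four
delays = [0.0, 305.0, None, 12.0]

strict = passes_not_null(delays, mostly=1.0)     # any null fails the check
tolerant = passes_not_null(delays, mostly=0.75)  # 75% non-null is enough

print(strict, tolerant)
```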

---


## ***Checkpoint***


---

In [23]:
# Save the expectations to the expectation suite
validator.save_expectation_suite(discard_failed_expectations=False)

In [24]:
# Create the checkpoint
checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

In [25]:
# Run the checkpoint
checkpoint_result = checkpoint_1.run()

***Insight:***

These cells save the expectation suite to Great Expectations, create a checkpoint for running the validation, and then run that checkpoint.

---

## ***Data Docs***

---

In [26]:
# Build the data docs
context.build_data_docs()

{'local_site': 'file:///content/gx/uncommitted/data_docs/local_site/index.html'}

***Insight:***

`context.build_data_docs()` builds the data documentation with Great Expectations. Data Docs are HTML pages that give a comprehensive report on the expectations and the validation results.

---