# **I. Import Library Packages**

In [1]:
# Import Library Packages
import pandas as pd
import great_expectations as ge
from great_expectations.data_context import FileDataContext

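These imports provide everything the notebook needs: pandas for data handling, Great Expectations (GX) for validation, and `FileDataContext` for storing the GX configuration in the project folder.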

# **II. Create Data Context**

In [None]:
# Create data context
context = FileDataContext.create(project_root_dir='./')

# **III. Create Data Source and Asset**

In [None]:
# Create data source
datasource = context.sources.add_pandas('main-datasource')

# Create data asset
path = './data/P2M3_Maulana_Yusuf_data_clean.csv'
asset = datasource.add_csv_asset('mobile-device-usage', filepath_or_buffer=path)

# Build batch request
batch_request = asset.build_batch_request()

**`Insights`**:
- **`main-datasource`**: Creates a datasource named **'main-datasource'** that uses pandas to read the data.
- **`mobile-device-usage`**: Adds a data asset named **'mobile-device-usage'** from a local CSV file. The `path` variable stores the location of the file.
- **`batch_request`**: A batch request is the configuration that tells GX which data to read from the asset (for example, when creating a validator).

# **IV. Create an Expectation Suite**

In [None]:
# Create an expectation suite
context.add_or_update_expectation_suite('gx_val_suite')

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = 'gx_val_suite'
)

# Show the output
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,user_id,device_model,operating_system,app_usage_time_(min/day),screen_on_time_(hours/day),battery_drain_(mAh/day),number_of_apps_installed,data_usage_(MB/day),age,gender,user_behavior_class
0,1,Google Pixel 5,Android,393,6.4,1872,67,1122,40,Male,4
1,2,OnePlus 9,Android,268,4.7,1331,42,944,47,Female,3
2,3,Xiaomi Mi 11,Android,154,4.0,761,32,322,42,Male,2
3,4,Google Pixel 5,Android,239,4.8,1676,56,871,20,Male,3
4,5,iPhone 12,iOS,187,4.3,1367,58,988,31,Female,3


**`Insights`**:
- **`context.add_or_update_expectation_suite('gx_val_suite')`**: Creates a new Expectation Suite called **`gx_val_suite`** to hold all of the rules (expectations) needed to validate the data.
- **`validator = context.get_validator(...)`**: Creates a validator object that applies expectations to a batch of data. **`batch_request`** is the request for a batch of data from the datasource created earlier, while **`expectation_suite_name`** associates the validator with the suite where the expectations will be stored.
- **`validator.head()`**: Displays the first 5 rows of data fetched by the validator. This is just to check if the batch has been read successfully.

# **V. Create the Validators**

## **V.I. Mandatory Expectations**

### **A. To be Unique**

In [7]:
# Create "values to be unique" validator
validator.expect_column_values_to_be_unique('user_id')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 700,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**`Insights`**:<p>
This expectation ensures that all values in the **`user_id`** column are unique, meaning there are no duplicates. The result is **`"success": true`**, so none of the 700 values is repeated.
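
As a rough illustration of what a uniqueness check looks for, the plain-Python sketch below flags every occurrence of a repeated value. It is not how GX implements the expectation; the helper name `check_unique` and the sample IDs are assumptions made for this example.

```python
from collections import Counter


def check_unique(values):
    """Summarize a uniqueness check: any value seen more than once is 'unexpected'."""
    counts = Counter(values)
    # Keep every occurrence of a repeated value, not just the later copies
    unexpected = [v for v in values if counts[v] > 1]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
    }


# Hypothetical IDs, not taken from the dataset
print(check_unique([1, 2, 3, 4, 5]))  # all distinct, so success is True
print(check_unique([1, 2, 2, 3]))     # both copies of 2 are flagged
```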

### **B. To be between Min_value and Max_value**

In [None]:
# Create "values to be between min_value and max_value" validator
validator.expect_column_values_to_be_between('screen_on_time_(hours/day)',
                                             min_value = 0, max_value = 24
)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 700,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**`Insights`**:
- The column contains no values outside the 0–24 hour range.
- **`unexpected_count`** and **`missing_count`** are both 0, so every element is present and satisfies the range.

**`Summary of the results:`**
- Total elements: **700**
- Number of unexpected elements: **0**
- Percentage of unexpected elements: **0.0%**
- Number of missing values: **0**
- Percentage of missing values: **0.0%**
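
The counts and percentages in the summary can be reproduced in spirit with a small plain-Python sketch. It assumes there are no missing values and is only an approximation of what GX reports; `check_between` is a hypothetical helper, and the value `25.0` is invented to show a failing case.

```python
def check_between(values, min_value, max_value):
    """Count the values that fall outside the inclusive range [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    element_count = len(values)
    percent = 100.0 * len(unexpected) / element_count if element_count else 0.0
    return {
        "success": len(unexpected) == 0,
        "element_count": element_count,
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
    }


# Screen-on times from the five rows printed by validator.head()
sample = [6.4, 4.7, 4.0, 4.8, 4.3]
print(check_between(sample, 0, 24))

# Adding one invented out-of-range value makes the check fail
print(check_between(sample + [25.0], 0, 24))
```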

### **C. To be in Set**

In [11]:
# Create "values_to_be_in_set" validator
validator.expect_column_values_to_be_in_set('operating_system',
                                            value_set = ['Android', 'iOS'])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 700,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**`Summary`**:<p>
The **`expect_column_values_to_be_in_set`** validation returned **`"success": true`**, which means that all values in the **`operating_system`** column belong to the specified set, **`['Android', 'iOS']`**.

**`Insights`**:<p>
- No values fall outside the designated set: every value in the **`operating_system`** column is either Android or iOS.
- **`unexpected_count`** and **`missing_count`** are both 0, so the column is complete and contains only permitted values.
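
A set-membership check can be sketched in a few lines of plain Python. This is an illustration only, not GX's implementation; `find_values_outside_set` is a hypothetical helper, and the lowercase `"android"` value is invented to show that this sketch compares strings case-sensitively.

```python
def find_values_outside_set(values, value_set):
    """Return the values that are not members of value_set."""
    allowed = set(value_set)
    return [v for v in values if v not in allowed]


# operating_system values from the five rows printed by validator.head()
sample = ["Android", "Android", "Android", "Android", "iOS"]
print(find_values_outside_set(sample, ["Android", "iOS"]))  # []

# An invented, differently cased value is reported as unexpected
print(find_values_outside_set(sample + ["android"], ["Android", "iOS"]))  # ['android']
```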

### **D. To be in Type List**

In [None]:
# Create "to_be_in_type_list" validator
validator.expect_column_values_to_be_in_type_list('app_usage_time_(min/day)',
                                                  type_list = ['int']
)

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": false,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**`Insights`**: <p>
**`expect_column_values_to_be_in_type_list`** returned **`"success": false`** with **`"observed_value": "int64"`**. The actual data type of the **`app_usage_time_(min/day)`** column is `int64`, which did not match the `'int'` entry given in `type_list`.<p>
The corrected code is **`type_list=['int64']`**, which names the column's dtype explicitly. With that change the result should be **`"success": true`**.

## **V.II. Other Expectations**

### **A. Values X To be Greater than Y**

In [14]:
# Create "values_x_to_be_greater_than_y" validator
validator.expect_column_pair_values_A_to_be_greater_than_B(column_A = 'battery_drain_(mAh/day)', 
                                                           column_B = 'screen_on_time_(hours/day)'
)

Calculating Metrics:   0%|          | 0/7 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 700,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**`Summary`**:<p>
The result of the validation **`expect_column_pair_values_A_to_be_greater_than_B`** shows that the validation was successful with the result **`success: true`**. This means that the value in the **`battery_drain_(mAh/day)`** column is always greater than the value in the **`screen_on_time_(hours/day)`** column for each relevant data pair.

**`Insights`**:
- No pair of values violates the expectation: in all 700 rows, the **`battery_drain_(mAh/day)`** value is greater than the **`screen_on_time_(hours/day)`** value.
- **`missing_count`** and **`unexpected_count`** are both 0, so the two columns are complete and valid under this rule.
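
The row-by-row comparison can be pictured with the following plain-Python sketch, which returns the positions where column A is not strictly greater than column B. It is a simplified illustration rather than GX's implementation, and `find_rows_where_a_not_greater` is a hypothetical helper.

```python
def find_rows_where_a_not_greater(column_a, column_b):
    """Return the row positions where the value in A is not strictly greater than B."""
    return [i for i, (a, b) in enumerate(zip(column_a, column_b)) if not a > b]


# Values from the five rows printed by validator.head()
battery_drain = [1872, 1331, 761, 1676, 1367]  # mAh/day
screen_on_time = [6.4, 4.7, 4.0, 4.8, 4.3]     # hours/day

print(find_rows_where_a_not_greater(battery_drain, screen_on_time))  # []
```

In these sample rows the battery-drain values are hundreds of times larger than the screen-on times, so the comparison passes by a wide margin.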

### **B. Column Mean with Range**

In [15]:
# Create "column_mean_with_range" validator
validator.expect_column_mean_to_be_between('screen_on_time_(hours/day)',
                                           min_value = 1, max_value = 10)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 5.272714285714286
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**`Summary`**:<p>
The **`expect_column_mean_to_be_between`** expectation succeeded (**`"success": true`**): the average value of the **`screen_on_time_(hours/day)`** column falls within the expected range of 1 to 10.

**`Insights`**:<p>
Observed average (**`observed_value`**): **5.27** hours per day. This is close to the midpoint of the specified range (1 to 10), so the average daily screen time is within reasonable limits. Note that the mean alone does not describe how widely individual values are spread.
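
To make the check concrete, the sketch below computes a column mean and tests it against an inclusive range. It uses only the five rows shown earlier by `validator.head()`, so its mean (roughly 4.84) differs from the full-dataset value of 5.27; `check_mean_between` is a hypothetical helper, not part of GX.

```python
from statistics import mean


def check_mean_between(values, min_value, max_value):
    """Check whether the arithmetic mean lies in the inclusive range."""
    observed = mean(values)
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
    }


# Screen-on times from the five rows printed by validator.head()
sample = [6.4, 4.7, 4.0, 4.8, 4.3]
result = check_mean_between(sample, 1, 10)
print(result)
```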

### **C. Value Lengths to be Between**

In [None]:
# Create "value_lengths_to_be_between" validator
validator.expect_column_value_lengths_to_be_between('operating_system',
                                                    min_value = 3)

Calculating Metrics:   0%|          | 0/9 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 700,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**`Summary`**:<p>
The result of **`expect_column_value_lengths_to_be_between`** shows that every value in the **`operating_system`** column has a string length (number of characters) of at least 3, as specified by **`min_value=3`**. No value fails this requirement.

**`Insights`**:<p>
- This expectation checks the number of characters in each string value in the column.
- "Android" and "iOS" are the only two distinct values in the **`operating_system`** column, and both are at least three characters long:
  - "iOS": 3 characters
  - "Android": 7 characters
- The outcome is **`"success": true`** because every value meets the minimum length.
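
The character counts above are easy to verify directly. The sketch below is a plain-Python illustration of a minimum-length check; `check_min_length` is a hypothetical helper and does not reflect how GX implements the expectation.

```python
def check_min_length(values, min_value):
    """Collect the string values that are shorter than min_value characters."""
    too_short = [v for v in values if len(v) < min_value]
    return {"success": not too_short, "unexpected_list": too_short}


# The two distinct values of operating_system
distinct_values = ["Android", "iOS"]
lengths = {v: len(v) for v in distinct_values}

print(lengths)                               # {'Android': 7, 'iOS': 3}
print(check_min_length(distinct_values, 3))  # {'success': True, 'unexpected_list': []}
```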

# **VI. Save into Expectation Suite**

In [None]:
# Save into Expectation Suite
validator.save_expectation_suite(discard_failed_expectations = False)

# **VII. Run Checkpoints and Build Docs**

In [18]:
# Create a checkpoint
checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

# Run a checkpoint
checkpoint_result = checkpoint_1.run()

# Build data docs
context.build_data_docs()

Calculating Metrics:   0%|          | 0/37 [00:00<?, ?it/s]

{'local_site': 'file://c:\\gx\\uncommitted/data_docs/local_site/index.html'}

# **VIII. Conclusion**

This notebook shows that data validation with Great Expectations (GX) can be carried out without Docker or other intricate setups. The entire procedure runs locally in a standard Python environment using a `FileDataContext`, which organizes and stores the GX configuration and metadata within the project folder. The only failure was the type check, where `int` was specified but the column's data type is `int64`. All other expectations passed.

**FYI**: I have already built the visualization in Kibana. However, when I load the data, Kibana shows 2800 hits, while the actual dataset has 700 rows. I built the visualizations on those 2800 hits, with complete explanations. To obtain the actual figures, the hit counts should be divided by 4 (2800 / 4 = 700). I will explain this in more detail in the **Milestone 3 Presentation**.