# Great Expectations

# Objectives

This notebook aims to leverage Great Expectations for data validation to ensure the dataset meets quality standards before analysis. By integrating Great Expectations, we can profile the data, run checks to confirm data consistency, and generate comprehensive validation reports. This process helps maintain data integrity and prepares it for reliable downstream tasks.

# Import Libraries

In [3]:
import pandas as pd
import great_expectations as gx

The libraries have been imported successfully.

# Data Loading

In [4]:
# Read CSV file and save into a DataFrame
df = pd.read_csv('clean_data.csv')

# Show dataset
df

| Unnamed: 0 | brand | model | operating_system | connectivity | display_type | display_size_(inches) | resolution | water_resistance_(meters) | battery_life_(days) | heart_rate_monitor | gps | nfc | price_(usd) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Apple | Watch Series 7 | watchOS | Bluetooth, Wi-Fi, Cellular | Retina | 1.90 | 396 x 484 | 50 | 18 | Yes | Yes | Yes | $399 |
| 1 | Samsung | Galaxy Watch 4 | Wear OS | Bluetooth, Wi-Fi, Cellular | AMOLED | 1.40 | 450 x 450 | 50 | 40 | Yes | Yes | Yes | $249 |
| 2 | Garmin | Venu 2 | Garmin OS | Bluetooth, Wi-Fi | AMOLED | 1.30 | 416 x 416 | 50 | 11 | Yes | Yes | No | $399 |
| 3 | Fitbit | Versa 3 | Fitbit OS | Bluetooth, Wi-Fi | AMOLED | 1.58 | 336 x 336 | 50 | 6 | Yes | Yes | Yes | $229 |
| 4 | Fossil | Gen 6 | Wear OS | Bluetooth, Wi-Fi | AMOLED | 1.28 | 416 x 416 | 30 | 24 | Yes | Yes | Yes | $299 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 289 | Michael Kors | Access Bradshaw 2 | Wear OS | Bluetooth, Wi-Fi | AMOLED | 1.28 | 416 x 416 | 30 | 2 | Yes | Yes | Yes | $350 |
| 290 | Casio | G-Shock GSW-H1000 | Wear OS | Bluetooth, Wi-Fi | LCD | 1.20 | 360 x 360 | 200 | 1 | Yes | Yes | No | $699 |
| 291 | Withings | ScanWatch | Withings OS | Bluetooth, Wi-Fi | PMOLED | 1.38 | 348 x 442 | 50 | 30 | Yes | No | Yes | $279 |
| 292 | Oppo | Watch Free | ColorOS | Bluetooth, Wi-Fi | AMOLED | 1.64 | 326 x 326 | 50 | 14 | Yes | No | Yes | $159 |


In [3]:
context = gx.get_context()

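The data has been loaded into a pandas DataFrame and a Great Expectations context has been created. The cells below rely on a validator object bound to this data (the outputs report the batch ID smartwatch-smartwatch_prices); the cell that creates the validator is not shown here.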

# Expectations

## Expectation 1: to be unique

The column must contain unique values so that each entry in the dataset can be uniquely identified. This avoids duplication, maintains data integrity, and lets data operations work correctly.

In [6]:
# Get the batch data as a pandas DataFrame
batch_data = validator.active_batch.data.dataframe

# Create the new column named 'unique_id' in the DataFrame
batch_data["unique_id"] = batch_data["model"] + "_" + batch_data["operating_system"] + "_" + batch_data["connectivity"]

# Remove duplicate rows based on the 'unique_id' column
batch_data.drop_duplicates(subset=['unique_id'], inplace=True)

result_unique = validator.expect_column_values_to_be_unique("unique_id")
print(result_unique)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "unique_id",
      "batch_id": "smartwatch-smartwatch_prices"
    },
    "meta": {}
  },
  "result": {
    "element_count": 189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


The unique_id column in this batch has 189 elements, all of them unique, with no missing or unexpected values (0.0% each). Because duplicate rows were dropped just before the check, this result confirms that the deduplication worked rather than that the original data was free of duplicates.
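The composite-key idea can be sketched in plain Python, without the notebook's data or Great Expectations. The rows below are made up for illustration (only the first two resemble the preview above), and counting duplicates before removing them is a suggestion, not something the original cell does.

```python
from collections import Counter

# Hypothetical rows: (model, operating_system, connectivity)
rows = [
    ("Watch Series 7", "watchOS", "Bluetooth, Wi-Fi, Cellular"),
    ("Venu 2", "Garmin OS", "Bluetooth, Wi-Fi"),
    ("Watch Series 7", "watchOS", "Bluetooth, Wi-Fi, Cellular"),  # repeated row
]

# Build the composite key the same way the cell above does.
keys = ["_".join(row) for row in rows]

# Keys that occur more than once are what a uniqueness check would flag.
duplicates = {key: n for key, n in Counter(keys).items() if n > 1}

# Keep the first occurrence of each key, preserving the original order.
deduplicated = list(dict.fromkeys(keys))
```

Recording the size of `duplicates` before deduplicating keeps a trace of how much data was removed, which the passing expectation alone does not show.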

## Expectation 2: to be between min_value and max_value

Column values must fall between min_value and max_value so that the data stays within the expected range. This maintains data validity and consistency and helps detect erroneous entries.

In [7]:
# Get the batch data as a pandas DataFrame
batch_data = validator.active_batch.data.dataframe

# Convert the 'battery_life_(days)' column to numeric type
batch_data["battery_life_(days)"] = pd.to_numeric(batch_data["battery_life_(days)"], errors='coerce')

# Determine the actual minimum and maximum values
min_value = batch_data['battery_life_(days)'].min()
max_value = batch_data['battery_life_(days)'].max()

# Copy the numeric battery life into a column named 'expect_column_values_to_be_between'
batch_data["expect_column_values_to_be_between"] = batch_data["battery_life_(days)"]

result_battery_life = validator.expect_column_values_to_be_between(
    column="expect_column_values_to_be_between", min_value=min_value, max_value=max_value
)
print(result_battery_life)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "expect_column_values_to_be_between",
      "min_value": 1.0,
      "max_value": 70.0,
      "batch_id": "smartwatch-smartwatch_prices"
    },
    "meta": {}
  },
  "result": {
    "element_count": 189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 3,
    "missing_percent": 1.5873015873015872,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


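The expect_column_values_to_be_between column (a copy of battery_life_(days)) has 189 elements. All non-missing values fall within the range 1.0 to 70.0, so there are no unexpected values, but 3 values (about 1.59%) are missing. Because the bounds were taken from the column's own minimum and maximum, the check passes by construction; bounds chosen from domain knowledge would make it a stronger test.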
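A minimal plain-Python sketch of a range check with fixed bounds, separating missing values from out-of-range ones in the same spirit as the missing_count and unexpected_count fields above. The function name `check_between`, the sample values, and the bounds 1 and 60 are assumptions for illustration, not the notebook's settings.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def check_between(values, min_value, max_value):
    """Summarise a range check; missing values are counted but not judged."""
    present = [v for v in values if not is_missing(v)]
    unexpected = [v for v in present if not min_value <= v <= max_value]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "missing_count": len(values) - len(present),
        "unexpected_list": unexpected,
    }

# Hypothetical battery-life values in days; the bounds are assumed, not derived.
report = check_between([18, 40, 11, float("nan"), 70], min_value=1, max_value=60)
```

With bounds fixed in advance, the value 70 is reported as unexpected instead of silently defining the upper limit.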

## Expectation 3: to be in set

Column values must belong to a defined set so that the data is valid and as expected. This maintains data consistency, helps detect errors, and supports security and regulatory compliance requirements.

In [8]:
# Get the batch data as a pandas DataFrame
batch_data = validator.active_batch.data.dataframe

# Get the unique values in the 'operating_system' column
unique_os_values = batch_data['operating_system'].unique()

result_os = validator.expect_column_values_to_be_in_set(
    column="operating_system",
    value_set=unique_os_values
)
print(result_os)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "column": "operating_system",
      "value_set": [
        "watchOS",
        "Wear OS",
        "Garmin OS",
        "Fitbit OS",
        "HarmonyOS",
        "ColorOS",
        "Amazfit OS",
        "Withings OS",
        "Polar OS",
        "Tizen OS",
        "Lite OS",
        "Tizen",
        "Suunto OS",
        "Proprietary OS",
        "Proprietary",
        "LiteOS",
        "Android Wear",
        "MIUI for Watch",
        "Custom OS",
        "Fossil OS",
        "MIUI",
        "RTOS",
        "MyKronoz OS",
        "Nubia OS",
        "Mi Wear OS",
        "Zepp OS",
        "Realme OS",
        "Matrix OS",
        "Android OS",
        "Casio OS",
        "Skagen OS",
        "Timex OS",
        "MIUI For Watch",
        "Android"
      ],
      "batch_id": "smartwatch-smartwatch_prices"
    },
    "meta": {}
  },
  "result": {
    "element_c

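The operating_system column in this batch has 189 elements, all of which are in the expected set; there are no unexpected or missing values (0.0% each). Because the set was built from the column's own unique values, the check cannot fail, and spelling variants such as "Tizen" and "Tizen OS", "LiteOS" and "Lite OS", or "MIUI for Watch" and "MIUI For Watch" are all accepted as distinct valid values.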
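One way to make the set check meaningful is to normalise spelling variants and compare against an allow-list written down independently of the data. The sketch below is plain Python; the mapping `CANONICAL`, the helper `normalise_os`, and the choice of canonical spellings are hypothetical.

```python
# Assumed canonical names; which spelling is "correct" is a project decision.
CANONICAL = {
    "tizen": "Tizen OS",
    "tizen os": "Tizen OS",
    "liteos": "Lite OS",
    "lite os": "Lite OS",
    "miui for watch": "MIUI for Watch",
}

def normalise_os(name):
    """Map spelling variants to one label; unknown names pass through trimmed."""
    cleaned = name.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)

# A few labels taken from the value_set printed above.
observed = ["Tizen", "Tizen OS", "LiteOS", "Lite OS", "MIUI For Watch", "watchOS"]
normalised = [normalise_os(name) for name in observed]

# A fixed allow-list, not derived from the column being checked.
allowed = {"Tizen OS", "Lite OS", "MIUI for Watch", "watchOS"}
unexpected_raw = sorted(set(observed) - allowed)
unexpected_clean = sorted(set(normalised) - allowed)
```

Here the raw labels contain three values outside the allow-list, while the normalised labels contain none.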

## Expectation 4: to be of type

Column values must be of the expected data type so that the data conforms to the expected structure. This maintains data consistency, simplifies data manipulation and analysis, and ensures that numeric operations on the column behave correctly.

In [None]:
# Validate that all values in the "display_size_(inches)" column are of type float
result_display_type = validator.expect_column_values_to_be_of_type(
    column="display_size_(inches)",
    type_="float"
)
print(result_display_type)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "display_size_(inches)",
      "type_": "float",
      "batch_id": "smartwatch-smartwatch_prices"
    },
    "meta": {}
  },
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


The display_size_(inches) column in this batch has the float64 data type, which matches the expected float type, so the values in the column have the correct data type.
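The difference between checking for one type and checking against a list of acceptable types can be sketched in plain Python. The output above reports a single observed column type, whereas this illustration checks value by value; the helper `all_of_type` and the `mixed` list are hypothetical.

```python
def all_of_type(values, expected_types):
    """Return True when every value is an instance of one of the expected types."""
    return all(isinstance(v, expected_types) for v in values)

sizes = [1.90, 1.40, 1.30, 1.58, 1.28]  # display sizes from the preview above
mixed = [1.90, "1.40", 1.30]            # hypothetical: one value left as text

single_type_ok = all_of_type(sizes, float)       # one expected type
type_list_ok = all_of_type(sizes, (int, float))  # any type from a list
mixed_ok = all_of_type(mixed, (int, float))      # fails because of the string
```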

## Expectation 5: Column value length to be between

Column values must have a length within a given range so that the data fits the expected boundaries. This maintains consistency, helps catch data entry errors, and ensures that the data can be used correctly.

In [10]:
# Get the batch data as a pandas DataFrame
batch_data = validator.active_batch.data.dataframe

# Determine the actual minimum and maximum string lengths
min_length = batch_data['brand'].str.len().min()
max_length = batch_data['brand'].str.len().max()

result_brand_length = validator.expect_column_value_lengths_to_be_between(
    column="brand", min_value=min_length, max_value=max_length
)
print(result_brand_length)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_value_lengths_to_be_between",
    "kwargs": {
      "column": "brand",
      "min_value": 2,
      "max_value": 14,
      "batch_id": "smartwatch-smartwatch_prices"
    },
    "meta": {}
  },
  "result": {
    "element_count": 189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


The brand column in this batch has 189 elements, all with a length between 2 and 14 characters. There are no unexpected or missing values (0.0% each). As with the range check, the bounds come from the column's own shortest and longest values, so the result describes the data more than it tests it.
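A short plain-Python sketch of a length check with bounds chosen up front. The limits 2 and 20 and the one-character entry "X" are assumptions made for illustration; the other brand names come from the preview above.

```python
# Brand names from the preview, plus one hypothetical bad entry ("X").
brands = ["Apple", "Samsung", "Garmin", "Fitbit", "Fossil", "Michael Kors", "X"]

# Assumed bounds, fixed in advance rather than taken from the data.
MIN_LEN, MAX_LEN = 2, 20

# Length of each brand name, then the names that fall outside the bounds.
lengths = {brand: len(brand) for brand in brands}
out_of_bounds = [b for b, n in lengths.items() if not MIN_LEN <= n <= MAX_LEN]
```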

## Expectation 6: Column values must not be null

Values in a column must not be null so that the data is complete. This maintains data integrity, enables accurate analysis, and ensures that data operations work correctly.

In [None]:
# Validate that all values in the "gps" column are NOT null (i.e., no missing values)
result_not_null = validator.expect_column_values_to_not_be_null("gps")
print(result_not_null)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "gps",
      "batch_id": "smartwatch-smartwatch_prices"
    },
    "meta": {}
  },
  "result": {
    "element_count": 189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


The gps column in this batch has 189 elements, none of them null. There are no unexpected values (0.0%).
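A not-null check only looks for missing values, so placeholder text can slip through. The plain-Python sketch below illustrates the distinction; the sample values, the helper `is_null`, and the set `VALID` are hypothetical.

```python
import math

def is_null(value):
    """True for None and NaN, the values treated as missing in this sketch."""
    return value is None or (isinstance(value, float) and math.isnan(value))

# Hypothetical gps values; "-" and "unknown" stand in for placeholder text.
gps = ["Yes", "No", None, float("nan"), "-", "unknown"]

null_count = sum(is_null(v) for v in gps)

# Placeholder strings are not null, so they need a separate check.
VALID = {"Yes", "No"}  # assumed set of legitimate values
placeholders = [v for v in gps if not is_null(v) and v not in VALID]
```

Combining a null check with a set check of this kind covers both truly missing values and placeholders.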

## Expectation 7: Column value must match regex

Column values must match a specific regex pattern so that the data follows the expected format. This helps validate data correctness, maintain consistency, detect errors, and prevent malicious data injection.

In [12]:
# Get the batch data as a pandas DataFrame
batch_data = validator.active_batch.data.dataframe

# Remove rows with null values in the 'gps' column
batch_data.dropna(subset=['gps'], inplace=True)

result_not_null = validator.expect_column_values_to_not_be_null("gps")
print(result_not_null)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "gps",
      "batch_id": "smartwatch-smartwatch_prices"
    },
    "meta": {}
  },
  "result": {
    "element_count": 189,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


The gps column in this batch has 189 elements, none of them null, and there are no unexpected values (0.0%). Note that this cell repeats the not-null check from Expectation 6 instead of running a regex expectation, so no pattern match was actually validated here.
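For reference, a regex-style check could look like the following plain-Python sketch, which uses the standard re module rather than Great Expectations. The pattern, the choice of the price column, and the last two sample values are assumptions for illustration; the first two prices come from the preview above.

```python
import re

# Assumed format: a dollar sign followed by one or more digits, nothing else.
PRICE_PATTERN = re.compile(r"\$\d+")

# "$399" and "$249" appear in the preview; the other two are hypothetical.
prices = ["$399", "$249", "$1,199", "399 USD"]

# fullmatch requires the whole string to fit the pattern, not just a part of it.
non_matching = [p for p in prices if PRICE_PATTERN.fullmatch(p) is None]
success = not non_matching
```

Using `fullmatch` avoids accepting strings that merely contain a valid-looking fragment; with a search-style matcher, the pattern would need explicit `^` and `$` anchors to get the same effect.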