# Part 1: Great Expectations

## 1. Install Great Expectations Library


This step installs Great Expectations, an open-source framework for data quality testing. It lets you define expectations for a dataset, validate data against them, and document the results. Installing it with pip makes the library available in the notebook.

In [1]:
!pip install great_expectations



## 2. Import Necessary Libraries

In this step, we import two essential libraries:

- pandas: A powerful data manipulation library that provides data structures (like DataFrames) to handle and analyze structured data efficiently.
- great_expectations: The core library of the Great Expectations framework, imported as gx. This will be used to define expectations and validate data quality.

In [2]:
import pandas as pd
import great_expectations as gx

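## 3. Load the Labels Dataset

Here, we load Labels.csv directly from a URL. Judging by its column names, each record holds a timestamp, the X/Y/Z locations of two cars, three image filenames (occluded view, occluding-car view, and ground truth), and the top-left and bottom-right corners of a pedestrian bounding box. The pd.read_csv() function loads the data into a pandas DataFrame, and the names argument assigns the column names. Because names is passed without header=0, pandas treats the file's own header line as data, which is why that line appears as row 0 in the preview in the next step.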

In [20]:
df = pd.read_csv("https://raw.githubusercontent.com/zubxxr/SOFE3980U-Lab5/refs/heads/main/Labels.csv",
                 names = [
    "Timestamp", "Car1_Location_X", "Car1_Location_Y", "Car1_Location_Z",
    "Car2_Location_X", "Car2_Location_Y", "Car2_Location_Z",
    "Occluded_Image_View", "Occluding_Car_View", "Ground_Truth_View",
    "pedestrianLocationX_TopLeft", "pedestrianLocationY_TopLeft",
    "pedestrianLocationX_BottomRight", "pedestrianLocationY_BottomRight"
])
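
The effect of passing names with and without header=0 can be sketched on a small in-memory sample. This is a minimal illustration, not the notebook's actual load: the two-column CSV text below is a hypothetical stand-in for Labels.csv, with values taken from the dataset preview in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for Labels.csv.
csv_text = (
    "Timestamp,pedestrianLocationX_TopLeft\n"
    "1736796157,593\n"
    "1736796167,579\n"
)
names = ["Timestamp", "pedestrianLocationX_TopLeft"]

# names= alone: pandas assumes the file has no header line,
# so the header text is read as the first data row.
header_as_data = pd.read_csv(io.StringIO(csv_text), names=names)

# names= with header=0: the file's header line is dropped and replaced by names.
header_replaced = pd.read_csv(io.StringIO(csv_text), names=names, header=0)

column = "pedestrianLocationX_TopLeft"
print(len(header_as_data), pd.api.types.is_numeric_dtype(header_as_data[column]))
print(len(header_replaced), pd.api.types.is_numeric_dtype(header_replaced[column]))
```

In the first frame the column holds the header text alongside the numbers, so pandas cannot give it a numeric dtype; in the second it is parsed as integers.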

## 4. Preview the Dataset

In this step, we use the head() function to display the first five rows of the dataset. This is a common method for quickly inspecting the data structure and ensuring the dataset has been loaded correctly.

In [21]:
df.head()

Unnamed: 0,Timestamp,Car1_Location_X,Car1_Location_Y,Car1_Location_Z,Car2_Location_X,Car2_Location_Y,Car2_Location_Z,Occluded_Image_View,Occluding_Car_View,Ground_Truth_View,pedestrianLocationX_TopLeft,pedestrianLocationY_TopLeft,pedestrianLocationX_BottomRight,pedestrianLocationY_BottomRight
0,Timestamp,Car1_Location_X,Car1_Location_Y,Car1_Location_Z,Car2_Location_X,Car2_Location_Y,Car2_Location_Z,Occluded_Image_view,Occluding_Car_view,Ground_Truth_View,pedestrianLocationX_TopLeft,pedestrianLocationY_TopLeft,pedestrianLocationX_BottomRight,pedestrianLocationY_BottomRight
1,1736796157,-51.40297655,143,0.596902,-59.32026969,140,0.596902,A_001.png,B_001.png,C_001.png,593,361,610,410
2,1736796167,-53.81963722,143,0.596902,-59.19656815,140,0.596902,A_002.png,B_002.png,C_002.png,579,368,594,415
3,1736796178,-50.23914439,143,0.596902,-56.74447887,140,0.596902,A_003.png,B_003.png,C_003.png,854,720,854,720
4,1736796188,-53.70722021,143,0.596902,-57.30938047,140,0.596902,A_004.png,B_004.png,C_004.png,549,368,567,425


## 5. Set Up Great Expectations Context and Data Source

Here, we initialize the Great Expectations context, which manages data sources, expectations, and validation processes. We then add a pandas data source to the context, allowing us to work with our pandas DataFrame as a data asset for further validation steps.

In [22]:
context = gx.get_context()
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")

INFO:great_expectations.data_context.types.base:Created temporary directory '/tmp/tmp1ms4xes6' for ephemeral docs site


## 6. Define and Create a Data Batch

In this step, we define a batch of data from the pandas DataFrame (df). A batch in Great Expectations represents a subset of the data that can be validated against expectations. We create a batch definition for the whole DataFrame and then use this definition to retrieve the batch of data for validation.

In [23]:
batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

## 7. Expect Column Median to Be Between (pedestrianLocationX_TopLeft) (1)

The median value of pedestrianLocationX_TopLeft should fall within a reasonable range (e.g., between 500 and 900).

In [32]:
expectation_1 = gx.expectations.ExpectColumnMedianToBeBetween(
    column="pedestrianLocationX_TopLeft", min_value=500, max_value=900
)
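
What this expectation checks can be sketched in plain Python. The helper name median_is_between is hypothetical, the inclusive bounds are an assumption about the default behaviour, and the sample is just the four numeric values visible in the preview, not the full column.

```python
from statistics import median


def median_is_between(values, min_value, max_value):
    """Return (success, observed_median) for an inclusive range check."""
    observed = median(values)
    return min_value <= observed <= max_value, observed


# The four numeric pedestrianLocationX_TopLeft values shown in the preview.
sample = [593, 579, 854, 549]

success, observed = median_is_between(sample, 500, 900)
print(success, observed)
```

The check only makes sense for numeric values; a column that also contains text, such as a stray header row, has no well-defined median.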


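## 8. Validate the Data Against the Expectation

In this step, we validate the dataset (batch) against the expectation defined in step 7. In the recorded run below, the validation does not return an observed median: success is false and exception_info reports an error while computing the column.median metric. The traceback is truncated, so the cause is not shown; one likely explanation is that the header row loaded as data (row 0 of the preview) leaves the column with non-numeric values.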

In [33]:
validation_result_1 = batch.validate(expectation_1)
print(validation_result_1)


Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": false,
  "expectation_config": {
    "type": "expect_column_median_to_be_between",
    "kwargs": {
      "column": "pedestrianLocationX_TopLeft",
      "min_value": 500.0,
      "max_value": 900.0,
      "batch_id": "pandas-pd dataframe asset"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "MetricConfigurationID(metric_name='column.median', metric_domain_kwargs_id='5bf439273a76ec865495c0580ddfed4e', metric_value_kwargs_id=())": {
      "exception_traceback": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.11/dist-packages/great_expectations/execution_engine/execution_engine.py\", line 534, in _process_direct_and_bundled_metric_computation_configurations\n    metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/usr/local/lib/python3.11/dist-packages/great_expectations/expectations/metrics/metric_provider.py\", line 60, in inner_func\n    return metric_fn(*args, **kwargs)\n   

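## 9. Expect Column Values to Match a Pattern List (Ground_Truth_View) (2)

The Ground_Truth_View column should follow a consistent filename pattern (e.g., "C_###.png"). The expectation uses SQL LIKE syntax, in which % matches any run of characters and _ matches exactly one character. The pattern "C_%.png" is therefore looser than "C_###.png": it accepts any name that starts with "C", has at least one more character, and ends in ".png".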

In [40]:
expectation_2 = gx.expectations.ExpectColumnValuesToMatchLikePatternList(
    column="Ground_Truth_View", like_pattern_list=["C_%.png"]
)
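
To see how loose the LIKE pattern is, it can be translated into a regular expression and compared with a stricter one. This is a minimal sketch: like_to_regex is a hypothetical helper that handles only the % and _ wildcards (no ESCAPE clause), and the strict pattern is one assumed reading of "C_###.png" as three digits.

```python
import re


def like_to_regex(like_pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")  # % matches any run of characters
        elif ch == "_":
            parts.append(".")  # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


loose = like_to_regex("C_%.png")
strict = re.compile(r"C_\d{3}\.png")  # assumed meaning of "C_###.png"

results = {
    name: (bool(loose.fullmatch(name)), bool(strict.fullmatch(name)))
    for name in ["C_001.png", "Cx1.png", "C_1.png"]
}
print(results)
```

If the three-digit form matters, a regex-based check is the more precise way to express it than a LIKE pattern.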



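## 10. Validate the Data Against the Expectation

In the recorded run below, the validation fails before any values are checked: the traceback ends in KeyError: 'PandasExecutionEngine' while looking up a metric provider, which suggests that this LIKE-pattern expectation has no implementation for pandas data.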

In [42]:
validation_result_2 = batch.validate(expectation_2)
print(validation_result_2)



Calculating Metrics: 0it [00:00, ?it/s]

{
  "success": false,
  "expectation_config": {
    "type": "expect_column_values_to_match_like_pattern_list",
    "kwargs": {
      "column": "Ground_Truth_View",
      "like_pattern_list": [
        "C_%.png"
      ],
      "batch_id": "pandas-pd dataframe asset"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "exception_traceback": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.11/dist-packages/great_expectations/expectations/registry.py\", line 315, in get_metric_provider\n    return metric_definition[\"providers\"][type(execution_engine).__name__]\n           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nKeyError: 'PandasExecutionEngine'\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.11/dist-packages/great_expectations/validator/validator.py\", line 714, in _generate_metric_dependency_subgraphs_for_each_expectati

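## 11. Expect Table Columns to Match a Set (3)

The dataset should contain a fixed set of expected columns, ensuring data structure consistency. Note that the list below omits pedestrianLocationX_BottomRight and pedestrianLocationY_BottomRight, both of which are present in the DataFrame, so the validation in the next step reports success as false.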

In [38]:
expected_columns = [
    "Timestamp", "Car1_Location_X", "Car1_Location_Y", "Car1_Location_Z",
    "Car2_Location_X", "Car2_Location_Y", "Car2_Location_Z",
    "Occluded_Image_View", "Occluding_Car_View", "Ground_Truth_View",
    "pedestrianLocationX_TopLeft", "pedestrianLocationY_TopLeft"
]

expectation_3 = gx.expectations.ExpectTableColumnsToMatchSet(column_set=expected_columns)
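
The comparison behind this expectation can be sketched with plain Python sets. This is an illustration under the assumption that the check requires an exact match between the two sets; the observed columns are the names assigned when the DataFrame was loaded.

```python
# Columns listed in expected_columns above.
expected = {
    "Timestamp", "Car1_Location_X", "Car1_Location_Y", "Car1_Location_Z",
    "Car2_Location_X", "Car2_Location_Y", "Car2_Location_Z",
    "Occluded_Image_View", "Occluding_Car_View", "Ground_Truth_View",
    "pedestrianLocationX_TopLeft", "pedestrianLocationY_TopLeft",
}

# Columns assigned to the DataFrame when it was loaded.
observed = expected | {
    "pedestrianLocationX_BottomRight",
    "pedestrianLocationY_BottomRight",
}

missing = sorted(expected - observed)      # expected but absent from the data
unexpected = sorted(observed - expected)   # present in the data but not expected
exact_match = not missing and not unexpected

print(missing, unexpected, exact_match)
```

Nothing is missing, but the two BottomRight columns are unexpected, so an exact-match comparison fails.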


## 12. Validate the Data Against the Expectation

In [39]:
validation_result_3 = batch.validate(expectation_3)
print(validation_result_3)


Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

{
  "success": false,
  "expectation_config": {
    "type": "expect_table_columns_to_match_set",
    "kwargs": {
      "batch_id": "pandas-pd dataframe asset",
      "column_set": [
        "Timestamp",
        "Car1_Location_X",
        "Car1_Location_Y",
        "Car1_Location_Z",
        "Car2_Location_X",
        "Car2_Location_Y",
        "Car2_Location_Z",
        "Occluded_Image_View",
        "Occluding_Car_View",
        "Ground_Truth_View",
        "pedestrianLocationX_TopLeft",
        "pedestrianLocationY_TopLeft"
      ]
    },
    "meta": {}
  },
  "result": {
    "observed_value": [
      "Car1_Location_X",
      "Car1_Location_Y",
      "Car1_Location_Z",
      "Car2_Location_X",
      "Car2_Location_Y",
      "Car2_Location_Z",
      "Ground_Truth_View",
      "Occluded_Image_View",
      "Occluding_Car_View",
      "Timestamp",
      "pedestrianLocationX_BottomRight",
      "pedestrianLocationX_TopLeft",
      "pedestrianLocationY_BottomRight",
      "pedestrianLocati