# Part 1: Great Expectations

## 1. Install Great Expectations Library


This step installs the Great Expectations library, which is an open-source framework for data quality testing. It allows you to define, validate, and document expectations for your datasets. The library helps ensure that your data is clean, consistent, and meets predefined quality standards. By installing it via pip, we can use the various functionalities of Great Expectations in the notebook.

In [None]:
!pip install great_expectations
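
Later cells in this notebook call `context.data_sources`, which as far as we know belongs to the 1.x API of Great Expectations, so it can help to confirm which major version pip installed. The sketch below is one way to check using only the standard library; `installed_major_version` is a hypothetical helper name, not part of Great Expectations.

```python
from importlib import metadata


def installed_major_version(package: str):
    """Return the installed major version of `package`, or None if it is absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    # Take the leading numeric component, e.g. "1.3.5" -> 1
    return int(version.split(".")[0])


major = installed_major_version("great_expectations")
if major is None:
    print("great_expectations is not installed")
elif major < 1:
    print("Found a 0.x release; the context.data_sources API used below likely needs 1.0 or later")
else:
    print(f"great_expectations major version: {major}")
```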



## 2. Import Necessary Libraries

In this step, we import two essential libraries:

- pandas: A powerful data manipulation library that provides data structures (like DataFrames) to handle and analyze structured data efficiently.
- great_expectations: The core library of the Great Expectations framework, imported as gx. This will be used to define expectations and validate data quality.

In [None]:
import pandas as pd
import great_expectations as gx

## 3. Load Labels.csv


In [None]:
import pandas as pd

# Load with explicit path (Colab stores uploads in /content/)
df = pd.read_csv("/content/Labels.csv")

Here, we load Labels.csv from the `/content/` directory, where Colab stores uploaded files. Judging by its columns, the file holds one row per timestamp with the X, Y, and Z locations of two cars, the file names of three image views (occluded, occluding, and ground truth), and the top-left and bottom-right pixel coordinates of a pedestrian bounding box. The pd.read_csv() function loads the data into a pandas DataFrame and takes the column names from the file's header row.
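
A quick sanity check after loading is to confirm that the columns we plan to validate are present. The sketch below uses a small in-memory excerpt instead of the real file, and it assumes the first header field of the CSV is empty (a saved index); `index_col=0` then reads that field as the row index rather than as a data column. The names `sample_csv`, `sample_df`, and `expected_columns` are illustrative only.

```python
import io

import pandas as pd

# Small excerpt in the assumed layout of Labels.csv (first field is a saved index)
sample_csv = io.StringIO(
    ",Timestamp,Car1_Location_X,Car1_Location_Y\n"
    "0,1736796157,-51.402977,143\n"
    "1,1736796167,-53.819637,143\n"
)

# index_col=0 treats the unnamed first field as the row index
sample_df = pd.read_csv(sample_csv, index_col=0)

# Columns that later checks rely on
expected_columns = ["Timestamp", "Car1_Location_X", "Car1_Location_Y"]
missing = [c for c in expected_columns if c not in sample_df.columns]

print("Missing columns:", missing)
print(sample_df.dtypes)
```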

## 4. Preview the Dataset

In this step, we use the head() function to display the first five rows of the dataset. This is a common method for quickly inspecting the data structure and ensuring the dataset has been loaded correctly.

In [None]:
df.head()

Unnamed: 0,Timestamp,Car1_Location_X,Car1_Location_Y,Car1_Location_Z,Car2_Location_X,Car2_Location_Y,Car2_Location_Z,Occluded_Image_view,Occluding_Car_view,Ground_Truth_View,pedestrianLocationX_TopLeft,pedestrianLocationY_TopLeft,pedestrianLocationX_BottomRight,pedestrianLocationY_BottomRight
0,1736796157,-51.402977,143,0.596902,-59.32027,140,0.596902,A_001.png,B_001.png,C_001.png,593,361,610,410
1,1736796167,-53.819637,143,0.596902,-59.196568,140,0.596902,A_002.png,B_002.png,C_002.png,579,368,594,415
2,1736796178,-50.239144,143,0.596902,-56.744479,140,0.596902,A_003.png,B_003.png,C_003.png,854,720,854,720
3,1736796188,-53.70722,143,0.596902,-57.30938,140,0.596902,A_004.png,B_004.png,C_004.png,549,368,567,425
4,1736796198,-52.053721,143,0.596902,-59.545897,140,0.596902,A_005.png,B_005.png,C_005.png,524,368,537,413
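
The preview already hints at a possible data quality issue: row 2 has identical top-left and bottom-right pedestrian coordinates (854, 720), which describes a box with zero width and height. What that value means is not stated here. The sketch below flags such rows with plain pandas, using the five preview rows as sample data and assuming image coordinates in which the bottom-right corner should have larger X and Y than the top-left corner.

```python
import pandas as pd

# The five preview rows, reduced to the pedestrian bounding-box columns
boxes = pd.DataFrame(
    {
        "pedestrianLocationX_TopLeft": [593, 579, 854, 549, 524],
        "pedestrianLocationY_TopLeft": [361, 368, 720, 368, 368],
        "pedestrianLocationX_BottomRight": [610, 594, 854, 567, 537],
        "pedestrianLocationY_BottomRight": [410, 415, 720, 425, 413],
    }
)

width = boxes["pedestrianLocationX_BottomRight"] - boxes["pedestrianLocationX_TopLeft"]
height = boxes["pedestrianLocationY_BottomRight"] - boxes["pedestrianLocationY_TopLeft"]

# Under the stated assumption, a usable box needs strictly positive width and height
degenerate = boxes[(width <= 0) | (height <= 0)]
print(degenerate.index.tolist())
```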


## 5. Set Up Great Expectations Context and Data Source

Here, we initialize the Great Expectations context, which manages data sources, expectations, and validation processes. We then add a pandas data source to the context, allowing us to work with our pandas DataFrame as a data asset for further validation steps.

In [35]:
# Create a Data Context, register a pandas data source, and add a DataFrame asset
context = gx.get_context()
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")


INFO:great_expectations.data_context.types.base:Created temporary directory '/tmp/tmpq2uayuvq' for ephemeral docs site


## 6. Define and Create a Data Batch

In this step, we define a batch of data from the pandas DataFrame (df). A batch in Great Expectations represents a subset of the data that can be validated against expectations. We create a batch definition for the whole DataFrame and then use this definition to retrieve the batch of data for validation.

In [38]:
# Define a batch covering the whole DataFrame, then fetch it by passing df at runtime
batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})


## 7. Define an Expectation for Column Values

Here, we show how to define an expectation for a column. The first template specifies that the values in a placeholder column named "column" should fall within a given range (between 0 and 20); the second requires the values in that column to be unique. Replace "column" with a real column name from the DataFrame before validating. Expectations describe what valid data looks like for each column.

In [39]:
## Original Function
expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="column", min_value=0, max_value=20
)

## Example Function

## This function only requires a column parameter, and not a max or min value
expectation = gx.expectations.ExpectColumnValuesToBeUnique(
    column="column"
)
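
To build intuition for what these two expectations test, the following sketch performs comparable checks with plain pandas on a made-up column. It is not how Great Expectations is implemented; it only mirrors the idea, assuming an inclusive range and that every member of a duplicated group counts as a failure.

```python
import pandas as pd

# Hypothetical sample column, not taken from Labels.csv
sample = pd.DataFrame({"column": [3, 7, 7, 25]})

# Range check: Series.between is inclusive on both ends by default
in_range = sample["column"].between(0, 20)
outside_rows = sample.index[~in_range].tolist()
print("Rows outside [0, 20]:", outside_rows)

# Uniqueness check: keep=False marks every member of a duplicated group
duplicated = sample["column"].duplicated(keep=False)
repeated_rows = sample.index[duplicated].tolist()
print("Rows with repeated values:", repeated_rows)
```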

## Expectation 1


In [46]:
# Write code here
# Expectation 1: Check if Timestamps are Unique
expectation_1 = gx.expectations.ExpectColumnValuesToBeUnique(
    column="Timestamp"
)
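
Uniqueness on its own does not guarantee that rows were logged in chronological order. The sketch below shows the difference with plain pandas on hypothetical timestamps that are all distinct but partly out of order.

```python
import pandas as pd

# Hypothetical timestamps: all distinct, but the last two are out of order
timestamps = pd.Series([1736796157, 1736796167, 1736796188, 1736796178])

is_unique = timestamps.is_unique
# Strictly increasing: every difference between neighbours is positive
is_strictly_increasing = bool(timestamps.diff().dropna().gt(0).all())

print("unique:", is_unique)
print("strictly increasing:", is_strictly_increasing)
```

A strictly increasing column is always unique, so an ordering check covers both properties.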


## Validate Data Against Expectation 1


In [47]:
# Redefine expectation_1 (replacing the uniqueness check): timestamps must be strictly increasing (chronological order)
expectation_1 = gx.expectations.ExpectColumnValuesToBeIncreasing(
    column="Timestamp",
    strictly=True,  # No duplicate timestamps allowed
    meta={"purpose": "Ensure data is logged in correct time sequence"}
)

validation_result_1 = batch.validate(expectation_1)
print("Validation Result for Timestamp Order:\n", validation_result_1)


Validation Result for Timestamp Order:
 {
  "success": true,
  "expectation_config": {
    "type": "expect_column_values_to_be_increasing",
    "kwargs": {
      "batch_id": "pandas-pd dataframe asset",
      "column": "Timestamp",
      "strictly": true
    },
    "meta": {
      "purpose": "Ensure data is logged in correct time sequence"
    }
  },
  "result": {
    "element_count": 121,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_counts": [],
    "partial_unexpected_index_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
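
When several expectations are validated, a one-line summary per result is easier to scan than the full output. The sketch below works on a plain dict that mirrors the fields printed above; `summarize` is a hypothetical helper, and converting the Great Expectations result object to a dict (for example with its `to_json_dict()` method, if the installed version provides it) is an assumption.

```python
# Plain dict mirroring the fields printed above (trimmed to the ones used here)
result_dict = {
    "success": True,
    "expectation_config": {
        "type": "expect_column_values_to_be_increasing",
        "kwargs": {"column": "Timestamp", "strictly": True},
    },
    "result": {"element_count": 121, "unexpected_count": 0, "unexpected_percent": 0.0},
}


def summarize(result: dict) -> str:
    """Build a one-line summary from a validation result dict."""
    config = result["expectation_config"]
    stats = result["result"]
    status = "PASS" if result["success"] else "FAIL"
    # Pair expectations have no single "column" kwarg, so fall back to "?"
    column = config["kwargs"].get("column", "?")
    return (
        f"{status} {config['type']} on {column}: "
        f"{stats['unexpected_count']}/{stats['element_count']} unexpected"
    )


summary = summarize(result_dict)
print(summary)
```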


## Expectation 2


In [48]:
# Write code here
# Expectation 2: Check if Column Values Are Not Null
expectation_2 = gx.expectations.ExpectColumnValuesToNotBeNull(
    column="Car1_Location_X"
)


## Validate Data Against Expectation 2

In [54]:
# Redefine expectation_2 (replacing the not-null check): Car1_Location_Y must always be 143, expressed as a single-value set
expectation_2 = gx.expectations.ExpectColumnValuesToBeInSet(
    column="Car1_Location_Y",
    value_set=[143],  # Single-value set
    meta={"note": "Car1 should maintain constant Y-position"}
)

validation_result_2 = batch.validate(expectation_2)
print("\nValidation Result for Car1 Y Position:\n", validation_result_2)



Validation Result for Car1 Y Position:
 {
  "success": true,
  "expectation_config": {
    "type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "batch_id": "pandas-pd dataframe asset",
      "column": "Car1_Location_Y",
      "value_set": [
        143
      ]
    },
    "meta": {
      "note": "Car1 should maintain constant Y-position"
    }
  },
  "result": {
    "element_count": 121,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_counts": [],
    "partial_unexpected_index_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


## Expectation 3


In [50]:
# Write code here
# Expectation 3: Check if Values Are Within a Specific Range
expectation_3 = gx.expectations.ExpectColumnValuesToBeBetween(
    column="Car1_Location_X", min_value=-100, max_value=100
)

## Validate Data Against Expectation 3

In [51]:
# Redefine expectation_3 (replacing the range check): bounding box width must be positive (BottomRight X > TopLeft X)
expectation_3 = gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
    column_A="pedestrianLocationX_BottomRight",
    column_B="pedestrianLocationX_TopLeft",
    meta={"validation": "Bounding box width must be positive"}
)

validation_result_3 = batch.validate(expectation_3)
print("\nValidation Result for Bounding Box Logic:\n", validation_result_3)



Validation Result for Bounding Box Logic:
 {
  "success": false,
  "expectation_config": {
    "type": "expect_column_pair_values_a_to_be_greater_than_b",
    "kwargs": {
      "batch_id": "pandas-pd dataframe asset",
      "column_A": "pedestrianLocationX_BottomRight",
      "column_B": "pedestrianLocationX_TopLeft"
    },
    "meta": {
      "validation": "Bounding box width must be positive"
    }
  },
  "result": {
    "element_count": 121,
    "unexpected_count": 3,
    "unexpected_percent": 2.479338842975207,
    "partial_unexpected_list": [
      [
        854,
        854
      ],
      [
        600,
        854
      ],
      [
        580,
        597
      ]
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 2.479338842975207,
    "unexpected_percent_nonmissing": 2.479338842975207,
    "partial_unexpected_counts": [
      {
        "value": [
          580,
          597
        ],
        "count": 1
      },
      {
        "value":