# Phase 2 - Week 2 - Day 5 AM - Data Ethics & Data Validation

In this notebook, we run data validation tests with Python's Great Expectations (GX) library. First, we install the Great Expectations package. We then use the [Yellow Taxi Trip Data](https://www.kaggle.com/datasets/microize/newyork-yellow-taxi-trip-data-2020-2019) as the dataset.

*Sources: [Great Expectations on GitHub](https://github.com/great-expectations/great_expectations) and [greatexpectations.io](https://greatexpectations.io/)*

# A. Install the Great Expectations Package

In [1]:
# Install the library

!pip install -q great-expectations


The `-q` (or `--quiet`) flag of `pip install` reduces pip's output, so you see less information about the installation progress.
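Because the install is quiet, it can help to confirm which version was actually installed. The calls used later in this notebook (`context.sources.add_pandas`, `context.get_validator`) appear to follow the pre-1.0 GX releases; that is an assumption about this notebook's environment, not something the notebook states. A minimal sketch using only the standard library, where the helper name `installed_version` is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# 'great-expectations' is the distribution name used in the pip command above.
print(installed_version('great-expectations'))
```

If the printed version differs from the one this notebook was written for, some of the calls below may need adjusting.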

# B. Instantiate Data Context

A `Data Context` is the primary entry point for a Great Expectations (GX) deployment, and **it provides the configurations and methods for all supporting GX components**.

As the primary entry point for the GX API, the Data Context provides a convenient method for accessing common objects based on untyped input or common defaults. A Data Context also allows you to configure top-level components, and you can use different storage methodologies to back up your Data Context configuration.

<img src='https://docs.greatexpectations.io/assets/images/data_context_does_for_you-df2eca32d0152ead16cccd5d3d226abb.png'>

*Please visit this [url](https://docs.greatexpectations.io/docs/terms/data_context/) for more details.*

In [2]:
# Create a data context

from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')

The parameter `project_root_dir='./'` means the Data Context is created in the current working directory. To store the Data Context somewhere else, pass that location as `project_root_dir`.
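To see where the project folder ends up, the relative path can be resolved with `pathlib`. This sketch assumes the Data Context writes its files to a `gx` subfolder of `project_root_dir`, which is what the Data Docs path printed later in this notebook (`.../gx/uncommitted/...`) suggests. The helper name `expected_context_dir` is hypothetical.

```python
from pathlib import Path


def expected_context_dir(project_root_dir):
    """Resolve project_root_dir and append the assumed 'gx' folder name."""
    return Path(project_root_dir).resolve() / 'gx'


# With './' this prints an absolute path ending in 'gx'.
print(expected_context_dir('./'))
```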

# C. Connect to A `Datasource`

In Great Expectations, you must define a `Datasource` and its `Data Assets`.

* `Datasource` : provides a standard API for accessing and interacting with data from a wide variety of source systems.

* `Data Asset` : a collection of records within a `Datasource` which is usually named based on the underlying data system and sliced to correspond to a desired specification.

*For an example where a data asset spans more than one file, see this [url](https://docs.greatexpectations.io/docs/terms/batch_request).*

In [3]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'csv-data-jan'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'trip-january'
path_to_data = 'https://raw.githubusercontent.com/FTDS-learning-materials/phase-2/master/w2/P2W2D5AM%20-%20Data%20Ethics%20%26%20Data%20Validation%20-%20Yellow%20Taxi%20Trip%20Data%20-%202019%20-%2001.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()


# D. Create an Expectation Suite

An `Expectation Suite` is a collection of verifiable assertions about data. **`Expectation Suites` combine multiple Expectations into an overall description of data.**

`Expectation Suite` names are customizable. The only constraint is that each name **must be unique within a given project**.

You'll use a Validator to interact with your batch of data and generate an `Expectation Suite`.

Every time you evaluate an Expectation with `validator.expect_*`, it is immediately validated against your data.

In [4]:
# Create an expectation suite
expectation_suite_name = 'expectation-trip-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()




Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code_id,store_and_fwd_flag,pickup_location_id,dropoff_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
0,1,2019-01-15 03:36:12,2019-01-15 03:42:19,1,1.0,1,N,230,48,1,6.5,0.5,0.5,1.95,0.0,0.3,9.75,
1,1,2019-01-25 18:20:32,2019-01-25 18:26:55,1,0.8,1,N,112,112,1,6.0,1.0,0.5,1.55,0.0,0.3,9.35,0.0
2,1,2019-01-05 06:47:31,2019-01-05 06:52:19,1,1.1,1,N,107,4,2,6.0,0.0,0.5,0.0,0.0,0.3,6.8,
3,1,2019-01-09 15:08:02,2019-01-09 15:20:17,1,2.5,1,N,143,158,1,11.0,0.0,0.5,3.0,0.0,0.3,14.8,
4,1,2019-01-25 18:49:51,2019-01-25 18:56:44,1,0.8,1,N,246,90,1,6.5,1.0,0.5,1.65,0.0,0.3,9.95,0.0


## D.1 - Expectations

An `Expectation` is a verifiable assertion about source data. Similar to assertions in traditional Python unit tests, Expectations provide a flexible, declarative language for describing expected behaviors.

**Expectations describe what we expect from the data.**

For example, we can expect the values in a column:
- not to be null
- to be unique
- to fall between x and y
- to match a regular expression
- and many more

*You can see the gallery of Expectations [here](https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html)*
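To make the idea concrete, the kinds of checks listed above can be imitated in plain Python. This is only a conceptual sketch, not how GX implements Expectations; the sample rows and the function names are hypothetical.

```python
import re

# Hypothetical sample rows, not taken from the taxi dataset.
rows = [
    {'trip_id': 'T-001', 'passenger_count': 1},
    {'trip_id': 'T-002', 'passenger_count': 4},
    {'trip_id': 'T-003', 'passenger_count': None},
]


def values(column):
    """Collect one column from the sample rows."""
    return [row[column] for row in rows]


def not_null(column):
    return all(v is not None for v in values(column))


def unique(column):
    vals = values(column)
    return len(vals) == len(set(vals))


def between(column, low, high):
    # Missing values are skipped here; a separate not-null check covers them.
    return all(low <= v <= high for v in values(column) if v is not None)


def matches_regex(column, pattern):
    return all(re.fullmatch(pattern, v) for v in values(column) if v is not None)


checks = {
    'trip_id is not null': not_null('trip_id'),
    'passenger_count is not null': not_null('passenger_count'),
    'trip_id is unique': unique('trip_id'),
    'passenger_count between 0 and 6': between('passenger_count', 0, 6),
    'trip_id matches T-ddd': matches_regex('trip_id', r'T-\d{3}'),
}

for name, passed in checks.items():
    print(f'{name}: {passed}')
```

Each check returns a single pass or fail answer for the whole column, which is the same shape as the `success` field in the GX results shown in the cells that follow.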

In [5]:
# Expectation 1 : Column `pickup_datetime` can not contain missing values

validator.expect_column_values_to_not_be_null('pickup_datetime')

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "pickup_datetime",
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [6]:
# Expectation 2 : Column `congestion_surcharge` can not contain missing values


validator.expect_column_values_to_not_be_null('congestion_surcharge')

{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "congestion_surcharge",
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 6327,
    "unexpected_percent": 63.27,
    "partial_unexpected_list": [
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

As you can see, the Expectation `expect_column_values_to_not_be_null` on column `congestion_surcharge` returns `"success": false` because `63.27 %` of the rows (6327 of 10000) contain missing values.
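The percentage in the result is simply `unexpected_count / element_count * 100`. The sketch below reproduces that arithmetic in plain Python and adds a tolerance threshold. GX exposes a similar idea through the `mostly` argument on column-value Expectations; the function here only imitates the arithmetic, and its name `null_check` is hypothetical.

```python
def null_check(values, mostly=1.0):
    """Summarise a not-null check in plain Python.

    `mostly` is the fraction of values that must be non-null for the check to pass.
    """
    element_count = len(values)
    unexpected_count = sum(1 for v in values if v is None)
    if element_count:
        unexpected_percent = 100 * unexpected_count / element_count
        expected_fraction = 1 - unexpected_count / element_count
    else:
        unexpected_percent = 0.0
        expected_fraction = 1.0
    return {
        'success': expected_fraction >= mostly,
        'element_count': element_count,
        'unexpected_count': unexpected_count,
        'unexpected_percent': unexpected_percent,
    }


# Same proportions as the output above: 6327 nulls out of 10000 rows.
sample = [None] * 6327 + [0.0] * 3673

strict = null_check(sample)               # every value must be present
lenient = null_check(sample, mostly=0.3)  # at least 30 % must be present

print(strict)
print(lenient)
```

With the strict default the check fails, while a threshold of 30 % passes because roughly 36.7 % of the values are present.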

In [7]:
# Expectation 3 : Column `dropoff_datetime` must be unique

validator.expect_column_values_to_be_unique('dropoff_datetime')

{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "dropoff_datetime",
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 56,
    "unexpected_percent": 0.5599999999999999,
    "partial_unexpected_list": [
      "2019-01-09 16:33:12",
      "2019-01-15 08:24:26",
      "2019-01-22 11:49:18",
      "2019-01-31 17:24:49",
      "2019-01-09 18:51:15",
      "2019-01-17 11:43:05",
      "2019-01-04 12:27:51",
      "2019-01-04 16:10:54",
      "2019-01-26 17:23:28",
      "2019-01-31 17:24:49",
      "2019-01-15 07:59:52",
      "2019-01-21 12:52:46",
      "2019-01-08 12:30:32",
      "2019-01-29 16:51:46",
      "2019-01-19 00:17:45",
      "2019-01-19 20:11:26",
      "2019-01-17 11:43:05",
      "2019-01-02 18:16:25",
      "2019-01-04 06:52:16",
      "2019-01-04 16:10:54"
    ],
    "missing_count": 0,
    "mi

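The partial list above shows some timestamps twice (for example `2019-01-31 17:24:49`), which suggests that every row sharing a value is counted as unexpected, not only the second occurrence. A plain-Python sketch under that assumption; the helper name `duplicated_rows` and the sample values are hypothetical.

```python
from collections import Counter


def duplicated_rows(values):
    """Return every value that occurs more than once, keeping all repeats."""
    counts = Counter(values)
    return [v for v in values if counts[v] > 1]


# Hypothetical timestamps; the first and last rows share the same value.
dropoffs = [
    '2019-01-31 17:24:49',
    '2019-01-15 08:24:26',
    '2019-01-31 17:24:49',
]

unexpected = duplicated_rows(dropoffs)
print(len(unexpected), unexpected)
```

Under this counting rule, one duplicated timestamp contributes two unexpected rows.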
In [8]:
# Expectation 4 : Column `tip_amount` must be between $ 0 and $ 100 (inclusive)

validator.expect_column_values_to_be_between(
    column='tip_amount', min_value=0, max_value=100
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "tip_amount",
      "min_value": 0,
      "max_value": 100,
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [9]:
# Expectation 5 : Column `trip_distance` must exist, because it is needed to calculate the travel cost to be paid

validator.expect_column_to_exist(column='trip_distance')

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "trip_distance",
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [10]:
# Expectation 6 : Column `payment_type` must contain one of the following 6 things :
# 1 = Credit card
# 2 = Cash
# 3 = No charge
# 4 = Dispute
# 5 = Unknown
# 6 = Voided trip

validator.expect_column_values_to_be_in_set('payment_type', [1, 2, 3, 4, 5, 6])

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "column": "payment_type",
      "value_set": [
        1,
        2,
        3,
        4,
        5,
        6
      ],
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [11]:
# Expectation 7 : Column `total_amount` must be of type integer or float

validator.expect_column_values_to_be_in_type_list('total_amount', ['integer', 'float'])

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_type_list",
    "kwargs": {
      "column": "total_amount",
      "type_list": [
        "integer",
        "float"
      ],
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [12]:
# Expectation 8 : The maximum value of column `mta_tax` must be between `$ 0` and `$ 0.5`

validator.expect_column_max_to_be_between('mta_tax', 0, 0.5)

{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_max_to_be_between",
    "kwargs": {
      "column": "mta_tax",
      "min_value": 0,
      "max_value": 0.5,
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 37.51
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [13]:
# Expectation 9 : The average of `trip_distance` must be in the range 0 - 5 miles per trip

validator.expect_column_mean_to_be_between('trip_distance', 0, 5)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_mean_to_be_between",
    "kwargs": {
      "column": "trip_distance",
      "min_value": 0,
      "max_value": 5,
      "batch_id": "csv-data-jan-trip-january"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 2.7589909999999995
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
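Expectations 8 and 9 differ from the earlier ones: they test a single aggregate of the column (its maximum or mean) instead of each value, which is why their results report one `observed_value`. A plain-Python sketch of that pattern; the helper name `aggregate_between` is hypothetical and the sample values are illustrative, with `37.51` mirroring the failing maximum reported above.

```python
from statistics import mean


def aggregate_between(values, aggregate, min_value, max_value):
    """Apply an aggregate function to a column and test the result against bounds."""
    observed = aggregate(values)
    return {
        'success': min_value <= observed <= max_value,
        'observed_value': observed,
    }


# Illustrative values only, not the full dataset.
mta_tax = [0.5, 0.5, 0.0, 37.51]
trip_distance = [1.0, 0.8, 1.1, 2.5, 0.8]

max_result = aggregate_between(mta_tax, max, 0, 0.5)
mean_result = aggregate_between(trip_distance, mean, 0, 5)

print(max_result)
print(mean_result)
```

A single outlier is enough to fail a maximum check, while a mean check can still pass when a few values are extreme.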

In [14]:
# Save into Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=False)

By default, Great Expectations stores only the Expectations that succeeded. Set the parameter `discard_failed_expectations=False` so that all Expectations, both successful and failed, are stored.
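The effect of that flag can be pictured as a simple filter over the validation results. The sketch below only imitates the behavior described above in plain Python; the function name `expectations_to_save` and the result dictionaries are hypothetical simplifications of the real objects.

```python
def expectations_to_save(results, discard_failed_expectations=True):
    """Pick which expectation types would be kept in the suite."""
    if discard_failed_expectations:
        return [r['expectation_type'] for r in results if r['success']]
    return [r['expectation_type'] for r in results]


# Simplified outcomes modelled on three of the cells above.
results = [
    {'expectation_type': 'expect_column_values_to_not_be_null', 'success': False},
    {'expectation_type': 'expect_column_to_exist', 'success': True},
    {'expectation_type': 'expect_column_max_to_be_between', 'success': False},
]

kept_by_default = expectations_to_save(results)
kept_all = expectations_to_save(results, discard_failed_expectations=False)

print(kept_by_default)
print(kept_all)
```

Keeping the failed Expectations matters here because the same suite is reused later on a different file, where those checks may behave differently.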

## D.2 - Checkpoint

A `Checkpoint` is the primary means for validating data in a production deployment of Great Expectations.

`Checkpoints` provide a convenient abstraction for **bundling the Validation of a Batch (or Batches) of data against an Expectation Suite (or several)**, as well as the Actions that should be taken after the validation.

<img src='https://docs.greatexpectations.io/assets/images/how_a_checkpoint_works-10e7fda2c9013d98a36c1d8526036764.png'>
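Conceptually, a Checkpoint ties three things together: the batches to validate, the suite to validate them against, and the actions to run afterwards. The following plain-Python sketch illustrates that bundling only; `run_checkpoint`, the suite dictionary, and the logging action are hypothetical and do not reflect the GX implementation.

```python
def run_checkpoint(batches, suite, actions):
    """Validate each named batch against the suite, then run every action."""
    results = {}
    for batch_name, rows in batches.items():
        results[batch_name] = {name: check(rows) for name, check in suite.items()}
    summary = {
        'success': all(all(r.values()) for r in results.values()),
        'results': results,
    }
    for action in actions:
        action(summary)
    return summary


# One check, two tiny hypothetical batches, and one action that records the outcome.
suite = {
    'tip_amount between 0 and 100': lambda rows: all(
        0 <= row['tip_amount'] <= 100 for row in rows
    ),
}
batches = {
    'january': [{'tip_amount': 1.95}, {'tip_amount': 0.0}],
    'february': [{'tip_amount': 250.0}],
}
log = []
summary = run_checkpoint(
    batches,
    suite,
    actions=[lambda s: log.append('passed' if s['success'] else 'failed')],
)

print(summary)
print(log)
```

The overall run succeeds only if every check passes on every batch, and the actions see the combined summary.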

In [15]:
# Create a checkpoint

checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

In [16]:
# Run a checkpoint

checkpoint_result = checkpoint_1.run()




## D.3 - Data Docs

`Data Docs` translate `Expectations`, `Validation Results`, and other metadata into human-readable documentation, compiled automatically from your data tests. **Data Docs are rendered as HTML files**, so you can open them with any browser.

In [17]:
# Build data docs

context.build_data_docs()


{'local_site': 'file:///content/gx/uncommitted/data_docs/local_site/index.html'}
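The returned dictionary maps a site name to a `file://` URL. To locate the HTML file on disk (for example, to download it from a hosted notebook), the URL can be converted to a filesystem path with the standard library. A minimal sketch; the helper name `file_url_to_path` is hypothetical and the dictionary is copied from the output above.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_url_to_path(url):
    """Convert a file:// URL into a local filesystem path."""
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        raise ValueError(f'not a file URL: {url}')
    return Path(url2pathname(parsed.path))


docs = {'local_site': 'file:///content/gx/uncommitted/data_docs/local_site/index.html'}
index_path = file_url_to_path(docs['local_site'])

# exists() is only True on the machine where the docs were built.
print(index_path, index_path.exists())
```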

# E. Data Validation using Another File

In [18]:
# Load the existing Data Context from the `gx` folder created earlier

import great_expectations as gx

context_jan = gx.get_context(context_root_dir='./gx/')

In [19]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'csv-data-feb'
datasource = context_jan.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'trip-february'
path_to_data = 'https://raw.githubusercontent.com/FTDS-learning-materials/phase-2/master/w2/P2W2D5AM%20-%20Data%20Ethics%20%26%20Data%20Validation%20-%20Yellow%20Taxi%20Trip%20Data%20-%202019%20-%2002.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request_feb = asset.build_batch_request()

In [20]:
# Create a checkpoint

checkpoint_2 = context_jan.add_or_update_checkpoint(
    name = 'checkpoint_2',
    batch_request = batch_request_feb,
    expectation_suite_name = expectation_suite_name
)

checkpoint_result = checkpoint_2.run()




In [21]:
# Build data docs

context_jan.build_data_docs()


{'local_site': 'file:///content/gx/uncommitted/data_docs/local_site/index.html'}