In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("bank_data.csv", delimiter=";" )

<a id="section-setup"></a>
## Selecting the dataframe as a reference for the expectations


The first seven months will be used to build our expectations. The remaining months will be used as batch data to be validated against those expectations.

In [3]:
df_ref = df[df["month"].isin(['jan', 'feb','mar', 'apr','may', 'jun', 'jul'])]

In [4]:
df_ana = df[df["month"].isin(['aug', 'sep','oct', 'nov', 'dec'])]
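A quick sanity check can confirm that the two month lists partition the data, so that no row is silently dropped by a misspelled month. The sketch below uses a small hypothetical dataframe in place of `bank_data.csv`; `split_by_month` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

REF_MONTHS = ["jan", "feb", "mar", "apr", "may", "jun", "jul"]
ANA_MONTHS = ["aug", "sep", "oct", "nov", "dec"]


def split_by_month(df, ref_months, ana_months):
    """Return (reference, batch) subsets and fail loudly if rows are lost or duplicated."""
    ref = df[df["month"].isin(ref_months)]
    ana = df[df["month"].isin(ana_months)]
    if len(ref) + len(ana) != len(df):
        raise ValueError("month lists do not partition the dataframe")
    return ref, ana


# Hypothetical toy data standing in for the bank dataset
toy = pd.DataFrame({"month": ["jan", "aug", "jul", "dec"], "age": [30, 41, 52, 63]})
toy_ref, toy_ana = split_by_month(toy, REF_MONTHS, ANA_MONTHS)
print(len(toy_ref), len(toy_ana))
```

The same helper could be called on `df` to obtain `df_ref` and `df_ana` in one step.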

In [5]:
# Run this cell for this notebook. There will be a warning about dependencies with hopsworks.
# If needed, you can downgrade again afterwards for hopsworks.
#!pip install great-expectations==0.17.23 pandas==2.1.4
#pip install --upgrade great_expectations
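Since the calls used later in this notebook depend on the installed release, it can help to confirm which versions are actually present before going further. A minimal sketch, assuming the package names from the pip comment above; `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for pkg in ("great-expectations", "pandas"):
    print(pkg, installed_version(pkg))
```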


In [7]:
import great_expectations as gx
from great_expectations.core.expectation_configuration import ExpectationConfiguration

<a id="section-setup"></a>
## Setting up your own project

To initialize your own project, run `great_expectations init` and follow the instructions in a terminal window. 

<img src="figures/gx_init.png" width=800px>

Once you have created your suite using `great_expectations suite new`, you can use the `great_expectations suite edit` command to open an auto-generated notebook for setting up the suite. In this session, however, we will build a similar suite by hand and compare the results.

The [getting started guide](https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started.html) can help you along the way. For ideas on how Great Expectations can fit into your workflow, check out [Deployment patterns](https://docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html#deployment-patterns).

You should see a folder with the following organization:

In [8]:
!tree gx

Folder PATH listing for volume OS
Volume serial number is CC7A-27AF
C:\USERS\HENRI\ONEDRIVE\DESKTOP\PPG\MLOPS\LAB 1 - DATA UNIT TESTS\GX
+---checkpoints
+---expectations
+---plugins
¦   +---custom_data_docs
¦       +---renderers
¦       +---styles
¦       +---views
+---profilers
+---uncommitted
    +---data_docs
    ¦   +---local_site
    ¦       +---expectations
    ¦       +---static
    ¦       ¦   +---fonts
    ¦       ¦   ¦   +---HKGrotesk
    ¦       ¦   +---images
    ¦       ¦   +---styles
    ¦       +---validations
    ¦           +---Bank
    ¦               +---__none__
    ¦                   +---20240616T001059.201522Z
    ¦                   +---20240616T001433.341482Z
    ¦                   +---20240616T002518.562361Z
    ¦                   +---20240616T002519.781537Z
    +---validations
        +---Bank
            +---__none__
                +---20240616T001059.201522Z
                +---20240616T001433.341482Z
                +---20240616T002518.562361Z
         

<a id="section-setup"></a>
## Create our Data Context

The data context holds all the information needed to reference the batch datasets that we will test against the created expectations.
In our Data Context, a data source is something that can provide data to Great Expectations, such as an SQL database.
A data asset is one dataset that lives in a data source, such as an SQL table.

In the configuration we provided, there is one data source named data_dir, which is just a folder with CSV files inside. The CSV file we are working with would be a data asset. More information on data sources can be found in the data context reference. For configuring your own, refer to the configuring datasources guides.

A validation operator specifies what should be done with your validation results. Some examples could be writing the validation results to a database, publishing data docs, or sending a notification to a slack channel.

Stores can be used to configure how expectation and validation data will be stored.
These are all configured in the great_expectations.yml file, but it is not mandatory to use one.

The diagram below shows a representation of our data context.
<img src="figures/gx_structure.png" width=800px>


We start by creating a context that references the previously created folder, where all the information will be saved.

In [9]:
context = gx.DataContext("gx") 
#context = gx.get_context(context_root_dir ="gx")

The next step is to define the type of our data source and add it to the context. It can be Pandas or Spark, for example.

In [10]:
datasource_name = "bank_datasource"
try:
    datasource = context.sources.add_pandas(datasource_name)
except Exception:
    # Adding fails when the data source is already registered, so reuse it
    print("Data Source already exists.")
    datasource = context.datasources[datasource_name]

Data Source already exists.


In [11]:
print("data source", context.list_datasources())

data source [{'type': 'pandas', 'name': 'bank_datasource', 'assets': [{'name': 'Bank Aug', 'type': 'dataframe', 'batch_metadata': {}}]}]
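The printed value is a list of plain dictionaries, one per data source, each carrying its own list of assets. A minimal sketch of walking that structure, with the list copied from the output above; `asset_names_by_datasource` is a hypothetical helper.

```python
# Structure copied from the output printed above
datasources = [
    {
        "type": "pandas",
        "name": "bank_datasource",
        "assets": [{"name": "Bank Aug", "type": "dataframe", "batch_metadata": {}}],
    }
]


def asset_names_by_datasource(datasource_configs):
    """Map each data source name to the names of its data assets."""
    return {
        ds["name"]: [asset["name"] for asset in ds.get("assets", [])]
        for ds in datasource_configs
    }


print(asset_names_by_datasource(datasources))
```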


In [12]:
print(datasource)

assets:
  - batch_metadata: {}
    name: Bank Aug
    type: dataframe
name: bank_datasource
type: pandas



Finally, we create our suite. The suite is a tailored group of expectations that will be applied to new batches of data to check whether their quality agrees with what was observed in the reference period. Under the hood, Great Expectations keeps track of an expectation suite as a `dict` representation. This representation can be saved to a file, so that we can load it again later without depending on the Python code that produced it.

In [13]:
suite_bank = context.add_or_update_expectation_suite(expectation_suite_name="Bank")

To start creating our expectations, we first need to explore the data from the reference dataset.

In [14]:
df_ref.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'y'],
      dtype='object')

Let's start by choosing a column to create an expectation for.

In [15]:
df_ref["marital"].describe()

count       33463
unique          3
top       married
freq        19695
Name: marital, dtype: object

In [16]:
df_ref["marital"].unique()

array(['married', 'single', 'divorced'], dtype=object)

There are only three distinct categories for this feature, and we expect new data to behave the same way. If a new category appears, we want to be alerted.

In [17]:
expectation_marital = ExpectationConfiguration(
    expectation_type="expect_column_distinct_values_to_be_in_set",
    kwargs={
        "column": "marital",
        "value_set" : ['married', 'single', 'divorced']
    },
)
suite_bank.add_expectation(expectation_configuration=expectation_marital)

{"expectation_type": "expect_column_distinct_values_to_be_in_set", "kwargs": {"column": "marital", "value_set": ["married", "single", "divorced"]}, "meta": {}}
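Conceptually, this expectation amounts to a set difference between the distinct observed values and the allowed ones. The plain-Python sketch below only illustrates that idea and is not how Great Expectations computes it; `distinct_values_in_set` and the `"widowed"` category are hypothetical.

```python
def distinct_values_in_set(values, value_set):
    """Return (success, unexpected), where unexpected lists distinct values outside value_set."""
    unexpected = sorted(set(values) - set(value_set))
    return (not unexpected, unexpected)


allowed = ["married", "single", "divorced"]

# A batch that only contains known categories passes
ok, extra = distinct_values_in_set(["married", "single", "married"], allowed)

# A batch with a new category fails and reports the offending value
bad, extra_bad = distinct_values_in_set(["married", "widowed"], allowed)
print(ok, extra, bad, extra_bad)
```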

After adding the expectation to our suite, we save the current state of the context with the new version of the suite.

In [18]:
context.save_expectation_suite(expectation_suite=suite_bank)

'C:\\Users\\henri\\OneDrive\\Desktop\\PPG\\MLOps\\Lab 1 - Data Unit tests\\gx\\expectations/Bank.json'
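To illustrate why a dict representation is convenient, the sketch below round-trips the expectation printed earlier through a JSON file. The surrounding `suite_dict` layout is a simplified assumption; the file that Great Expectations writes holds more fields.

```python
import json
import os
import tempfile

# Expectation dict as printed when it was added to the suite
expectation = {
    "expectation_type": "expect_column_distinct_values_to_be_in_set",
    "kwargs": {"column": "marital", "value_set": ["married", "single", "divorced"]},
    "meta": {},
}

# Simplified, assumed stand-in for a suite
suite_dict = {"expectation_suite_name": "Bank", "expectations": [expectation]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Bank.json")
    # Write the suite to disk ...
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(suite_dict, fh, indent=2)
    # ... and load it back without touching the code that built it
    with open(path, encoding="utf-8") as fh:
        reloaded = json.load(fh)

print(reloaded == suite_dict)
```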

The next step is to define our data asset, which in this case is the data from August, and create the batch request.

In [19]:
data_asset_name = "Bank Aug"
try:
    data_asset = datasource.add_dataframe_asset(name=data_asset_name, dataframe= df_ana[df_ana["month"]=="aug"])
except Exception:
    print("The data asset already exists. The required one will be loaded.")
    data_asset = datasource.get_asset(data_asset_name)

The data asset already exists. The required one will be loaded.


In [20]:
batch_request = data_asset.build_batch_request(dataframe= df_ana[df_ana["month"]=="aug"])

Validating your data against an expectation suite is done by running a **validation operator**. A validation operator describes what should be done with your validation results.

For running a validation we need:
- A *validation operator* to handle the validation results
- A list of *batches*, each consisting of
    - A batch of data to check
    - expectation suites to check against
    
We can include this validation in a simple checkpoint of the data, so that our validation pipeline can be a sequence of several checkpoint layers with different actions, depending on the results of the data validation.

<img src="figures/checkpoint.png" width=600px>

In [21]:
checkpoint = gx.checkpoint.SimpleCheckpoint(
    name="checkpoint_marital",
    data_context=context,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "Bank",
        },
    ],
)
checkpoint_result = checkpoint.run()


Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

After running our checkpoint, we inspect the result and then add the checkpoint to our context.

In [22]:
checkpoint_result

{
  "run_id": {
    "run_name": null,
    "run_time": "2024-06-16T01:48:18.997923+01:00"
  },
  "run_results": {
    "ValidationResultIdentifier::Bank/__none__/20240616T004818.997923Z/bank_datasource-Bank Aug": {
      "validation_result": {
        "success": true,
        "results": [
          {
            "success": true,
            "expectation_config": {
              "expectation_type": "expect_column_distinct_values_to_be_in_set",
              "kwargs": {
                "column": "marital",
                "value_set": [
                  "married",
                  "single",
                  "divorced"
                ],
                "batch_id": "bank_datasource-Bank Aug"
              },
              "meta": {}
            },
            "result": {
              "observed_value": [
                "divorced",
                "married",
                "single"
              ],
              "details": {
                "value_counts": [
                  {
        

In [23]:
context.add_checkpoint(checkpoint=checkpoint)

CheckpointError: A Checkpoint named checkpoint_marital already exists.

The results of the checkpoint are saved in a dictionary. We should look at the `success` field to see the final outcome, and at the `details` fields to check some data statistics related to the expectations in the suite.
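As a minimal sketch of reading that structure, the function below collects the expectations that did not pass. It runs on a hypothetical, heavily trimmed plain dict with the same nesting as the output above; the `"example-run"` key and `failed_expectations` are illustrative names.

```python
def failed_expectations(checkpoint_result):
    """Return (expectation_type, column) for every expectation that did not pass."""
    failures = []
    for run in checkpoint_result["run_results"].values():
        for res in run["validation_result"]["results"]:
            if not res["success"]:
                config = res["expectation_config"]
                failures.append((config["expectation_type"], config["kwargs"].get("column")))
    return failures


# Hypothetical trimmed result: one passing and one failing expectation
toy_result = {
    "run_results": {
        "example-run": {
            "validation_result": {
                "success": False,
                "results": [
                    {
                        "success": True,
                        "expectation_config": {
                            "expectation_type": "expect_column_distinct_values_to_be_in_set",
                            "kwargs": {"column": "marital"},
                        },
                    },
                    {
                        "success": False,
                        "expectation_config": {
                            "expectation_type": "expect_column_values_to_be_between",
                            "kwargs": {"column": "balance"},
                        },
                    },
                ],
            }
        }
    }
}

print(failed_expectations(toy_result))
```

A pipeline step could stop or raise an alert whenever this list is not empty.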

We can rework the layout to get a friendlier output and check the results of the applied suite.

In [24]:
def get_validation_results(checkpoint_result):
    # run_results is a dictionary; take its first (here, only) key-value pair
    validation_result_key, validation_result_data = next(iter(checkpoint_result["run_results"].items()))

    # Access the 'validation_result' entry of that run
    validation_result_ = validation_result_data.get('validation_result', {})

    # Access the per-expectation 'results' and the run metadata
    results = validation_result_["results"]
    meta = validation_result_["meta"]
    use_case = meta.get('expectation_suite_name')
    
    
    df_validation = pd.DataFrame({},columns=["Success","Expectation Type","Column","Column Pair","Max Value",\
                                       "Min Value","Element Count","Unexpected Count","Unexpected Percent","Value Set","Unexpected Value","Observed Value"])
    
    
    for result in results:
        # Process each result dictionary as needed
        success = result.get('success', '')
        expectation_type = result.get('expectation_config', {}).get('expectation_type', '')
        column = result.get('expectation_config', {}).get('kwargs', {}).get('column', '')
        column_A = result.get('expectation_config', {}).get('kwargs', {}).get('column_A', '')
        column_B = result.get('expectation_config', {}).get('kwargs', {}).get('column_B', '')
        value_set = result.get('expectation_config', {}).get('kwargs', {}).get('value_set', '')
        max_value = result.get('expectation_config', {}).get('kwargs', {}).get('max_value', '')
        min_value = result.get('expectation_config', {}).get('kwargs', {}).get('min_value', '')

        element_count = result.get('result', {}).get('element_count', '')
        unexpected_count = result.get('result', {}).get('unexpected_count', '')
        unexpected_percent = result.get('result', {}).get('unexpected_percent', '')
        observed_value = result.get('result', {}).get('observed_value', '')
        if isinstance(observed_value, list) and value_set:
            # observed_value is not always a list; only compare it when a value_set was given
            unexpected_value = [item for item in observed_value if item not in value_set]
        else:
            unexpected_value = []
        
        df_validation = pd.concat([df_validation, pd.DataFrame.from_dict( [{"Success" :success,"Expectation Type" :expectation_type,"Column" : column,"Column Pair" : (column_A,column_B),"Max Value" :max_value,\
                                           "Min Value" :min_value,"Element Count" :element_count,"Unexpected Count" :unexpected_count,"Unexpected Percent":unexpected_percent,\
                                                  "Value Set" : value_set,"Unexpected Value" :unexpected_value ,"Observed Value" :observed_value}])], ignore_index=True)
        
    return df_validation

In [25]:
df_validation = get_validation_results(checkpoint_result)

In [26]:
df_validation

Unnamed: 0,Success,Expectation Type,Column,Column Pair,Max Value,Min Value,Element Count,Unexpected Count,Unexpected Percent,Value Set,Unexpected Value,Observed Value
0,True,expect_column_distinct_values_to_be_in_set,marital,"(, )",,,,,,"[married, single, divorced]",[],"[divorced, married, single]"


We can do the same exercise for a different column. Let's choose a numerical one and explore it.

In [27]:
df_ref.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [28]:
df_ref["age"].describe()

count    33463.000000
mean        40.253444
std         10.315007
min         18.000000
25%         32.000000
50%         38.000000
75%         47.000000
max         95.000000
Name: age, dtype: float64

In [29]:
df_ref["balance"].describe()

count     33463.000000
mean       1176.354182
std        2648.034750
min       -8019.000000
25%          60.000000
50%         402.000000
75%        1255.000000
max      102127.000000
Name: balance, dtype: float64

We are going to add an expectation to the suite based on the distribution of the values.

In [30]:
expectation_age = ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={
        "column": "age",
        "max_value": 100,
        "min_value": 18
    },
)
suite_bank.add_expectation(expectation_configuration=expectation_age)

{"expectation_type": "expect_column_values_to_be_between", "kwargs": {"column": "age", "max_value": 100, "min_value": 18}, "meta": {}}

The same can be done for the balance. Nevertheless, let's create an expectation that will most likely block the data, since the balance can contain negative values!

In [31]:
expectation_balance = ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_between",
    kwargs={
        "column": "balance",
        "max_value": 105000,
        "min_value": 0
    },
)
suite_bank.add_expectation(expectation_configuration=expectation_balance)

{"expectation_type": "expect_column_values_to_be_between", "kwargs": {"column": "balance", "max_value": 105000, "min_value": 0}, "meta": {}}

In [32]:
print(suite_bank.show_expectations_by_expectation_type())

[ { 'expect_column_distinct_values_to_be_in_set': { 'column': 'marital',
                                                    'domain': 'column',
                                                    'value_set': [ 'married',
                                                                   'single',
                                                                   'divorced']}},
  { 'expect_column_values_to_be_between': { 'column': 'age',
                                            'domain': 'column',
                                            'max_value': 100,
                                            'min_value': 18}},
  { 'expect_column_values_to_be_between': { 'column': 'balance',
                                            'domain': 'column',
                                            'max_value': 105000,
                                            'min_value': 0}}]
None


In [33]:
context.save_expectation_suite(expectation_suite=suite_bank)

'C:\\Users\\henri\\OneDrive\\Desktop\\PPG\\MLOps\\Lab 1 - Data Unit tests\\gx\\expectations/Bank.json'

In [34]:
checkpoint = gx.checkpoint.SimpleCheckpoint(
    name="checkpoint_extra",
    data_context=context,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "Bank",
        },
    ],
)
checkpoint_result = checkpoint.run()

Calculating Metrics:   0%|          | 0/18 [00:00<?, ?it/s]

In [35]:
df_validation = get_validation_results(checkpoint_result)

In [36]:
df_validation

Unnamed: 0,Success,Expectation Type,Column,Column Pair,Max Value,Min Value,Element Count,Unexpected Count,Unexpected Percent,Value Set,Unexpected Value,Observed Value
0,True,expect_column_distinct_values_to_be_in_set,marital,"(, )",,,,,,"[married, single, divorced]",[],"[divorced, married, single]"
1,True,expect_column_values_to_be_between,age,"(, )",100.0,18.0,6247.0,0.0,0.0,,[],
2,False,expect_column_values_to_be_between,balance,"(, )",105000.0,0.0,6247.0,298.0,4.77029,,[],


As expected, the expectation for the balance column failed.
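The reported figures for the balance column can be reproduced by hand: the unexpected percentage is the unexpected count divided by the element count, times 100. The sketch below is a plain-Python illustration on hypothetical balances, assuming no missing values; `between_summary` is not a Great Expectations function.

```python
def between_summary(values, min_value, max_value):
    """Count values outside [min_value, max_value] and express them as a percentage."""
    element_count = len(values)
    unexpected_count = sum(1 for v in values if not (min_value <= v <= max_value))
    unexpected_percent = 100.0 * unexpected_count / element_count if element_count else 0.0
    return element_count, unexpected_count, unexpected_percent


# Hypothetical balances: two of the five are negative
summary = between_summary([-50, 0, 120, -8, 3000], min_value=0, max_value=105000)
print(summary)

# Figures reported in the table for August: 298 unexpected values out of 6247 rows
august_percent = 100.0 * 298 / 6247
print(round(august_percent, 5))
```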

<a id="section-setup"></a>
## Industrialized suites

Until now, we have generated the expectations manually, analyzing the data column by column and then adding each expectation to the created suite. From here on, we will try to automate the process by running a basic profiling of the reference dataset and then creating a suite based on the generated analysis.

To accomplish that, we are going to use a package named **ydata_profiling** from https://docs.profiling.ydata.ai. It is the successor of the old pandas profiling package and also includes Spark functionality.
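Before bringing in the profiler, the idea of deriving expectations from reference data can be sketched by hand. The rules below (observed min and max for numeric columns, observed categories otherwise) are a deliberate simplification and not the algorithm used by ydata_profiling; `expectations_from_reference` and the toy dataframe are hypothetical.

```python
import pandas as pd


def expectations_from_reference(df):
    """Build simple expectation dicts from a reference dataframe (illustrative rules only)."""
    expectations = []
    for column in df.columns:
        series = df[column]
        if pd.api.types.is_numeric_dtype(series):
            # Numeric column: bound it by the range seen in the reference data
            kwargs = {"column": column, "min_value": series.min(), "max_value": series.max()}
            expectation_type = "expect_column_values_to_be_between"
        else:
            # Other columns: restrict them to the categories seen in the reference data
            kwargs = {"column": column, "value_set": sorted(series.dropna().unique().tolist())}
            expectation_type = "expect_column_values_to_be_in_set"
        expectations.append({"expectation_type": expectation_type, "kwargs": kwargs})
    return expectations


# Hypothetical reference data
toy_ref = pd.DataFrame({"age": [18, 40, 95], "marital": ["married", "single", "married"]})
generated = expectations_from_reference(toy_ref)
for item in generated:
    print(item)
```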

In [37]:
#!pip install ydata_profiling==4.7.0

In [38]:
from ydata_profiling import ProfileReport

In [39]:
from ydata_profiling.config import Settings
from ydata_profiling.model import BaseDescription, expectation_algorithms
from ydata_profiling.model.handler import Handler
from ydata_profiling.utils.dataframe import slugify

Generate a report based on the reference dataframe:

In [40]:
profile = ProfileReport(df_ref, title="Bank Profiling Report", minimal=True)

In [41]:
profile

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



Next, we create a class that transforms the outputs of the profile into Great Expectations suites.

In [42]:
from typing import Any, Optional

import pandas as pd
from visions import VisionsTypeset

from ydata_profiling.config import Settings
from ydata_profiling.model import BaseDescription, expectation_algorithms
from ydata_profiling.model.handler import Handler
from ydata_profiling.utils.dataframe import slugify
from great_expectations.checkpoint import SimpleCheckpoint

class ExpectationHandler(Handler):
    """Default handler"""

    def __init__(self, typeset: VisionsTypeset, *args, **kwargs):
        mapping = {
            "Unsupported": [expectation_algorithms.generic_expectations],
            "Text": [expectation_algorithms.categorical_expectations],
            "Categorical": [expectation_algorithms.categorical_expectations],
            "Boolean": [expectation_algorithms.categorical_expectations],
            "Numeric": [expectation_algorithms.numeric_expectations],
            "URL": [expectation_algorithms.url_expectations],
            "File": [expectation_algorithms.file_expectations],
            "Path": [expectation_algorithms.path_expectations],
            "DateTime": [expectation_algorithms.datetime_expectations],
            "Image": [expectation_algorithms.image_expectations],
        }
        super().__init__(mapping, typeset, *args, **kwargs)


class ExpectationsReportV3:
    config: Settings
    df: Optional[pd.DataFrame] = None

    @property
    def typeset(self) -> Optional[VisionsTypeset]:
        return None

    def to_expectation_suite(
        self,
        datasource_name,
        data_asset_name,
        suite_name: Optional[str] = None,
        data_context: Optional[Any] = None,
        save_suite: bool = True,
        run_validation: bool = True,
        build_data_docs: bool = True,
        handler: Optional[Handler] = None,
    ) -> Any:
        """
        All boolean parameters default to True to make it easier to access the full functionality of Great
        Expectations out of the box.
        Args:
            datasource_name: The name of an existing data source in the data context
            data_asset_name: The name of an existing data asset in that data source
            suite_name: The name of your expectation suite
            data_context: A user-specified data context
            save_suite: Boolean to determine whether to save the suite to .json as part of the method
            run_validation: Boolean to determine whether to run validation as part of the method
            build_data_docs: Boolean to determine whether to build data docs, save the .html file, and open data docs in
                your browser
            handler: The handler to use for building expectation

        Returns:
            An ExpectationSuite
        """
        try:
            import great_expectations as ge
        except ImportError as ex:
            raise ImportError(
                "Please install great expectations before using the expectation functionality"
            ) from ex

        # Use report title if suite is empty
        if suite_name is None:
            suite_name = slugify(self.config.title)

        # Use the default handler if none
        if handler is None:
            handler = ExpectationHandler(self.typeset)

        # Obtain the ge context and create the expectation suite
        if not data_context:
            data_context = ge.data_context.DataContext()

        data_asset = data_context.get_datasource(datasource_name).get_asset(data_asset_name)

        batch_request = data_asset.build_batch_request()

        suite = data_context.add_or_update_expectation_suite(expectation_suite_name=suite_name)


        # Get a validator bound to the batch and the (initially empty) suite
        validator = data_context.get_validator(batch_request=batch_request, expectation_suite=suite)

        # Obtain the profiling summary
        summary: BaseDescription = self.get_description()  # type: ignore

        # Dispatch to expectations per semantic variable type
        for name, variable_summary in summary.variables.items():
            handler.handle(variable_summary["type"], name, variable_summary, validator)

        # We don't actually update the suite object on the batch in place, so need
        # to get the populated suite from the batch
        suite = validator.get_expectation_suite(discard_failed_expectations=False)
        data_context.update_expectation_suite(suite)
        
        validation_result_identifier = None
        if run_validation:
            checkpoint_config = {
                "class_name": "SimpleCheckpoint",
                "validations": [
                    {
                        "batch_request": batch_request,
                        "expectation_suite_name": suite_name,
                    }
                ]
            }
            checkpoint = SimpleCheckpoint(
                f"_tmp_checkpoint_{suite_name}",
                data_context,
                suite,
                **checkpoint_config,

            )
            results = checkpoint.run(result_format="SUMMARY", run_name=suite_name)
            validation_result_identifier = results.list_validation_result_identifiers()[0]



        # Write expectations and open data docs
        if save_suite or build_data_docs:
            data_context.update_expectation_suite(suite)

        if build_data_docs:
            data_context.build_data_docs()


        return validator.get_expectation_suite()
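The handler above maps each semantic type to a list of functions, and the loop over `summary.variables` calls every function registered for a variable's type. A stripped-down sketch of that dispatch pattern follows; `MiniHandler` and both rule functions are hypothetical stand-ins, not the ydata_profiling `Handler` or its `expectation_algorithms`.

```python
def numeric_rules(name, summary, collected):
    """Hypothetical rule: bound a numeric column by its observed min and max."""
    collected.append(("between", name, summary["min"], summary["max"]))


def categorical_rules(name, summary, collected):
    """Hypothetical rule: restrict a categorical column to its observed categories."""
    collected.append(("in_set", name, sorted(summary["categories"])))


class MiniHandler:
    """Toy dispatcher: semantic type name -> list of rule functions."""

    def __init__(self, mapping):
        self.mapping = mapping

    def handle(self, dtype, name, summary, collected):
        # Types without an entry in the mapping add no rule
        for rule in self.mapping.get(dtype, []):
            rule(name, summary, collected)


handler = MiniHandler({"Numeric": [numeric_rules], "Categorical": [categorical_rules]})
variables = {
    "age": {"type": "Numeric", "min": 18, "max": 95},
    "marital": {"type": "Categorical", "categories": ["single", "married", "divorced"]},
}

collected = []
for name, summary in variables.items():
    handler.handle(summary["type"], name, summary, collected)
print(collected)
```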

In [43]:
from ydata_profiling.expectations_report import ExpectationsReport

We patch the ydata_profiling class so that it converts the profile to expectations:

In [44]:
ExpectationsReport.to_expectation_suite = ExpectationsReportV3.to_expectation_suite
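The assignment above rebinds a method on the class itself, so objects that look the method up through that class (including subclasses and instances created earlier) pick up the replacement. A minimal sketch with hypothetical classes, assuming `ProfileReport` reaches `to_expectation_suite` through `ExpectationsReport` in the same way:

```python
class Report:
    def to_suite(self):
        return "original"


class Profile(Report):
    """Subclass that inherits to_suite from Report."""


class ReportV2:
    def to_suite(self):
        return "patched"


existing = Profile()  # created before the patch
before = existing.to_suite()

# Rebind the method on the base class; attribute lookup goes through the class,
# so existing and future instances both see the replacement
Report.to_suite = ReportV2.to_suite

after = existing.to_suite()
print(before, after)
```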

In [45]:
datasource_name = "bank_datasource"

It is then possible to create a new suite for the Bank data:

In [46]:
new_suite_bank = profile.to_expectation_suite(
    datasource_name = datasource_name,
    data_asset_name="Bank Aug",
    suite_name = 'Bank New',
    data_context=context
)

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/173 [00:00<?, ?it/s]

Save the context with the new suite.

In [47]:
context.save_expectation_suite(expectation_suite=new_suite_bank)

'C:\\Users\\henri\\OneDrive\\Desktop\\PPG\\MLOps\\Lab 1 - Data Unit tests\\gx\\expectations/Bank New.json'

Create a new checkpoint for the data.

In [48]:
checkpoint = gx.checkpoint.SimpleCheckpoint(
    name="checkpoint_full",
    data_context=context,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "Bank New",
        },
    ],
)
checkpoint_result = checkpoint.run()

Calculating Metrics:   0%|          | 0/163 [00:00<?, ?it/s]

In [49]:
df_validation = get_validation_results(checkpoint_result)

In [51]:
df_validation.head(100)

Unnamed: 0,Success,Expectation Type,Column,Column Pair,Max Value,Min Value,Element Count,Unexpected Count,Unexpected Percent,Value Set,Unexpected Value,Observed Value
0,True,expect_column_to_exist,age,"(, )",,,,,,,[],
1,True,expect_column_values_to_not_be_null,age,"(, )",,,6247.0,0.0,0.0,,[],
2,True,expect_column_values_to_be_in_type_list,age,"(, )",,,,,,,[],int64
3,True,expect_column_values_to_be_between,age,"(, )",95.0,18.0,6247.0,0.0,0.0,,[],
4,True,expect_column_to_exist,job,"(, )",,,,,,,[],
5,True,expect_column_values_to_not_be_null,job,"(, )",,,6247.0,0.0,0.0,,[],
6,True,expect_column_values_to_be_in_set,job,"(, )",,,6247.0,0.0,0.0,"[housemaid, management, student, services, blu...",[],
7,True,expect_column_to_exist,marital,"(, )",,,,,,,[],
8,True,expect_column_values_to_not_be_null,marital,"(, )",,,6247.0,0.0,0.0,,[],
9,True,expect_column_values_to_be_in_set,marital,"(, )",,,6247.0,0.0,0.0,"[married, divorced, single]",[],


In [52]:
df_validation[df_validation["Success"]==False]

Unnamed: 0,Success,Expectation Type,Column,Column Pair,Max Value,Min Value,Element Count,Unexpected Count,Unexpected Percent,Value Set,Unexpected Value,Observed Value
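The empty table above indicates that no expectation in the `Bank New` suite failed for the August batch. When failures do occur, a per-column pass rate can make the table easier to scan. The sketch below uses hypothetical rows shaped like the validation table; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the validation table above
toy_validation = pd.DataFrame(
    {
        "Success": [True, True, False, True],
        "Expectation Type": [
            "expect_column_to_exist",
            "expect_column_values_to_be_between",
            "expect_column_values_to_be_between",
            "expect_column_to_exist",
        ],
        "Column": ["age", "age", "balance", "marital"],
    }
)

# Cast to bool first, since the column may be stored with object dtype
success = toy_validation["Success"].astype(bool)

# Rows for expectations that did not pass
failed = toy_validation[~success]

# Share of passing expectations per column
pass_rate = success.groupby(toy_validation["Column"]).mean()
print(failed)
print(pass_rate)
```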
