One thing to note: when I installed great_expectations through conda without pinning a version, it installed a very old release. If you run into any issues with this notebook, check the version of your installation first. After I pinned the version for conda (=0.15.2), it installed the newer release.
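A quick way to check which release is installed, sketched with the standard library only. The helper name installed_version is made up for this example; the distribution name great_expectations is taken from the note above.

```python
from importlib import metadata


def installed_version(package="great_expectations"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("great_expectations is not installed in this environment")
else:
    print(f"great_expectations {version} is installed")
```

If the printed version is much older than 0.15.2, the cells below may not behave as shown.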

In [2]:
import great_expectations as ge
import pandas as pd

from great_expectations.dataset import (
    PandasDataset,
    MetaPandasDataset
)

Load the dataset containing Basel weather data from MeteoBlue into a pandas DataFrame

In [3]:
df = pd.read_csv("data/weather_basel.csv")
df

Unnamed: 0,timestamp,temperature,relative_humidity,wind_speed,wind_direction
0,20160406T0000,11.710529,97.0,4.680000,270.000000
1,20160406T0100,11.180529,92.0,18.837322,296.075350
2,20160406T0200,10.690529,94.0,8.942214,310.100920
3,20160406T0300,10.330529,92.0,10.446206,271.974900
4,20160406T0400,9.890529,89.0,14.044615,271.468800
...,...,...,...,...,...
53253,20220503T1900,20.170528,36.0,7.568566,25.346160
53254,20220503T2000,19.360529,38.0,5.483357,23.198593
53255,20220503T2100,18.550530,42.0,5.241679,15.945404
53256,20220503T2200,16.950530,52.0,3.096837,35.537660


Create a great_expectations PandasDataset, which is a pandas DataFrame with additional methods from Great Expectations

In [4]:
ge_df = ge.from_pandas(df)
ge_df

Unnamed: 0,timestamp,temperature,relative_humidity,wind_speed,wind_direction
0,20160406T0000,11.710529,97.0,4.680000,270.000000
1,20160406T0100,11.180529,92.0,18.837322,296.075350
2,20160406T0200,10.690529,94.0,8.942214,310.100920
3,20160406T0300,10.330529,92.0,10.446206,271.974900
4,20160406T0400,9.890529,89.0,14.044615,271.468800
...,...,...,...,...,...
53253,20220503T1900,20.170528,36.0,7.568566,25.346160
53254,20220503T2000,19.360529,38.0,5.483357,23.198593
53255,20220503T2100,18.550530,42.0,5.241679,15.945404
53256,20220503T2200,16.950530,52.0,3.096837,35.537660


An expectation is like a unit test but for data

In [5]:
ge_df.expect_column_values_to_be_between("temperature", min_value=-20, max_value=50, result_format="COMPLETE")

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 53258,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 1,
    "unexpected_percent": 0.0018776521837094895,
    "unexpected_percent_total": 0.0018776521837094895,
    "unexpected_percent_nonmissing": 0.0018776521837094895,
    "partial_unexpected_list": [
      56005287.0
    ],
    "partial_unexpected_index_list": [
      15776
    ],
    "partial_unexpected_counts": [
      {
        "value": 56005287.0,
        "count": 1
      }
    ],
    "unexpected_list": [
      56005287.0
    ],
    "unexpected_index_list": [
      15776
    ]
  },
  "success": false
}
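To make the "unit test for data" idea concrete, here is a plain-Python sketch of what a between check computes for every element. It is not Great Expectations' implementation: the function check_values_between and the sample list are hypothetical, and only a few of the result fields shown above are reproduced.

```python
import math


def check_values_between(values, min_value, max_value, strict_min=False, strict_max=False):
    """Sketch of a column-map 'between' check; missing values are skipped, not failed."""

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    def in_range(v):
        above = v > min_value if strict_min else v >= min_value
        below = v < max_value if strict_max else v <= max_value
        return above and below

    nonmissing = [(i, v) for i, v in enumerate(values) if not is_missing(v)]
    unexpected = [(i, v) for i, v in nonmissing if not in_range(v)]
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(values),
            "missing_count": len(values) - len(nonmissing),
            "unexpected_count": len(unexpected),
            "unexpected_list": [v for _, v in unexpected],
            "unexpected_index_list": [i for i, _ in unexpected],
        },
    }


# Hypothetical sample: one corrupt reading and one missing value.
sample = [11.7, 11.2, 56005287.0, None, 9.9]
report = check_values_between(sample, min_value=-20, max_value=50)
print(report["success"])                          # False
print(report["result"]["unexpected_index_list"])  # [2]
```

As in the real output, a single out-of-range element is enough to make the whole check fail, and its index tells you where to look.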

In [6]:
ge_df.iloc[15776]

timestamp                 Hello
temperature          56005287.0
relative_humidity          94.0
wind_speed             8.557102
wind_direction        255.37914
Name: 15776, dtype: object

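Expectations accept further parameters. Here, strict_min and strict_max make both bounds exclusive, so readings of exactly 100.0 count as unexpected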

In [7]:
ge_df.expect_column_values_to_be_between("relative_humidity", min_value=0, max_value=100, strict_min=True, strict_max=True)

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 53258,
    "missing_count": 1,
    "missing_percent": 0.0018776521837094895,
    "unexpected_count": 22,
    "unexpected_percent": 0.041309123683271685,
    "unexpected_percent_total": 0.041308348041608774,
    "unexpected_percent_nonmissing": 0.041309123683271685,
    "partial_unexpected_list": [
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0,
      100.0
    ]
  },
  "success": false
}
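The three percentages in this result differ only in their denominator. The arithmetic below reproduces them from the counts printed above; the variable names are chosen for this sketch.

```python
# Counts copied from the result above.
element_count = 53258
missing_count = 1
unexpected_count = 22

nonmissing_count = element_count - missing_count

# Share of unexpected values among the non-missing elements.
unexpected_percent_nonmissing = 100 * unexpected_count / nonmissing_count
# Share of unexpected values among all elements, missing ones included.
unexpected_percent_total = 100 * unexpected_count / element_count
# Share of missing values among all elements.
missing_percent = 100 * missing_count / element_count

print(unexpected_percent_nonmissing)
print(unexpected_percent_total)
print(missing_percent)
```

In this output, unexpected_percent has the same value as unexpected_percent_nonmissing, so it is also measured against the non-missing elements.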

The expectations used so far are called ColumnMapExpectations, since they apply a requirement to every single element in the column

In [8]:
ge_df.expect_column_values_to_not_be_null("relative_humidity", result_format="COMPLETE")

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 53258,
    "unexpected_count": 1,
    "unexpected_percent": 0.0018776521837094895,
    "unexpected_percent_total": 0.0018776521837094895,
    "partial_unexpected_list": [],
    "unexpected_list": [
      null
    ],
    "unexpected_index_list": [
      31534
    ]
  },
  "success": false
}

We can also target expectations at the table itself, not just single columns. This is called a table expectation

In [9]:
ge_df.expect_column_to_exist("relative_humidity")

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {},
  "success": true
}

And there are also expectations that target a whole column at a time. These are called ColumnAggregateExpectations, since they aggregate the whole column into a single value

In [10]:
ge_df.expect_column_stdev_to_be_between("wind_speed", 0, 5)

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "observed_value": 8.526439457629214,
    "element_count": 53258,
    "missing_count": null,
    "missing_percent": null
  },
  "success": false
}
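For contrast with the element-wise checks, here is a plain-Python sketch of an aggregate check: the column is reduced to one number, and only that number is compared with the bounds. The function check_stdev_between is hypothetical, and using the sample standard deviation (n - 1 in the denominator) is an assumption of this sketch.

```python
import statistics


def check_stdev_between(values, min_value, max_value):
    """Reduce the column to a single statistic, then test that statistic."""
    observed = statistics.stdev(values)  # sample standard deviation
    return {
        "success": min_value <= observed <= max_value,
        "result": {"observed_value": observed},
    }


# First five wind_speed readings from the table above, rounded to two decimals.
wind_speed = [4.68, 18.84, 8.94, 10.45, 14.04]
report = check_stdev_between(wind_speed, 0, 5)
print(report["success"])  # False
print(report["result"]["observed_value"])
```

Because there is only one observed value, an aggregate result has no list of unexpected elements or indices.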

To create new expectations in this notebook example, we will use a custom class inheriting from PandasDataset.  
  
For column_map_expectations, the argument column is a pandas Series (the values of the specified column).  
For column_aggregate_expectations, the argument column is the column name as passed into the expectation. To access the column itself, use self[column].
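Why the decorated method sees values while the caller passes a name can be illustrated with a toy decorator. This is a simplified stand-in written for this note, not the Great Expectations decorator: ToyDataset, its columns dict, and the expectation name are all hypothetical.

```python
import functools


def column_map_expectation(func):
    """Toy decorator: resolve the column name to its values before calling func."""

    @functools.wraps(func)
    def wrapper(self, column, *args, **kwargs):
        values = self.columns[column]                # caller passes the column name
        flags = func(self, values, *args, **kwargs)  # func returns one boolean per element
        unexpected = [i for i, ok in enumerate(flags) if not ok]
        return {"success": not unexpected, "unexpected_index_list": unexpected}

    return wrapper


class ToyDataset:
    def __init__(self, columns):
        self.columns = columns  # maps column name to a list of values

    @column_map_expectation
    def expect_column_values_to_be_positive(self, column):
        # Inside the method, `column` is already the list of values.
        return [v > 0 for v in column]


ds = ToyDataset({"wind_speed": [4.68, -1.0, 8.94]})
print(ds.expect_column_values_to_be_positive("wind_speed"))
# {'success': False, 'unexpected_index_list': [1]}
```

The method body only has to return a boolean per element; building the result dictionary is left to the wrapper.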

In [11]:
class MyCustomPandasDataset(PandasDataset):
    _data_asset_type = "MyCustomPandasDataset"

    # Each column-map expectation receives the column as a pandas Series and
    # returns a boolean Series in which True marks an expected element.
    @MetaPandasDataset.column_map_expectation
    def expect_column_longest_interval_of_same_values_to_be_under(self, column, max_value, strict_max=False):
        # The cumulative sum of absolute differences stays constant within a
        # run of identical values, so every run shares one value in s.
        s = column.diff().abs().cumsum()
        value_counts = s.value_counts()  # number of elements per run
        output = s.map(lambda x: True)
        for value in value_counts.index:
            if value_counts[value] >= max_value:
                # As written, the default also flags runs of exactly max_value,
                # while strict_max=True flags only runs longer than max_value.
                if not strict_max or value_counts[value] > max_value:
                    # Flag the first element of the offending run.
                    index = s[s == value].index[0]
                    output[index] = False
        return output

    @MetaPandasDataset.column_map_expectation
    def expect_column_largest_diff_to_be_under(self, column, max_value, strict_max=False):
        # diff() is signed, so only increases are limited; decreases always pass.
        s = column.diff().fillna(0)

        if max_value is not None:
            if strict_max:
                return s < max_value
            else:
                return s <= max_value
        else:
            return s.map(lambda x: True)

    @MetaPandasDataset.column_map_expectation
    def expect_column_value_z_scores_to_be_less_than(self, column, lower_threshold, upper_threshold):
        # Standardize once, then require lower_threshold <= z <= upper_threshold.
        z_scores = (column - column.mean()) / column.std()
        return (z_scores >= lower_threshold) & (z_scores <= upper_threshold)

custom_ge_df = ge.from_pandas(df, dataset_class=MyCustomPandasDataset)
custom_ge_df

Unnamed: 0,timestamp,temperature,relative_humidity,wind_speed,wind_direction
0,20160406T0000,11.710529,97.0,4.680000,270.000000
1,20160406T0100,11.180529,92.0,18.837322,296.075350
2,20160406T0200,10.690529,94.0,8.942214,310.100920
3,20160406T0300,10.330529,92.0,10.446206,271.974900
4,20160406T0400,9.890529,89.0,14.044615,271.468800
...,...,...,...,...,...
53253,20220503T1900,20.170528,36.0,7.568566,25.346160
53254,20220503T2000,19.360529,38.0,5.483357,23.198593
53255,20220503T2100,18.550530,42.0,5.241679,15.945404
53256,20220503T2200,16.950530,52.0,3.096837,35.537660
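The first custom expectation looks for long runs of identical consecutive values. The same idea can be sketched without pandas using itertools.groupby. This is an alternative illustration, not the method used in the class above, and the helper names and sample readings are made up.

```python
from itertools import groupby


def longest_run(values):
    """Length of the longest run of identical consecutive values (0 for empty input)."""
    return max((sum(1 for _ in group) for _, group in groupby(values)), default=0)


def runs_at_least(values, min_length):
    """Return (start_index, value, length) for every run of at least min_length elements."""
    found = []
    start = 0
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        if length >= min_length:
            found.append((start, value, length))
        start += length
    return found


# Hypothetical humidity readings with a sensor stuck at 100.0 for four hours.
humidity = [97.0, 92.0, 100.0, 100.0, 100.0, 100.0, 89.0, 89.0, 94.0]
print(longest_run(humidity))       # 4
print(runs_at_least(humidity, 3))  # [(2, 100.0, 4)]
```

Reporting the start index of each run mirrors how the custom expectation marks one element per offending run rather than every element in it.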


Custom expectations can now be called the same way as the built-in expectations

In [12]:
custom_ge_df.expect_column_longest_interval_of_same_values_to_be_under("relative_humidity", max_value=10, result_format="COMPLETE", mostly=0.99)

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 53258,
    "missing_count": 1,
    "missing_percent": 0.0018776521837094895,
    "unexpected_count": 2,
    "unexpected_percent": 0.0037553748802974254,
    "unexpected_percent_total": 0.003755304367418979,
    "unexpected_percent_nonmissing": 0.0037553748802974254,
    "partial_unexpected_list": [
      100.0,
      89.0
    ],
    "partial_unexpected_index_list": [
      16338,
      24118
    ],
    "partial_unexpected_counts": [
      {
        "value": 100.0,
        "count": 1
      },
      {
        "value": 89.0,
        "count": 1
      }
    ],
    "unexpected_list": [
      100.0,
      89.0
    ],
    "unexpected_index_list": [
      16338,
      24118
    ]
  },
  "success": true
}
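The result above reports two unexpected elements and still succeeds, because mostly=0.99 only requires 99% of the non-missing elements to pass. A sketch of that decision rule follows. The function passes_with_mostly is hypothetical, and treating a column with no non-missing values as a success is an assumption of this sketch.

```python
def passes_with_mostly(unexpected_count, nonmissing_count, mostly=1.0):
    """Succeed when the fraction of expected elements reaches the mostly threshold."""
    if nonmissing_count == 0:
        return True  # assumption: nothing to check counts as a success
    expected_fraction = 1 - unexpected_count / nonmissing_count
    return expected_fraction >= mostly


# Counts from the result above: 2 unexpected among 53258 - 1 non-missing elements.
print(passes_with_mostly(2, 53257, mostly=0.99))  # True
print(passes_with_mostly(2, 53257))               # False
```

Without mostly, the same two flagged runs would make the expectation fail.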

An expectation suite is a "collection" of expectations. It can be saved ...

In [13]:
ge_df.get_expectation_suite(discard_failed_expectations=False)

{
  "ge_cloud_id": null,
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_between",
      "meta": {},
      "kwargs": {
        "column": "temperature",
        "min_value": -20,
        "max_value": 50
      }
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "meta": {},
      "kwargs": {
        "column": "relative_humidity",
        "min_value": 0,
        "max_value": 100,
        "strict_min": true,
        "strict_max": true
      }
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "meta": {},
      "kwargs": {
        "column": "relative_humidity"
      }
    },
    {
      "expectation_type": "expect_column_to_exist",
      "meta": {},
      "kwargs": {
        "column": "relative_humidity"
      }
    },
    {
      "expectation_type": "expect_column_stdev_to_be_between",
      "meta": {},
      "kwargs": {
        "column": "wind_speed",
        "min_value": 0,
        "max_value

In [14]:
ge_df.save_expectation_suite("exp_suite.json", discard_failed_expectations=False)

... and applied to any other dataset

In [15]:
iselin_df = pd.read_csv("data/weather_iselin.csv")
new_ge_df = ge.from_pandas(iselin_df, dataset_class=MyCustomPandasDataset)
new_ge_df

Unnamed: 0,timestamp,temperature,relative_humidity,wind_speed,wind_direction
0,20080101T0000,,,,
1,20080101T0100,,,,
2,20080101T0200,-0.083471,87.0,0.000000,180.00000
3,20080101T0300,1.926529,87.0,3.240000,89.99999
4,20080101T0400,1.636529,87.0,1.080000,180.00000
...,...,...,...,...,...
125707,20220504T1900,17.426529,52.0,9.957109,282.52880
125708,20220504T2000,16.776527,56.0,9.178235,281.30994
125709,20220504T2100,15.326529,62.0,8.707238,277.12500
125710,20220504T2200,14.316529,66.0,7.952660,264.80557


In [16]:
new_ge_df.validate(expectation_suite="exp_suite.json")

{
  "statistics": {
    "evaluated_expectations": 5,
    "successful_expectations": 2,
    "unsuccessful_expectations": 3,
    "success_percent": 40.0
  },
  "evaluation_parameters": {},
  "meta": {
    "great_expectations_version": "0.15.2",
    "expectation_suite_name": "default",
    "run_id": {
      "run_time": "2022-05-05T07:56:39.305753+00:00",
      "run_name": null
    },
    "batch_kwargs": {
      "ge_batch_id": "e4ebb562-cc48-11ec-af3a-98039b7fbb30"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20220505T075639.305695Z",
    "expectation_suite_meta": {
      "great_expectations_version": "0.15.2"
    }
  },
  "results": [
    {
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      },
      "expectation_config": {
        "expectation_type": "expect_column_values_to_be_between",
        "meta": {},
        "kwargs": {
          "column": "temperature",
   

Important: building an expectation suite this way overwrites an existing expectation whenever a new one of the same type is added on the same column.
If we add two expectations of the same type on the same column, but with different values ...

In [17]:
new_ge_df.expect_column_values_to_be_between("relative_humidity", min_value=0, max_value=100)
new_ge_df.expect_column_values_to_be_between("relative_humidity", min_value=-5, max_value=105)

{
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 125712,
    "missing_count": 2,
    "missing_percent": 0.0015909380170548554,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "success": true
}

... then the first one gets overwritten by the newer one

In [18]:
new_ge_df.get_expectation_suite(discard_failed_expectations=False)

{
  "ge_cloud_id": null,
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_be_between",
      "meta": {},
      "kwargs": {
        "column": "relative_humidity",
        "min_value": -5,
        "max_value": 105
      }
    }
  ],
  "expectation_suite_name": "default",
  "data_asset_type": "MyCustomPandasDataset",
  "meta": {
    "great_expectations_version": "0.15.2"
  }
}
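The overwriting behaves as if the suite were a dictionary keyed by expectation type and column. The toy model below illustrates that behavior as described in this notebook. It is not how Great Expectations stores its suites, and ToySuite is a hypothetical class.

```python
class ToySuite:
    """Toy model: at most one expectation per (expectation_type, column) pair."""

    def __init__(self):
        self._by_key = {}

    def add(self, expectation_type, **kwargs):
        key = (expectation_type, kwargs.get("column"))
        # A second expectation with the same key replaces the first one.
        self._by_key[key] = {"expectation_type": expectation_type, "kwargs": kwargs}

    def expectations(self):
        return list(self._by_key.values())


suite = ToySuite()
suite.add("expect_column_values_to_be_between", column="relative_humidity", min_value=0, max_value=100)
suite.add("expect_column_values_to_be_between", column="relative_humidity", min_value=-5, max_value=105)

print(len(suite.expectations()))                       # 1
print(suite.expectations()[0]["kwargs"]["max_value"])  # 105
```

Only the most recent bounds survive, which matches the single entry in the suite printed above.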

But since an expectation suite is just a JSON file, one can easily build a suite that keeps both expectations by writing or generating a JSON file in the appropriate format directly
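A sketch of generating such a file with the standard json module. The field names mirror the suite printed above; the file name handwritten_suite.json and the temporary directory are made up for this example, and whether validate accepts every hand-written file has not been checked here.

```python
import json
import tempfile
from pathlib import Path

# Two expectations of the same type on the same column, kept side by side.
suite = {
    "ge_cloud_id": None,
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_be_between",
            "meta": {},
            "kwargs": {"column": "relative_humidity", "min_value": 0, "max_value": 100},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "meta": {},
            "kwargs": {"column": "relative_humidity", "min_value": -5, "max_value": 105},
        },
    ],
    "expectation_suite_name": "default",
    "data_asset_type": "MyCustomPandasDataset",
    "meta": {"great_expectations_version": "0.15.2"},
}

# Write to a temporary directory so the example leaves the working directory untouched.
path = Path(tempfile.mkdtemp()) / "handwritten_suite.json"
path.write_text(json.dumps(suite, indent=2))

# Read the file back to confirm that both expectations were written.
loaded = json.loads(path.read_text())
print(len(loaded["expectations"]))  # 2
```

Generating the file from a Python structure rather than typing it by hand keeps the JSON well-formed.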